doi: 10.1002/(SICI)1096-9128(199609)8:7<499::AID-CPE230>3.0.CO;2-1
Large-scale scientific computing requires matching the high-level understanding of how a problem can be solved with the details of its computation in a processing environment organized as networks of processors. Effective utilization of parallel architectures can then be achieved by using formal methods to describe both computations and computational organizations within these networks. By returning to the mathematical treatment of a problem as a high-level numerical algorithm, we can express it in an algorithmic formalism that captures the inherent parallelism of the computation. We then give a meta-description of an architecture and use transformational techniques to convert the high-level description into a program that utilizes the architecture effectively. The hope is that one formalism can describe both computations and architectures, and that a methodology for automatically transforming computations can be developed. The formalism and methodology presented in the paper are a first step toward these ambitious goals. The paper uses a theory of arrays, the Psi calculus, as the formalism, and two levels of conversion: one for simplification and another for data mapping.
Choi, Jaeyoung; Dongarra, Jack J.; Walker, David W.
doi: 10.1002/(SICI)1096-9128(199609)8:7<517::AID-CPE226>3.0.CO;2-W
We propose a new software package for implementing dense linear algebra algorithms on block‐partitioned matrices. The routines are referred to as block basic linear algebra subprograms (BLAS), and their use is restricted to computations in which one or more of the matrices involved consists of a single row or column of blocks, and in which no more than one of the matrices consists of an unrestricted two‐dimensional array of blocks. The functionality of the block BLAS routines can also be provided by Level 2 and 3 BLAS routines. However, on non‐uniform memory access machines the block BLAS permit certain memory access optimizations. This is particularly true for distributed memory machines, for which the block BLAS are referred to as the parallel block basic linear algebra subprograms (PB‐BLAS). The PB‐BLAS are the main focus of this paper. For a block‐cyclic data distribution, a single row or column of blocks lies in a single row or column of the processor template.
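The last point of the abstract above can be illustrated with a minimal sketch of the standard two‐dimensional block‐cyclic mapping, in which block (I, J) is assigned to position (I mod P, J mod Q) of a P x Q processor template. The function names and the zero source offsets are assumptions made for illustration; they are not taken from the PB‐BLAS interface.

```python
def owner(block_row, block_col, proc_rows, proc_cols):
    """Template coordinates (p, q) that own block (block_row, block_col).

    Assumes a plain 2D block-cyclic layout starting at process (0, 0);
    this helper is hypothetical and not part of any PB-BLAS routine.
    """
    return block_row % proc_rows, block_col % proc_cols


def owners_of_block_row(block_row, n_block_cols, proc_rows, proc_cols):
    """Set of template coordinates holding any block of one row of blocks."""
    return {
        owner(block_row, j, proc_rows, proc_cols)
        for j in range(n_block_cols)
    }


if __name__ == "__main__":
    P, Q = 2, 3  # illustrative 2 x 3 processor template
    owners = owners_of_block_row(block_row=5, n_block_cols=8,
                                 proc_rows=P, proc_cols=Q)
    # Every owner shares the same template row, 5 % 2 == 1.
    print(sorted(owners))
```

Because the template row depends only on the block row index, all blocks of one row of blocks fall in one template row, which is the property the restricted block BLAS operands rely on.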
Hui, Chi‐Chung; Hamdi, Mounir; Ahmad, Ishfaq
doi: 10.1002/(SICI)1096-9128(199609)8:7<537::AID-CPE225>3.0.CO;2-X
Distributed systems such as networks of workstations are becoming an increasingly viable alternative to traditional supercomputer systems for running complex scientific applications. A large number of these applications require solving sets of partial differential equations (PDEs). In this paper, we describe the implementation and performance of SPEED (Scalable Partial differential Equation Environment on Distributed systems), a parallel platform which provides an efficient solution for time‐dependent PDEs. SPEED allows the inclusion of a wide range of parameters and programming aids. PVM is employed as the underlying message‐passing system. The parallel implementation has been performed using two algorithms. The first algorithm is a two‐phase scheme which uses the conventional technique of alternating phases of computation and communication. The second algorithm employs a pre‐computation technique that allows overlapping of computation and communication. Both methods yield significant speedups. The pre‐computation technique reduces the communication time between the workstations but incurs additional overhead in buffer management. Hence, if the saving in communication time is larger than the overhead, the pre‐computation technique outperforms the two‐phase algorithm. SPEED also provides a performance prediction methodology that can accurately predict the performance of a given application on the system before running the application. This methodology allows the user to tune various parameters in order to identify system bottlenecks and maximize the performance.
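The trade‐off between the two algorithms described above can be sketched with a simple per‐iteration cost model. This is an assumed toy model, not the performance prediction methodology of SPEED: all function names, parameters, and timings are hypothetical, and it supposes that boundary values are computed first so that their exchange can overlap with the interior computation.

```python
def two_phase_time(t_comp, t_comm):
    """Two-phase scheme: compute everything, then communicate."""
    return t_comp + t_comm


def precomputation_time(t_boundary, t_interior, t_comm, t_overhead):
    """Pre-computation scheme under the assumed overlap model.

    Boundary points are computed first and sent; the interior is then
    computed while the messages are in flight. Only communication that
    outlasts the interior computation remains exposed. t_overhead stands
    for the extra buffer management cost.
    """
    exposed_comm = max(0.0, t_comm - t_interior)
    return t_boundary + t_interior + exposed_comm + t_overhead


if __name__ == "__main__":
    # Hypothetical timings in arbitrary units.
    t_boundary, t_interior, t_comm = 1.0, 9.0, 4.0
    baseline = two_phase_time(t_boundary + t_interior, t_comm)
    for t_overhead in (1.5, 5.0):
        overlapped = precomputation_time(t_boundary, t_interior,
                                         t_comm, t_overhead)
        better = "pre-computation" if overlapped < baseline else "two-phase"
        print(t_overhead, baseline, overlapped, better)
```

In this model the saving in communication time is t_comm minus the exposed communication, so pre‐computation is faster exactly when that saving exceeds t_overhead, which mirrors the condition stated in the abstract.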