A framework for parallel adaptive grid simulations

Department of Computing and Information Sciences, Kansas State University, 234 Nichols Hall, Manhattan,
KS 66506, USA
Exploiting parallelism in the solution of scientific and computational engineering problems
requires significant expertise and effort on the part of application developers. We describe
a framework targeted at the class of discrete-time grid simulations that hides much of the
complexity inherent in the parallel solution of those problems. By doing so this programming
framework offers the potential to reduce developer training and software development time,
and improve the quality of the resulting simulation software. Software engineering concerns
are important; however, they need not stand in the way of high performance. Our framework is
designed to enable efficient solution to simulation problems and to allow developers to address
simulation problems that have a high degree of dynamism and irregularity. We evaluate both
the usability and performance of the framework as applied to a standard finite-difference
computation. 1997 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., Vol. 9(11), 1293–1310 (1997)
No. of Figures: 7.
No. of Tables: 0.
No. of References: 12.
1. Introduction

Computational solution to many science and engineering problems involves representing
the problem as a grid whose point values are updated through a time simulation. At each
step t + 1 in the simulation, a point’s value is a function of its neighbors’ values for the
previous step, t. For large, computationally intensive grid problems, sequential simulations
are impractical. A great deal of work on parallelizing sequential codes using sophisticated
compiler technology has been developed; however, it is limited in effectiveness when
applied to dynamic simulations based on multiple physical models. Such simulations can
be parallelized explicitly by performing data or functional decompositions of the problem.
Many of these problems are naturally partitioned into regions of a grid which may vary
in density of points, vary in time step, vary in convergence rate, vary in size and shape of
region, and vary in computational load. Variation in simulation dynamics occurs in a wide
variety of problems. For example, when doing turbulent reacting flow problems[1,2], it is
often the case that the timescales are very different in various domains of the same flow.
Using traditional time step simulations results in wasted computation time spent updating
the variables in the slow domains. Variation also occurs in large simulations whose design
∗ Correspondence to: M. B. Dwyer, Department of Computing and Information Sciences, Kansas State University, 234 Nichols Hall, Manhattan, KS 66506, USA. (e-mail:
Contract grant sponsor: NSF; Contract grant number: NSF-CDA-9617360.
Contract grant sponsor: NSF Career; Contract grant number: CCR-9703094.
CCC 1040–3108/97/111293–18$17.50
1997 John Wiley & Sons, Ltd.
Received April 1997
Revised July 1997
results from functional decomposition, where different parts of the simulation model have
vastly different simulation equations and timescales. For example, a computer model of
climate may consist of an atmospheric model, a hydrologic model, a land surface model
and an ocean model each with entirely different computations and time step granularity[3].
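To make the update rule described above concrete, a minimal Jacobi-style step over a 2-D grid might be sketched as follows; the class and method names here are ours, purely for illustration, and are not taken from any simulation code discussed in this paper:

```java
// Illustrative sketch: one discrete time step on a 2-D grid, where each
// interior point's new value is a function (here, the average) of its
// four neighbors' values from the previous step.
public class GridStep {
    public static double[][] step(double[][] prev) {
        int n = prev.length, m = prev[0].length;
        double[][] next = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                next[i][j] = prev[i][j];            // boundary values carried over
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < m - 1; j++)
                next[i][j] = 0.25 * (prev[i - 1][j] + prev[i + 1][j]
                                   + prev[i][j - 1] + prev[i][j + 1]);
        return next;
    }
}
```

A sequential simulation simply applies such a step repeatedly until a step bound or convergence condition is reached.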
Developers of simulation software are already familiar with the basic components of
discrete time-stepped simulations. The desire for increased execution performance of these
codes leads to the use of parallelism. While parallelism can dramatically improve performance as well as enable solution of larger problems, the complexity and cost of developing
parallel software is high. Our approach to solving variable, irregularly structured grid
problems is to use parallel discrete event simulation (PDES) techniques[4]. These techniques
address several problems inherent in synchronizing parallel computations in neighboring
sub-regions of a decomposed computation. This approach:
1. employs the ‘lookahead’ concepts of PDES to allow regions to simulate at their
natural rate without excessive barrier synchronization
2. enables dynamic distribution of the regions’ computation to balance load
3. accommodates the irregular size and shape of regions through a flexible dependency
management technique.
We have defined a reusable object-oriented framework to support the construction of such
simulations in Java. We call it the parallel adaptive grid simulation (PAGS) framework.
It consists of abstractions that manage replicated worker style concurrent execution and a
workpool which encodes and manages a variety of dependencies between sub-regions of
the simulation.
Our experience working with members of the Center for Scientific Supercomputing at
Kansas State University suggests that the cost of developing parallel simulation software is
very high. In many cases, when developers have limited expertise with parallel programming technology, it is determined that the risk of undertaking the development of parallel
software is too high and, consequently, the science and computational research suffers. We
believe a high-level reusable framework targeted at the domain of simulation software can
significantly reduce development costs while addressing the performance and adaptability
requirements of large, complex simulations. Our experience applying this framework bears
this out.
In the next Section, we discuss related research on parallel simulation. We then present
an overview of the programming framework that supports the construction of parallel
adaptive grid simulations in Section 3. This framework is built on top of a reusable
coordination abstraction described in Section 3.1. The other primary component of the
framework, described in Section 3.2, is a configurable workpool abstraction that is used to
encode dependences between simulation data. Section 4 illustrates the application of the
framework to the solution of a selected parallel simulation problem and Section 5 evaluates
the effectiveness of the framework for that problem. The final Section summarizes our
contributions and discusses future work.
2. Parallel simulation

Many science and engineering problems are described in continuous form using partial
or ordinary differential equations[3]. To solve them computationally, they are modeled
as difference equations. These difference equations are solved through a discrete time
simulation which iteratively updates grid point values for a fixed number of steps or until a
convergence condition is satisfied. At each time step, the value of a grid point is a function
of its neighboring points. In traditional sequential time step simulations, all points of one
time step must be calculated before the next time step is taken. In parallel simulation, we can
calculate many points in parallel by utilizing parallel shared-memory and/or distributed-memory
MIMD computer architectures. Typically, a region of the grid is associated with a
processor, attempting to match computation grain size with communication time between
regions, and each local region computation is performed in parallel for a time step. All
processors are then synchronized and the parallel computation for the next time step, using
region values from the previous step, is performed. If regions do not use the same time-step
size, converge at different rates, contain differing amounts of computing activity, use variable
time steps, or vary in size or shape over time, then this lock-step method can be inefficient.
The PAGS framework we describe in
this paper can adapt to the problem as these parameters change.
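The lock-step scheme described above can be sketched as follows. This is an illustrative fragment of our own, using a barrier from the modern Java class library for brevity; the point is that every region must reach the barrier before any region may begin the next step, so all regions advance at the rate of the slowest:

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative sketch of lock-step parallel time stepping: one thread
// per region, with a global barrier synchronization after every step.
public class LockStep {
    public static void run(int numRegions, int steps, Runnable regionStep)
            throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(numRegions);
        Thread[] ts = new Thread[numRegions];
        for (int r = 0; r < numRegions; r++) {
            ts[r] = new Thread(() -> {
                try {
                    for (int t = 0; t < steps; t++) {
                        regionStep.run();   // local region computation for step t
                        barrier.await();    // wait for every other region
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            ts[r].start();
        }
        for (Thread t : ts) t.join();
    }
}
```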
Parallel discrete event simulations[4] collect information on the dynamic nature of a
system defined as a collection of inter-dependent simulation entities. Each entity in a
simulation system maintains its own local virtual (simulation) time (LVT). When a region
has simulated activity up to a time when it needs to pass information to its neighbors, it
sends a message stamped with its own LVT to its neighbors. In conservative PDES, each
region can only simulate up to a time for which it knows it will never receive an incoming
message with an older timestamp. Thus, an entity must acquire input from all of its neighbors
before it can simulate. The blocking that is inherent in this scheme can result in deadlock.
One method to avoid deadlock is to have regions periodically send out ‘null’ messages with
a timestamp. This tells the recipient region that it can simulate up to this timestamp before
asking for more input. To apply these ideas to traditional grid simulation problems one can
encode simulation messages with a triple of (t, b, l), where t is the region’s current LVT, b is
the border value to be passed to a neighbor, and l is the lookahead. The lookahead tells the
receiving region how far it can advance its LVT before it gets information again from the
sending region. That is, a region can predict how far it can compute before it will send new
information. This lookahead value is added to the local LVT. Thus, a region can continue
to compute as long as it has a message (border) from all neighbors whose timestamp equals
the minimum of all timestamps on incoming messages. The collection of LVT values thus
forms a global clock which drives the simulation.
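One plausible distillation of the safe-advance rule implied by these (t, b, l) messages is sketched below; the names are illustrative. A region records, for each neighbor, the timestamp and lookahead of the last message received, and may safely advance its LVT up to the minimum of (timestamp + lookahead) over all neighbors:

```java
// Illustrative sketch of the conservative lookahead rule: a region's
// safe simulation horizon is the minimum, over all neighbors, of the
// last received timestamp plus that message's lookahead.
public class LookaheadClock {
    public static int safeHorizon(int[] neighborTimestamps,
                                  int[] neighborLookaheads) {
        int horizon = Integer.MAX_VALUE;
        for (int i = 0; i < neighborTimestamps.length; i++)
            horizon = Math.min(horizon,
                               neighborTimestamps[i] + neighborLookaheads[i]);
        return horizon;   // the region may simulate up to this time
    }
}
```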
The temporal dependences that are fundamental to discrete event simulation must be
honored in order to produce a correct result. So too must other problem-specific dependences, such as data dependences. There has been a considerable amount of work on
statically computing the dependence structure for a given problem, e.g. [5]. With accurate
dependence information in hand one can determine an efficient scheduling of sub-problems.
Our approach aims at finding an efficient schedule for sub-problems by maintaining a
representation of non-temporal dependences and using PDES to incorporate temporal dependence. This approach is more flexible than static approaches since the dependence
representation and lookaheads can be modified at runtime to adapt to the changing computational dynamics of a problem.
Figure 1. PAGS architecture: a layered stack comprising the Simulation Application; the Replicated Workers Coordination Abstraction; Java (lang, util, net); the Java Virtual Machine; and a distributed and/or shared-memory computing platform
3. The PAGS framework

In this Section we provide an overview of the PAGS framework. We begin by describing how
the major components of the framework inter-operate. We then provide additional details
describing the semantics and interfaces of those components. In defining this framework,
we are faced with a tension between ease of use and efficiency of implementations. Our
strategy to address these concerns has been:
1. to provide developers with an interface that provides the illusion of sequential
computation
2. to expose implementation details that are performance critical to the developer.
In doing this, we are able to hide significant complexity from the developer while allowing
them the ability to hand-tune their application for good performance. Thus, PAGS can be
viewed as an open implementation[6] that provides a simulation problem interface and a
performance tuning interface.
3.1. Architecture
Figure 1 depicts the major components of the PAGS framework built on top of a Java
infrastructure. PAGS is designed to be configurable to take advantage of shared-memory,
distributed-memory or mixed-memory execution platforms. In distributed- and mixed-memory
configurations an instance of the Java virtual machine will be active on each node.
The replicated workers coordination abstraction (RWCA) implements a reusable
pattern of parallel execution. The abstraction provides thread/process management, synchronization, communication and termination capabilities in a way that is independent of
application data and computation. Applications parameterize the RWCA with data and
computational elements to define complete parallel computations. An additional parameter
allows users to control the semantics of the workpool. When instantiating a coordination
abstraction, application developers may choose to configure the degree of parallelism, the granularity and the parallel execution model to be used. Parallelism is controlled by specifying the
maximum number of worker threads/processes that may execute. Granularity is controlled
by allowing for block-transfers of multiple data elements to/from workers. Finally, the
execution model for the computation may be either shared-memory or distributed-memory.
The coordination abstraction is defined to allow for composition of instances; thus we
can have a mixed-model computation where a higher-level distributed-memory abstraction
dispatches work data to its workers, who are themselves shared-memory abstractions.
The associative workpool (AW) is an abstract data structure that is compliant with
the coordination abstraction’s pool interface requirements. The AW is designed to manage
dependences between sub-problems in a computation. It provides a framework within which
problem convergence, inter-region data flow and discrete simulation time are all managed.
Applications parameterize the AW with initial sub-problem values and a description of the
dependence structure of the sub-problems. These dependences can represent data, control,
causal or time-related constraints, depending on the nature of the simulation. The AW
imposes minimal constraints on the problem dependence structure; it need only be defined
as a collection of sub-problem relations. One can think of the AW as a parameterizable
scheduler; once a sub-problem is determined to be ready it can be dispatched through the
RWCA to have its computation performed.
Our philosophy in defining the PAGS framework is to provide support for functionality
that is required in a broad class of parallel discrete-time simulations. There is no doubt
that special variants of PAGS could be defined to provide effective support for specific
simulation domains.
3.2. Replicated workers coordination abstraction
The notion of a coordination abstraction is very broad. For example, coordination
abstractions can be defined to capture the essence of peer-group, master–slave, heartbeat,
tree, ring and pipelined computations. In this Section, we describe one flexible coordination
abstraction that is a key component of the PAGS framework.
Collections of replicated workers[7] can be applied to solve a wide variety of problems
in parallel. Figure 2 illustrates the topology of a collection of replicated workers. All
replicated workers computations share the notion of a group of computational elements, Ci,
mapped to separate processes, called workers. Sub-problem data, Din, are fed as inputs
to the computation and are stored in a shared repository, called a workpool. Execution
proceeds as workers repeatedly consume, process and produce data elements stored in the
workpool. Conceptually, the computation terminates when all of the data elements have
been consumed, at which point the result of the computation is returned. Alternative termination semantics can also be supported; for example, one could allow individual workers
to signal for the termination of the computation depending on the results of a local computation. Worker computations in the replicated workers abstraction are typically thought
of as being identical, but this need not be the case. For example, one can define a heterogeneous collection of replicated workers where the work computation is determined by the
type of the data element being processed. Replicated worker computations can be defined
to operate in either shared-memory or distributed-memory environments. From the user’s
perspective there is little difference between the two, although design and implementation
decisions will differ in order to achieve high performance.
Figure 2. Replicated worker abstraction
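A deliberately small sketch of this consume/process/produce cycle, including the termination rule in which the computation ends when the pool is empty and no worker is mid-processing, might look like the following. This is our own illustration, not the RWCA implementation; the names and the toy work items are hypothetical:

```java
import java.util.Vector;

// Illustrative replicated workers: each worker repeatedly consumes an
// item from a shared pool, "processes" it, and may produce a new item.
// Termination: the pool is empty and no worker is still processing.
public class MiniWorkers {
    private final Vector<Integer> pool = new Vector<>();
    private int busy = 0;        // workers currently processing an item
    private int processed = 0;   // total items consumed

    private synchronized Integer take() throws InterruptedException {
        while (pool.isEmpty() && busy > 0) wait();
        if (pool.isEmpty()) { notifyAll(); return null; }  // terminated
        busy++;
        return pool.remove(0);                             // consume
    }

    private synchronized void finish(Integer produced) {
        busy--;
        processed++;                                       // "process"
        if (produced != null) pool.add(produced);          // produce
        notifyAll();
    }

    // Toy computation: an item n > 1 produces follow-on item n - 1.
    public static int run(int nWorkers, int seed) throws Exception {
        MiniWorkers w = new MiniWorkers();
        w.pool.add(seed);
        Thread[] ts = new Thread[nWorkers];
        for (int i = 0; i < nWorkers; i++) {
            ts[i] = new Thread(() -> {
                try {
                    Integer item;
                    while ((item = w.take()) != null)
                        w.finish(item > 1 ? item - 1 : null);
                } catch (InterruptedException e) { /* ignore in sketch */ }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return w.processed;
    }
}
```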
3.3. Replicated workers in Java
We describe the replicated workers abstraction as the collection of Java class and interfaces
sketched in Figure 3. A more detailed presentation of this coordination abstraction is given
in [8].
To define the meaning of the RWCA interface we map components of the abstract view of
the coordination abstraction illustrated in Figure 2 onto parts of the interface. The data, Di ,
stored in the workpool and processed by the workers are defined as a class that implements
package ca.replicatedworkers;

public interface WorkPool {
  public abstract int size();
  public abstract Object take();
  public abstract void add(Object o);
}

public interface Work {
  public abstract boolean doWork(Vector newWork,
                                 Vector results);
}

public interface Result {
  public abstract boolean doResult();
}

public class Configuration {
  public final static int EXCLUSIVE = 1, ..., NONE = 3;
  public final static int SYNCHRONOUS = 1, ...;
  public int resultSemantics;
}

public final class ReplicatedWorkers {
  protected boolean done;

  public ReplicatedWorkers(Configuration theConfig,
                           WorkPool workCollection,
                           WorkPool resultCollection,
                           int numWorkers, int numItems) {...}
  public final synchronized void putWork(Vector v) {...}
  public final synchronized void putWork(Work w) {...}
  public final synchronized Vector getResults() {...}
  public final synchronized Result getResults() {...}
  public final synchronized void execute() {...}
  public final synchronized void awaitTermination() {...}
  public final synchronized void destroy() {...}
}

Figure 3. Java replicated worker interfaces and class
the Work interface. All such classes also provide the doWork method which corresponds to
the Ci in the Figure. Result data, Results, are processed independently from workpool data
and are defined by a class that implements the Result interface. These classes provide
the doResult method which corresponds to the result processing part of Ci . The pool in
the Figure is defined by a class that implements the WorkPool interface. This interface
provides the minimal capabilities needed to implement the semantics of the replicated
workers coordination abstraction.
The operations of the ReplicatedWorkers class allow for creation, configuration and
destruction of a replicated workers computation. Such a computation can be initialized by
inserting work data into the pool via calls to putWork. After initialization the computation
can be started by calling execute. If the abstraction is configured to have synchronous
semantics, then this call will not return until the computation terminates. Under asynchronous execution semantics, execute calls return immediately and user threads can poll for
results, by calling getResults, or block for termination, by calling awaitTermination.
A computation is said to terminate if all workers are blocked waiting for more work to
process and there is no more work in the pool. Alternatively, the doWork or doResult methods can
return a true value which will initiate the termination of the computation. Explicit result
processing is not required by this abstraction. For example, for a NONE configuration of
resultSemantics the result WorkPool may be null and no calls to doResult will be
made. Computations of this type are free to leave results in the work WorkPool which will
be available to the application upon termination.
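For example, an application might parameterize the abstraction with a Work implementation such as the following sketch. The Work interface is redeclared locally here so the fragment is self-contained; SquareWork is a hypothetical work item of our own, not part of the framework:

```java
import java.util.Vector;

// Local redeclaration of the Work interface from Figure 3, so this
// fragment compiles on its own; the real one lives in
// package ca.replicatedworkers.
interface Work {
    boolean doWork(Vector newWork, Vector results);
}

// A toy work item: squares its value into the results vector.
class SquareWork implements Work {
    final int value;
    SquareWork(int v) { value = v; }
    public boolean doWork(Vector newWork, Vector results) {
        results.add(value * value);  // produce a result
        return false;                // do not request termination
    }
}
```

Under a NONE result configuration, such items could instead leave their output in the work pool itself, as described above.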
3.4. An associative workpool
The associative workpool is a data structure that manages dependences between sub-problems
in a grid simulation. Dependences can arise from a number of different sources,
including data flow and causal and temporal relationships. Data will flow across sub-problem
boundaries in a simulation to allow convergence of a global solution, as opposed
to a collection of local solutions. Causal orderings are induced when simulation entities
represent real-world events that stand in some cause–effect relationship. Temporal relationships are expressed in traditional grid computations as dependences across iterations that
correspond to different time steps. A correct simulation, and consequently a correct computed result, arises when all problem dependences are honored. The associative workpool
encodes dependences, dynamically tracks the state of all sub-problem dependences and releases sub-problems for computation when their dependences are honored. In this way, the
AW works much like an adaptive scheduling mechanism for sub-problem computations.
3.5. Partitions, borders and shadows
The AW takes a problem-independent view of sub-problems and their dependences. There
are three components to this view:
1. Partitions are the set of sub-problems that constitute the simulation. A partition is
considered to be a unit of data and associated computation, and implicitly defines the
finest granularity of the simulation.
2. Borders are those portions of a partition that are referenced by the computation
performed in some neighboring partition. Depending on the simulation, problem
borders may range from single ‘representative’ values to the entire partition. In
addition, we may define the portion of a partition’s border that is shared with a
designated neighbor and think of a partition as having a collection of borders.
3. Shadows are copies of neighboring border values stored locally in a partition so as
to enable complete local computation of a simulation step. Shadows are typically
defined to be easily integrated with the representation of local partition data so as to
allow seamless computation of simulation steps.
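As an illustration of these three components, a 1-D partition with two shadow cells could be sketched as follows. StripPartition and its methods are hypothetical names of our own, not part of PAGS:

```java
// Illustrative 1-D partition: interior cells plus two shadow cells
// copied from the neighbors' borders, so that a simulation step can be
// computed entirely locally.
public class StripPartition {
    final double[] cells;   // cells[0] and cells[n+1] are shadows; 1..n interior

    StripPartition(int interior) { cells = new double[interior + 2]; }

    // Border values that this partition's neighbors read.
    double leftBorder()  { return cells[1]; }
    double rightBorder() { return cells[cells.length - 2]; }

    // Install fresh copies of the neighbors' border values.
    void updateShadows(double fromLeft, double fromRight) {
        cells[0] = fromLeft;
        cells[cells.length - 1] = fromRight;
    }
}
```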
Figure 4 illustrates these components and the flow of data between them for an irregularly
shaped collection of three partitions. Dependences are expressed as a directed graph of
partitions and borders. Traditional data dependences and causal dependences are expressed
as individual directed edges between a border and the partition that depends on it; temporal
dependences are handled by an explicit representation of simulation time.
Figure 4. Workpool components: problem domain decomposition; partitions, shadows and elements
3.6. Associative workpool in Java
We describe the associative workpool abstraction as a collection of Java classes and interfaces sketched in Figure 5. The PAGS framework provides interfaces and partially defined
compliant base classes that define the requirements of application-defined partitions and
borders. The philosophy of PAGS is to leave performance-critical decisions in the hands of
the simulation developer. For this reason, users must define the concrete data types for partitions and borders. In addition, they must define the partition work method, called doWork,
and the border creation and shadow updating routines, getBorder and updateShadow.
The AssociativePool complies with the replicated workers coordination abstraction
WorkPool interface by providing add, size and take operations. Unlike common collection
types, an instance of the AssociativePool is immutable. Once created it manages a
fixed set of Partitions. Its collection-like operations only reflect the number of partitions
that are READY to have their local computations performed. We note that the separation
of Border and Partition types serves to implement a kind of double-buffering, thus
allowing computation on a Partition and use of its border values to update a neighbor’s
shadows to proceed in parallel. Finally, we note that the updateReadyPartitions is
package pags;
import ca.replicatedworkers.*;

public interface Indexable {...}

public class IndexableImpl implements Indexable {
  private int theIndex;
  public void setId(int i) { theIndex = i; }
  public int Id() { return theIndex; }
}

public interface Partition {...}

public class PartitionImpl extends IndexableImpl
                           implements Partition, Work {
  private int maxLA;
  protected int borderLA;
  protected int runTo = maxLA;
  protected boolean doneComputing;

  public PartitionImpl(int m) { maxLA = m; }
  public boolean doWork(...) {...}
  public boolean done() { return doneComputing; }
  public Vector neighbors() { return new Vector(); }
  public void setLookahead(int i) { borderLA = i; }
  public Border getBorder(Partition p) { return null; }
  public int minShadowLookahead() { return runTo; }
  public void resetShadowLookahead() { runTo = maxLA; }
  public void updateShadow(Border b) {}
}

final class PartitionInfo {
  protected int lvt;
  protected int status;
  protected boolean done;
  protected Partition thePartition;
}

final public class AssociativePool implements WorkPool {
  private Partition[] work;
  private PartitionInfo[] workInfo;

  public AssociativePool(Vector v, int waterMark) {...}
  public final synchronized int size() {...}
  public final synchronized Object take() {...}
  public final synchronized void add(Object o) {
    Partition p = (Partition)o;
    workInfo[p.Id()].lvt += p.minShadowLookahead();
    if ( waterMark > size() )
      ...
  }
  public final synchronized void updateReadyPartitions() {
    ... for all partitions waiting for neighbor values ...
      vEnum = current.neighbors().elements();
      while ( vEnum.hasMoreElements() ) {
        neigh = (Partition)vEnum.nextElement();
        nId = neigh.Id();
        if ( workInfo[nId].lvt >= workInfo[curId].lvt ||
             workInfo[nId].status == CONVERGED ) {
          ...
        } else {
          currentReady = false;
        }
      }
      if (currentReady) {
        workInfo[curId].status = READY;
      }
  } // end void updateReadyPartitions()
}

Figure 5. Java associative workpool interfaces and class
factored out of the take routine to allow updating of AssociativePool contents in parallel with Partition computations, thereby providing a mechanism for hiding workpool management overhead.
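The readiness rule applied by updateReadyPartitions can be distilled into a small stand-alone sketch of our own: a partition is ready when every neighbor has either advanced its LVT at least as far, or has already converged.

```java
// Distilled readiness test (illustrative, not the AssociativePool code):
// a partition with local virtual time 'lvt' is READY when each neighbor
// has an equal or greater LVT, or has CONVERGED.
public class Readiness {
    public static boolean ready(int lvt, int[] neighborLvts,
                                boolean[] neighborConverged) {
        for (int i = 0; i < neighborLvts.length; i++)
            if (neighborLvts[i] < lvt && !neighborConverged[i])
                return false;
        return true;
    }
}
```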
4. Applying the framework

In this Section we describe our experiences applying the PAGS framework to the solution
of a grid simulation. This application involves the solution of a common class of grid
problems that involve heartbeat or systolic computations. Traditionally, such problems are
solved in parallel by mapping grid-points, or clusters of points, onto processes, then configuring the processes to communicate according to the grid topology. For many problems,
however, this can lead to significant amounts of redundant and unnecessary computation.
Furthermore, when grid topology is irregular or dynamic the traditional solution requires
some modifications to avoid idle processors. Such problems are easily mapped onto the
PAGS framework. To illustrate how this can be done we describe a 4-point stencil-based
successive over-relaxation (SOR) problem which has regular, static topology. Two Java
implementations for this problem are sketched in Appendix I. The code in Section I.1 is
a sequential implementation in Java. The triply nested loop is characteristic of this kind
of convergence-based 2-dimensional finite-difference computation. The bottom of that
Section shows the implementation of the PAGS work object. It is defined to represent
dynamically sized partitions of a grid, called a SubGrid. Note that the 2-dimensional computation kernel is almost identical to the sequential version save for its loop bounds. The
outer loop in the doWork computation allows a SubGrid to execute multiple simulation
time steps in isolation; this is only allowed when its neighboring SubGrids are running
ahead in simulation time.
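A hedged sketch of such a kernel is given below. SorKernel and its parameter names are ours, not those of Appendix I; the steps bound stands in for the lookahead-derived limit on how many time steps a SubGrid may take in isolation:

```java
// Illustrative relaxation kernel: the inner loop pair is the 4-point
// stencil step over the partition's interior (the grid carries a
// one-cell halo/shadow ring); the outer loop lets the partition advance
// several time steps before exchanging borders.
public class SorKernel {
    public static double relax(double[][] grid, int steps) {
        double maxChange = 0.0;
        for (int t = 0; t < steps; t++) {
            maxChange = 0.0;
            for (int i = 1; i < grid.length - 1; i++)
                for (int j = 1; j < grid[0].length - 1; j++) {
                    double v = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1]);
                    maxChange = Math.max(maxChange, Math.abs(v - grid[i][j]));
                    grid[i][j] = v;   // in-place relaxation update
                }
        }
        return maxChange;   // used to test convergence
    }
}
```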
Section I.2 illustrates the configuration of the PAGS AssociativePool, which is constructed with a vector of SubGrids that have been initialized in terms of their neighboring
SubGrids and local data values. The replicated worker abstraction is configured to execute
synchronously with no result processing. The replicated workers object is created with
numProcessors workers, where each worker takes numIn elements of work from the pool
at a time.
5. Evaluation

In this Section, we evaluate the application of PAGS to the SOR problem. We focus both
on qualitative and quantitative aspects of our efforts. We conclude with our assessment of
the potential of Java as a platform for high-performance scientific computing.
5.1. Qualitative evaluation
The goal of PAGS is to make it easier to program parallel grid simulations relative to
alternative parallel programming technologies. There are four different aspects of this that
we consider here: learning curve for PAGS interfaces, cost of mapping applications to
PAGS, ease of tuning performance and quality of the resulting applications.
In working with the RWCA, it has been our experience that the sequential nature of the
doWork method of Work objects allows developers to understand the framework quickly.
The fact that the interfaces are small also appears to be of benefit; there are eight methods
in the ReplicatedWorkers object and a total of five methods in the three interfaces.
The primary obstacle to ease of understanding in our experience has been the meaning
of the Configuration settings. For the SOR problem this was not a problem since the
default settings were suitable. The AssociativePool introduces an additional layer of
complexity on top of the RWCA. One difficulty in using this structure is the need to define
explicit Border objects, although for the SOR problem this was relatively straightforward.
In addition, in the current version of PAGS users must configure the partition topology
manually. This is one area for future work, since one can clearly automate the generation
of partition topology for common classes of problems, e.g. stencil problems.
We found the mapping of the SOR problem to the PAGS framework to be relatively
easy. Working from an existing threads-based implementation of this problem it took the
authors 45 min to implement, compile and run the framework-based solution sketched in
Section I.1. Thus, it appears that once the PAGS framework is understood it can effectively
be used to rapidly construct parallel versions of well-understood sequential simulations.
In the next sub-Section we discuss the effects of using the PAGS performance tuning
parameters. While all of the implementations discussed below allow for tuning, for most
of them such tuning is possible only because the implementations are monolithic. When
component-based implementations are constructed, using for example the RWCA or PAGS,
performance tuning is only possible when the components are explicitly designed to expose
performance-related attributes. The numWorker, numIn, waterMark and SubGrid.size
parameters are easily accessed and modified through PAGS interfaces.
Finally, in constructing multiple implementations of the SOR problem we learned that
the reuse of PAGS components can provide for higher-quality solutions. For example, in
one of the hand-built solutions there was a coding error that overwrote a border element
value at the wrong time. The PAGS implementation did not suffer from this problem since
the AssociativePool controls the timing of border construction after which borders are
held internally in a Partition. Another hand-built solution exhibited a deadlock; neither
the RWCA nor PAGS implementations suffered from this problem because all concurrency-related
processing is internal to the framework components.
5.2. Quantitative evaluation
We are primarily interested in understanding the performance penalty, if any, that will
be incurred by using a problem-independent framework, like PAGS, rather than a hand-built
problem-specific solution. To do this we compare the runtimes of implementations
built with the PAGS framework and those built with different programming techniques for
the same problem. We are also interested in determining whether PAGS applications are
capable of effectively exploiting available problem parallelism as the number of processors
grows. Towards these ends we implemented six different versions of the SOR application
described in Section 4. Each of the implementations was written to accept the total problem
size, a value to control convergence, and, where appropriate, partition sizes and degree of
solution parallelism. The implementations are:
1. Sequential C – consists of a triple loop nest where the outer loop continues until
convergence and the inner pair of loops carries out a 2-dimensional relaxation step.
Plots for this implementation are denoted sequential-C.
2. Sequential Java – essentially the same as the C version. The code is given in
Section I.1. Plots for this implementation are denoted sequential-Java.
3. Hand-built thread-per-partition Java – parallelization of outermost loop with barrier
synchronization at the end of each iteration. Inner loop ranges are adjusted appropriately depending on the partition size. Plots for this implementation are denoted
4. Hand-built adaptive Java – this implementation uses a data structure that was implemented to manage the partition dependences in a 4-point stencil. As partitions
are computed their neighboring partitions may become ready for their next computation step, in which case they are dispatched to a thread and executed. Plots for this
implementation are denoted adaptive.
5. Adaptive RWCA – the data structure from the hand-built adaptive implementation
was implemented as a WorkPool and used to instantiate the RWCA. Plots for this
implementation are denoted rwca.
6. Adaptive PAGS – this is the implementation described in Section 4. We note that the
dependence management data structure in PAGS is more general than the one used
in the previous two implementations; it can handle any graph of dependences. Plots
for this implementation are denoted pags.
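The dependence management that distinguishes implementations 4–6 can be sketched as a readiness counter for the 4-point stencil: a partition becomes eligible for its next step once all of its neighbors have completed the previous one. The class below is our own illustrative reconstruction; every name in it is hypothetical and not taken from the PAGS or RWCA sources.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of dependence counting for a 4-point stencil.
// A partition is enqueued as ready when all of its neighbors have
// finished the previous step. Names are illustrative only.
public class StencilReadiness {
    private final int side;            // partitions per grid side
    private final int[] satisfied;     // satisfied-dependence counts
    private final int[] needed;        // 2, 3 or 4, depending on position
    private final Queue<Integer> ready = new ArrayDeque<>();

    public StencilReadiness(int side) {
        this.side = side;
        satisfied = new int[side * side];
        needed = new int[side * side];
        for (int r = 0; r < side; r++)
            for (int c = 0; c < side; c++)
                needed[r * side + c] =
                    (r > 0 ? 1 : 0) + (r < side - 1 ? 1 : 0) +
                    (c > 0 ? 1 : 0) + (c < side - 1 ? 1 : 0);
        // Initially every partition may run its first step.
        for (int p = 0; p < side * side; p++) ready.add(p);
    }

    // Called when partition p finishes a step: credit each neighbor,
    // enqueueing any whose dependences are now all satisfied.
    public void finished(int p) {
        int r = p / side, c = p % side;
        if (r > 0) credit(p - side);
        if (r < side - 1) credit(p + side);
        if (c > 0) credit(p - 1);
        if (c < side - 1) credit(p + 1);
    }

    private void credit(int q) {
        if (++satisfied[q] == needed[q]) { satisfied[q] = 0; ready.add(q); }
    }

    public Integer next() { return ready.poll(); }  // null when none ready
}
```

In a parallel setting the `ready` queue would be a synchronized work pool from which worker threads draw; PAGS generalizes this bookkeeping to arbitrary dependence graphs.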
Each of these implementations was constructed by a developer with at least one year of
threads programming experience; the implementations each underwent a small amount of
performance tuning.
The platform we used in conducting our evaluations is a SUN Enterprise 4000 with ten
168MHz UltraSparc processors and 512Mbytes of RAM running Solaris 5.5.1. We compiled
all Java code with Toba[9] version 1.06b using the SUN C compiler; we used the same C
compiler for the C implementation. Toba translates Java code to C code and then compiles
the C code and links it with libraries that enforce Java’s runtime semantics. The version
of Toba we used provided a nearly complete native implementation of the JDK 1.0.2.
On machines running Solaris, Toba translates Java threads into SolarisThreads[10]. By
default all threads are mapped onto a single Solaris process and are scheduled internal to
that process. To take full advantage of the Enterprise multi-processor we modified thread
creation to map a single thread to a single Solaris process. Furthermore, each Solaris process
was then bound to a separate processor; if the number of processes exceeds the number
of processors we bind them in a round-robin fashion to spread processes evenly. All of
these machine-specific details are encapsulated in a runtime library and a single Java native
method. Thus, our code maintains its portability while providing for high performance in
a Solaris environment.
We ran each implementation for four different problem sizes with various partition sizes
using all ten available processors. Note that our scaling of the problem corresponds to
increasing the size of the modeled solid rather than a refinement of the simulation’s mesh.
The plots in Figure 6 give the comparative performance of the implementations. The data are
for four different problem sizes: 64 by 64, 128 by 128, 256 by 256 and 512 by 512 grids. The
sequential C version beats the sequential Java version by nearly a factor of two, primarily
because of array bounds checks in Java. More sophisticated Java compiler technology
might be able to remove such checks in cases where array bounds and index expression
ranges are statically bound, as in the case of our SOR codes. In spite of the additional cost
of bounds checking we observe that, as expected, the parallel versions beat the sequential
versions. It is also the case that as problem sizes increased the adaptive solutions appear
to be faster than the synchronous iteration solution; this is to be expected in simulations
for which only a small portion of the problem is in flux during any simulation step. Of
the adaptive versions the best performer depends on problem size. For all but the largest
problem size, the plots exhibit a characteristic ‘upward cup’ shape as partition size is varied.
This can be understood as a balancing of communication, computation and parallelism. For
overly fine partitions the overhead of communication yields reduced overall performance.
For overly coarse partitions there is insufficient parallelism and overall performance is
reduced. The troughs of the curves represent a good balance between the communication
and computation required to solve the instances of the SOR problem. PAGS provides a
simple mechanism for controlling partition size, thereby enabling application developers
to easily find a good balance of those problem-specific factors.
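The balance described above can be made concrete with a toy cost model. The formula and the constants below are our own illustrative assumptions, not measurements from the paper: per-iteration time is modeled as total point-update work divided over the usable workers, plus a fixed per-partition communication/synchronization charge.

```java
// Illustrative cost model (our own, not from the paper) for the
// 'upward cup' curve: per-iteration time for an n-by-n grid split
// into s-by-s partitions on w workers. cPoint and cPart are arbitrary
// placeholder costs for one point update and one partition's overhead.
public class PartitionCostModel {
    public static double time(int n, int s, int w,
                              double cPoint, double cPart) {
        int parts = (n / s) * (n / s);            // number of partitions
        double compute = (double) n * n * cPoint  // total point updates...
                         / Math.min(w, parts);    // ...spread over workers
        double overhead = parts * cPart;          // per-partition comm/sync
        return compute + overhead;
    }

    public static int bestPartitionSize(int n, int w,
                                        double cPoint, double cPart) {
        int best = 0;
        double bestT = Double.MAX_VALUE;
        for (int s = 1; s <= n; s *= 2) {         // power-of-two sizes
            if (n % s != 0) continue;
            double t = time(n, s, w, cPoint, cPart);
            if (t < bestT) { bestT = t; best = s; }
        }
        return best;
    }
}
```

Evaluating the model over a range of partition sizes reproduces the trough: very small partitions are dominated by overhead, very large ones starve the workers of parallelism.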
In these runs the PAGS version only consumed a single work element at a time. We found
that for this problem this provided the best performance. Processing of multiple independent
partitions by a worker only served to sequentialize the computation and reduce performance,
although this is not the case for all problems[11].
Figure 6. Comparative performance of SOR implementations (four panels plotting elapsed time in seconds against partition size)
Figure 7. Scaling performance (speedup plotted against number of workers/processors)
It is clear from these results that for some problems dependence-directed computation
is a significant benefit. PAGS provides support for this as well as for parallelism. To try to
assess the benefits of parallelism, as opposed to the dependence-directed feature of PAGS,
we selected good performing partition sizes and varied the degree of parallelism. Figure 7
plots the speedup vs. number of processors. Lines on the plot are annotated with the partition
size and the problem size, e.g. rwca:64,256 is for 64 by 64 partitions of a 256 by 256
grid solved with the adaptive RWCA implementation. Note that the apparent super-linear
speedup exhibited by the PAGS implementations is a result of both the dependence-directed
computation and the AssociativePool watermarks that minimize the number of pool
updates required. We note that each pool update for the PAGS pool is more expensive
than for the hand-built and RWCA data structures. Since problem dependences determine
the maximum amount of available parallelism, once that number of processors is applied
to the PAGS solution there is no further speedup. Without the AssociativePool, the
RWCA version is slower to achieve higher performance because of excessive overhead in
managing sub-problem dependences. As parallelism in the PAGS implementation nears the
available parallelism in the problem, however, the numbers of pool updates required for the PAGS and RWCA versions converge. Since each update is more expensive for the PAGS implementation, its performance lags behind the RWCA version and its speedup curve tails off.
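The watermark effect can be illustrated with a toy staging buffer. This is entirely our own sketch, not the actual AssociativePool interface, which the paper does not list: updates are staged locally and pushed to the shared pool in one synchronized operation once a watermark count is reached, trading a little latency for far fewer expensive pool updates.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not the PAGS AssociativePool) of watermark-based
// batching: items are staged locally and transferred to the shared
// pool in one synchronized call once 'waterMark' of them accumulate.
public class WatermarkBuffer<T> {
    private final List<T> staged = new ArrayList<>();
    private final List<T> pool;      // stands in for the shared pool
    private final int waterMark;
    private int flushes = 0;         // synchronized pool updates issued

    public WatermarkBuffer(List<T> pool, int waterMark) {
        this.pool = pool;
        this.waterMark = waterMark;
    }

    public void put(T item) {
        staged.add(item);
        if (staged.size() >= waterMark) flush();
    }

    public void flush() {
        if (staged.isEmpty()) return;
        synchronized (pool) { pool.addAll(staged); }  // one update, many items
        staged.clear();
        flushes++;
    }

    public int flushCount() { return flushes; }
}
```

With a watermark of 4, ten puts cost three pool updates instead of ten, which is the flavor of saving the measurements above attribute to the AssociativePool watermarks.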
5.3. Observations
We believe that the Java language can serve as a vehicle for parallel scientific computing. Java supports both shared-memory and distributed-memory models with threads and
monitor structures and the package, respectively. Native compilers for Java,
such as Toba, are becoming available with support for true-parallel execution of threaded
applications, and these compilers are rapidly maturing to provide high levels of program
optimization. To leverage investment in existing codes and to provide for language interoperability, the Java designers have defined the Java Native Interface (JNI). The JNI supports bi-directional interoperability between Java and C/C++ program fragments. This will enable use of existing scientific libraries, such as PETSc[12], in computations embedded in
our framework, for example in doWork calls of simulation Partition objects.
Based on our experience we believe that using PAGS saves time and effort in developing
parallel simulation software. Savings are gained by reuse of coordination abstraction and
associated workpool components. One significant advantage of the object-oriented nature
of PAGS, that is not exploited for the SOR problem, is its support for heterogeneity in data
and computational components of an application. Because all of the framework components
are implemented in terms of interfaces, the actual partition data used in a simulation need
not all be of the same type. Thus, we can have widely differing concrete representations
and local computations for different sub-regions of a problem. The flexibility afforded by
this heterogeneity does not increase development effort or execution time.
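The heterogeneity argument can be illustrated with two differing concrete partition representations behind one interface. The interface shown is a stand-in of our own devising, not the actual PAGS Partition interface: a dense array-backed sub-region and a storage-free constant sub-region coexist in the same simulation.

```java
// Illustrative only: two concrete partition representations behind a
// single interface, echoing the heterogeneity argument above. The
// GridPartition interface is hypothetical, not the PAGS Partition.
public class HeterogeneousPartitions {
    interface GridPartition {
        double relaxPoint(int i, int j);   // one 4-point stencil evaluation
    }

    // Dense representation: a plain 2-D array with a boundary halo.
    static final class DensePartition implements GridPartition {
        final double[][] g;
        DensePartition(double[][] g) { this.g = g; }
        public double relaxPoint(int i, int j) {
            return (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4.0;
        }
    }

    // Quiescent region: a single value, no per-point storage at all.
    static final class ConstantPartition implements GridPartition {
        final double value;
        ConstantPartition(double value) { this.value = value; }
        public double relaxPoint(int i, int j) { return value; }
    }
}
```

Because the framework sees only the interface, both representations can be mixed freely across sub-regions of one problem.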
We have identified a class of complex dynamic grid computations that arise in a variety
of scientific and computational engineering settings. Constructing software to implement
parallel simulation of these computations is a costly endeavor, requiring that developers
have or acquire significant experience in parallel programming technologies. Even then the
potential for introducing subtle faults into parallel software, such as deadlocks, makes a
correct parallel solution more difficult to produce than a sequential solution. Focusing on
this specific domain of discrete-time simulations enabled identification, design and implementation of a collection of reusable software components called the PAGS framework.
Our experience leads us to believe that PAGS enables high-quality simulation software
to be developed more cheaply and without an excessive negative impact on performance
relative to custom-built simulations.
PAGS has been applied to implement some well-understood grid problems with good
success. We are working to apply it to significantly more complex problems, involving:
1. dynamic size and shape changes in partitions. By monitoring the amount of computation being performed in different partitions over the course of a simulation we
can vary the size of partitions to adapt to the changing nature of the computation
wavefront. The correctness of the simulation can be assured by using lookaheads to
pause the computation in the partitions that are being refined.
2. distributed-memory and heterogeneous simulation. We have also developed a distributed-memory version of PAGS. A worker executing in such a simulation can also be a
shared-memory PAGS computation executing on a shared-memory multi-processor.
In this way, a hierarchy of distributed and shared computations can co-operate to
further improve simulation performance.
3. meta-processing of partitions. In distributed-memory PAGS, partition migration can
be used to achieve load-balancing. This same approach can be used to fault dormant
portions of the computation to disk, enabling PAGS to accommodate massive-memory simulations.
4. higher-level programming support. We can provide additional abstractions layered
atop PAGS that automate the process of specifying sub-problem topology. For common classes of problems, such as stencil-based relaxation problems, this is straightforward. In general, topology can be extracted from sequential codes using inspector
loops whose bodies configure the PAGS framework.
To evaluate the enhanced PAGS framework we are using it to solve challenging real-world
simulation problems in the areas of non-steady state fluid dynamics, turbulent reacting flow,
and modeling of the thermo-mechanical response of soldered components.
This work was supported in part by NSF-CDA-9617360 and NSF CAREER award CCR-9703094.
1. S. Oran and J. Boris, Numerical Simulation of Reactive Flow, Elsevier, New York, NY, 1987.
2. R. Fox, ‘Computational methods for turbulent reacting flows in the chemical process industry’,
Revue de l’Institut Francais du Petrole, 51(2), (1996).
3. I. Foster, Designing and Building Parallel Programs, Addison-Wesley, Reading, MA, 1995.
4. J. Misra, ‘Distributed discrete event simulation’, ACM Comput. Surv., 18(1), 29–55 (1986).
5. M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Reading,
MA, 1996.
6. G. Kiczales, ‘Beyond the black box: Open implementation’, IEEE Softw., 13(1), (1996).
7. G. Andrews, Concurrent Programming: Principles and Practice, Addison-Wesley, 1991.
8. M. Dwyer and V. Wallentine, ‘Object-oriented coordination abstractions for parallel software’, in
Proceedings of the International Conference on Parallel and Distributed Processing Techniques
and Applications (PDPTA’97), June 1997.
9. The Sumatra Project, Toba: A Java-to-C Translator, 1997.
10. SunOS 5.2 Guide to Multi-Thread Programming, Sun Microsystems Inc., 1993.
11. M. Dwyer, M. Craig and E. Runquist, ‘An application-independent concurrency skeleton in
Ada-95’, in Proceedings of the TRI-Ada’96 Conference, December 1996.
12. W. Gropp and B. Smith, ‘Scalable, extensible, and portable numerical libraries’, in Proceedings
of Scalable Parallel Libraries Conference, IEEE, 1994, pp. 87–93.
This Appendix provides sketches of the Java source code for sequential Java and PAGS-based implementations of the SOR problem.
I.1. SOR kernels
public class SparseGrid {
public static void main( String args[] ) {
int problemDim;
float tolerance;
double[][] aGrid, bGrid, tmp;
double change, maxChange = Double.MAX_VALUE; // start above tolerance
// Read problem parameters
problemDim = Integer.parseInt(args[0]);
tolerance = Float.parseFloat(args[1]);
aGrid = new double[problemDim+2][problemDim+2];
bGrid = new double[problemDim+2][problemDim+2];
// Fixed boundary condition along the top edge
for(int i=0; i<problemDim+2; i++) {
aGrid[0][i] = 5;
bGrid[0][i] = 5;
}
while (maxChange >= tolerance) {
maxChange = 0;
// Compute new values into bGrid
for(int i=1; i<problemDim+1; i++)
for(int j=1; j<problemDim+1; j++) {
// 4-point stencil
bGrid[i][j] = (aGrid[i-1][j]+aGrid[i+1][j]+
               aGrid[i][j-1]+aGrid[i][j+1])/4;
change = bGrid[i][j]-aGrid[i][j];
if( change < 0 ) change = aGrid[i][j]-bGrid[i][j];
if( change > maxChange ) maxChange = change;
}
// swap current and new subgrid values
tmp = aGrid; aGrid = bGrid; bGrid = tmp;
}
}
}
import ca.replicatedworkers.*;
import pags.*;
import java.util.Vector;

class SubGrid extends PartitionImpl
    implements Partition, Work {
private SubGrid north, south, east, west;
public double[][] aGrid, bGrid;
// size, runAhead, tolerance and doneComputing are assumed fields of
// SubGrid/PartitionImpl; their declarations are elided in this sketch.

public boolean doWork(Vector newWork,
                      Vector theResults) {
double change, maxChange = 0;
double[][] tmp;
for (int steps=0; steps<runAhead; steps++) {
maxChange = 0;
for(int i=1; i<size+1; i++)
for(int j=1; j<size+1; j++) {
// 4-point stencil
bGrid[i][j] = (aGrid[i-1][j]+aGrid[i+1][j]+
               aGrid[i][j-1]+aGrid[i][j+1])/4;
change = bGrid[i][j]-aGrid[i][j];
if( change < 0 ) change = aGrid[i][j]-bGrid[i][j];
if( change > maxChange ) maxChange = change;
}
tmp = aGrid; aGrid = bGrid; bGrid = tmp;
}
doneComputing = maxChange<tolerance;
return false;
}

public void updateShadow(Border b) {
SubGridBorder sgb = (SubGridBorder)b;
if ( sgb.thePartition() == north )
for(int j=1; j<size+1; j++) aGrid[0][j]=sgb.val[j];
else if ( sgb.thePartition() == south )
for(int j=1; j<size+1; j++) aGrid[size+1][j]=sgb.val[j];
else if ( sgb.thePartition() == east )
for(int j=1; j<size+1; j++) aGrid[j][size+1]=sgb.val[j];
else if ( sgb.thePartition() == west )
for(int j=1; j<size+1; j++) aGrid[j][0]=sgb.val[j];
if ( runAhead>sgb.theLookahead() )
runAhead = sgb.theLookahead();
}
}
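The updateShadow method above receives its border values through a SubGridBorder carrier. The paper does not list that class, so the sketch below is our own guess at a minimal shape, inferred from the call sites (the public val array, thePartition() and theLookahead()):

```java
// Hypothetical sketch of the border carrier used by updateShadow;
// the actual SubGridBorder class is not listed in the paper, so the
// field and method shapes below are inferred from the call sites.
public class SubGridBorder {
    public final double[] val;       // one row/column of border values
    private final Object partition;  // producing partition (typed Object here)
    private final int lookahead;     // steps the consumer may run ahead

    public SubGridBorder(Object partition, double[] val, int lookahead) {
        this.partition = partition;
        this.val = val;
        this.lookahead = lookahead;
    }

    public Object thePartition() { return partition; }
    public int theLookahead() { return lookahead; }
}
```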
I.2. PAGS SOR driver
public class SparseGrid {
public static void main( String arg[] ) {
// read Problem parameters (elided):
// numWorkers, waterMark, numIn,
// problemSize, SubGrid.size, SubGrid.tolerance
Vector partitions = new Vector();
SubGrid current;
int sqrtNumSG = problemSize/SubGrid.size;
int numSubGrids = (problemSize/SubGrid.size)*(problemSize/SubGrid.size);
// Create partitions
for (int i=0; i<numSubGrids; i++) {
partitions.addElement(new SubGrid(i));
}
// Configure partition topology
for (int i=0; i<numSubGrids; i++) {
current = (SubGrid)partitions.elementAt(i);
// set current's north/south/east/west neighbors (elided)
}
// Initialize partition values along the top boundary row
for (int i=0; i<sqrtNumSG; i++)
for(int j=0; j<SubGrid.size+2; j++) {
((SubGrid)partitions.elementAt(i)).aGrid[0][j] = 5;
((SubGrid)partitions.elementAt(i)).bGrid[0][j] = 5;
}
// Create partition association structure
AssociativePool workPool = new AssociativePool(partitions, waterMark);
Configuration theConfig = new Configuration(...);
ReplicatedWorkers theInstance = new ReplicatedWorkers(theConfig, workPool, null,
numProcessors, numIn);
}
}