вход по аккаунту


Accurate performance prediction using visual prototypes

код для вставкиСкачать
Concurrency: Pract. Exper., Vol. 11(11), 615–634 (1999)
Accurate performance prediction using visual
Centre for Parallel Computing, Cavendish School of Computer Science, University of Westminster,
115 New Cavendish Street, London W1M 8JS, UK
Behavioural and performance analysis is a fundamental problem in the development of parallel
(and distributed) programs. To address this problem, models and supporting environments are
required to enable designers to build and analyse their programs. The model we put forward in
this paper combines graphical and textual representations of the program structure and uses
discrete-event simulation for performance and behaviour predictions. A graphical environment
supports our model, providing, amongst other features, a graphical editor, a simulation engine
and a performance and behaviour visualisation tool. A number of case studies using this
environment are also provided for illustration and validation of our model. Prediction errors
observed in comparisons of real execution and simulation of case studies have accuracy to
within 10%. Copyright  1999 John Wiley & Sons, Ltd.
The problem with parallel programs is that, being very complex, their behaviour is difficult
to predict through intuition alone. Thus parallel programs have much in common with other
engineering artefacts: e.g. space vehicles, buildings and electronic circuits. It should not be
surprising (although programmers may resist it) that an engineering approach to parallel
program design is not a dispensable option where satisfactory performance is required.
Engineers employ a variety of tools to the problem in hand: mathematics, modelling skills,
simulation techniques, instrumentation, measurement and visualisation through pictures.
When combined with skill and intuition, such tools enable an experienced engineer to
postulate the design of an artefact, simulate its behaviour and measure its performance (the
latter being a behaviour metric). A tool-based engineering approach is needed in parallel
Commonly, parallel program development methods start with parallelising and porting
a sequential code onto the target machine, and then running it to measure and
analyse its performance. Reparallelisation is required when the achieved performance is
unsatisfactory. This is a time-consuming process and usually entails tuning and debugging
before an acceptable performance from the parallel program is obtained. Rapid prototyping
is a useful approach to the design of (high-performance) parallel software in that complete
algorithms, outline designs or even rough schemes can be evaluated, using performance
∗ Correspondence to: G. R. Ribeiro Justo, Centre for Parallel Computing, Cavendish School of Computer Science,
University of Westminster, 115 New Cavendish Street, London W1M 8JS, UK.
Contract/grant sponsor: EPSRC PSTPA, UK; Contract/grant number: GR/K40468
Contract/grant sponsor: EC; Contract/grant number: CIPA-CT93-0251, CP-93-5383
CCC 1040–3108/99/110615–20$17.50
Copyright  1999 John Wiley & Sons, Ltd.
Received 24 March 1999
modelling, at a relatively early stage in the program development life-cycle, with respect
to possible platform configurations and mapping strategies. Modifying the platform
configurations and program-to-platform mappings permits the prototype design to be
refined, and this process may continue in an evolutionary fashion before any actual parallel
Performance modelling allows performance estimates of parallel systems to be predicted
without actually running the program on the target system itself. There are various reasons
why it may be difficult to investigate the behaviour of a system simply by conducting
live experiments, including non-availability of the system (e.g. the system may not have
been purchased, or may be running in a critical operational context), and experimental
impracticalities (e.g. lack of suitable instrumentation). A model enables the evaluation of
alternatives in a predictive ‘what if’ fashion without the need to conduct live experiments.
Modelling a system consists of abstracting its salient features of interest to the study.
Abstraction is usually necessary in order to obtain a tractable model. However, abstractions
of parallel systems may be difficult to specify as they must also be sufficiently detailed to
be accurate. The solution of a parallel system model is the system’s (predicted) behaviour,
and may be obtained by analysis or simulation. Analytical solutions are highly prized, since
they are widely applicable, but they can generally only be obtained in the simplest cases.
Solutions of any practical value must be obtained by simulation.
A range of performance statistics can be accumulated about the parallel application
program, the operating system software, the hardware, and the interactions between them.
Analysis of the raw behaviour and the derived statistics can help the designer to identify and
remedy algorithmic and architectural bottlenecks which limit the capacity of the system.
The EDPEPPS (Environment for the Design and Performance Evaluation of Portable
Parallel Software)[1] environment described in this paper supports a rapid prototyping
philosophy, based on graphical design, simulation, and visualisation. EDPEPPS supports
the development cycle ‘design–simulate–visualise’ within the same environment, as the
tools are fully integrated. This allows information about the program behaviour generated
by the simulation and visualisation tools to be related to the design. The simulation model
architecture is modular and extensible to allow modifications and change of platforms and
design as and when required. Also, the EDPEPPS environment allows generation of code
for both simulation and real execution to run on the target platform.
In this paper, we first describe a model for representing software architectures (program
global structure) aimed at architectural analysis, especially performance analysis. The idea
is to extend the usual view of architecture description with some abstract concepts of
behaviour and time. These concepts are then interpreted in an execution model which
provides behaviour and performance predictions of the software architecture.
The remainder of the paper is organised as follows. In Section 2 we review current
parallel system performance environments. In Section 3, we describe our software
architecture model with the main features of the model illustrated with examples.
In Section 4, we propose an execution model for performance analysis of software
architectures based on discrete-event simulation. Section 5 describes the main aspects of
the EDPEPPS environment, and Section 6 describes the experience and results of using
our environment with four case studies. Those case studies also illustrate the accuracy of
our simulation model in predicting performance. We finally set out our conclusions and
directions for future research in Section 7.
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Several parallel software modelling environments supporting performance engineering
activities have been developed recently[2] but few of them follow the prototyping approach
of EDPEPPS. In this Section, we describe the most significant of these environments and
compare them with the EDPEPPS environment.
The GRADE (GRAphical Development Environment)[3,4], part of the SEPP (Software
Engineering for Parallel Processing) toolset[5] supports the main activities in the parallel
software development cycle in a similar way to EDPEPPS. The key difference to EDPEPPS
is its graphical language, GRAPNEL[3,4], which is based on its own message-passing
interface implemented on top of PVM[6] and therefore does not model exactly the
semantics of C/PVM (unlike EDPEPPS). In addition, GRAPNEL is a ‘pure’ graphical
language in the sense that even the sequential part of each process must be described
graphically, a potentially time consuming process. In EDPEPPS, combined graphical and
textual representations are permitted.
The HAMLET environment[7] supports the development of real-time applications based
on two specific platform transputers and PowerPCs. EDPEPPS focuses on the development
of portable parallel applications based on PVM and heterogeneous workstation clusters.
HAMLET consists of a design entry system (DES), a specification simulator (HASTE), a
debugger and monitor (INQUEST), and a trace analysis tool (TATOO). However, the tools
are not tightly integrated as in the case of EDPEPPS but are applied separately. Another
limitation of HAMLET, compared with EDPEPPS, is the lack of an animation tool, which
is important for behavioural analysis of parallel applications.
N-MAP[8] proposes a performance/behaviour methodology based on the simulation
of abstract specification of program skeletons from which behaviour and performance
predictions are derived. A specification is based on three components: tasks, processes
and packets. A task refers to a sequential program segment, and the behaviour (and
requirements) of a task is expressed in units of time. Tasks are then ordered in processes
which correspond to ‘virtual processors’. The packets denote the data transferred amongst
virtual processors. The development process in N-MAP consists of providing a set of
specifications for the components above, from which traces of the program simulation
are generated. However, the specifications denote program structures which affect the
performance, and not necessarily important properties of the program design. Therefore,
although N-MAP is claimed to integrate performance and software engineering activities,
it is biased towards abstract performance engineering. This means that the prototypes are
too abstract to enable the easy derivation of a design and implementation of an application
as it is supported in EDPEPPS.
The PEPS project[9] aimed at investigating modelling, characterisation, and monitoring
of PVM programs for Transputer-based, embedded and workstation clusters platforms.
Performance modelling in PEPS focuses on the performance evaluation of computer
architectures. PEPS uses the Simulog simulation toolset (MODLINE) which offers a range
of software and hardware components. The software layer in their model is hardware
independent and based on F77. This approach is similar to EDPEPPS but our language
is based on C and our model defines only 43 instructions compared to 170 in PEPS. In
addition, the PEPS model does not take into account the cache effect whilst EDPEPPS
models both data and instruction caches[10].
The ALPSTONE project[11] supports the development of performance-oriented
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
prototypes based on the BACS description (Basel Algorithm Classification Scheme)[12].
This is in the form of a macroscopic abstraction of program properties, such as process
topology and execution structure, data partitioning and distribution descriptions, and
interaction specifications. From this description based on program skeletons, it is possible
to generate a time model of the algorithm which allows performance estimation and
prediction of the algorithm run time on a particular system with different data and system
sizes. This project differs from EDPEPPS in three ways. Firstly, the language used to
describe the program skeletons in not graphical and too abstract, which means that the
derivation of an implementation involves several refinement steps. Secondly, the approach
to performance prediction is based on analytical modelling and not on discrete-event
simulation models, as EDPEPPS. Finally, unlike EDPEPPS, the various tools of the
environment are not fully integrated.
In spite of the increasing number of Architecture Description Languages (ADLs) currently
available, there is little consensus in the research community on what an ADL is and
what aspects of an architecture should be modelled by an ADL[13]. A framework for
classifying ADLs has been proposed by Medvidovic and Taylor[13]. In their framework,
the authors describe the components, interface, connectors and architectural configurations
as the main modelling features for an ADL. We present below our model and how it relates
to Medvidovic and Taylor’s framework.
3.1. Components
A component usually denotes a unit of computation and is, therefore, the locus of
computation and state[14]. In our model, a component roughly corresponds to a unit
of computation or a process (as we intend to model parallel and distributed systems).
In general, ADLs do not model component semantics beyond their interface (which
corresponds to the interaction points between the component and its environment as
described in the next Section). In [15], for example, the specification of the component
can be optionally defined as a composition of the CSP[16] specifications corresponding to
the protocol of each port.
The restricted semantic models of a component provided by many ADLs usually reduce
the type of analysis that can be carried out at the architectural level. In [17], we have shown
how an extended graphical description can enable us to apply deadlock-free composition
and refinement rules. In [18], the author discusses the importance of causal, timing and
even probabilistic relationships between actions to enable the specification of behaviour of
parallel and distributed systems.
The description of components in our model is divided into two main parts: the design
graphical representation (external view) and the textual representation (internal view). The
external view consists of an abstract view of components of a parallel and distributed
application design, that is the processes, their interfaces (behavioural description) and their
interactions which are represented graphically. The internal view consists of the detailed
behaviour of each process and is represented textually.
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Composite Component
Figure 1. Graphical and textual partial description of a component
In our model, the behaviour of a component is not formally specified, as we are more
concerned with the derivation of an implementation from the architecture, but timing
and stochastic expressions can be included in the description of a component in a Clike fashion[1]. Every graphical object corresponds to some textual representation (we
currently use C/PVM segments of code[6]). An example is shown in Figure 1, where
part of the code of component slave2 is illustrated. Note that the lines of code associated
with graphical symbols are underlined and numbered with the same number shown in the
graphical representation. The figure exemplifies the send (action number 4) and receive
(number 5) actions. The interface actions are currently described in C/PVM[6] and are
explained further in the next Section.
3.2. Interface
A component’s interface specifies the protocols a component uses to interact with its
environment. They usually denote ports through which the component can perform several
actions but little information is provided about the causal or timing constraints between the
interaction. For example, if we assume a simple architecture described in [19], illustrated
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Some Action Symbols
p1 B
p1 C
Figure 2. Representing component interfaces
in Figure 2(a), where component A uses a port 1 to interact with both components B and
C via their ports 1, several possible behaviours could be described. Component A could
select with which component it interacts, defined by the or relation in Figure 2(b). The
interaction between components A and B could enable the action between components A
and C, as shown in Figure 2(c). Finally, the interactions between A and components B and
C could happen independently as illustrated in Figure 2(d). So, in our model, we could use
the descriptions shown in Figure 2(b), (c) and (d) to more specifically model the interface
of component A. A partial order of the actions is defined by the numbers presented within
each action graphical notation.
Actions are classified as either input or output to denote that they require or provide
a service from/to other components, respectively. Graphical inputs are represented by
pointing in triangle-like symbols, and outputs are represented by pointing out trianglelike symbols. Unlike actions in most ADLs, however, each action is uniquely represented,
as illustrated in Figure 2.
3.3. Composite component
Composite components or configurations are the graphs of connected components that
describe the architectural structure. A composite component contains instantiations of
components and connectors (or connections) linking components’ ports. This information
is used to carry out consistency checks (such as valid connections) and to assess
architectural (global) properties[20], such as deadlock and performance. The types of
assessment will certainly depend upon the information provided and, as previously
discussed, many ADLs provide little information to allow architectural analysis.
A composite component is represented in a similar manner to an ordinary component,
as a double box. But a composite component can create instances of components. This is
done by the operation spawn. As we are dealing with parallel and distributed systems, the
location where the instance is to be created can also be represented. Instance creation is
viewed as a special action and, as it denotes control and affects behaviour, it is represented
in the interface of the composite component. In addition, as composite components are the
level at which configurations of interconnected instances are defined, for consistency they
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Application Layer
Message Passing Layer
Operating System Layer
Hardware Layer
Figure 3. The simulation model architecture
are usually closely associated with the configuration they create; this means composite
components cannot usually be represented in isolation. An example of a composite
component, Master2, and its configuration is presented in Figure 1. A family of instances
of a component Slave2 is created.
To analyse performance we use symbolic execution, more specifically discrete-event
simulation, to run the representation of an architecture and produce predictions about its
The simulation model consists of layers, as illustrated in Figure 3, representing specific
aspects of the system such as message passing (currently PVM), operating system (which
models process scheduling and the transport communication layer, either TCP/IP or UDP)
and hardware (which models systems’ resources, such as hosts, CPU and the Ethernet)[21].
Modularity and extensibility are key properties of the model as each layer is further
decomposed into modules, which can be easily reconfigured.
4.1. Time requirements
In many instances of design, timing requirements are critical and must be explicitly
represented, for example, to indicate intervals of time between two actions. In this case,
time intervals are represented in absolute terms. When viewing components at architectural
level, details of their computation are usually disregarded. These details, however, are
important for performance analysis and should be represented in some way[22]. A solution
is to represent a computation by the (absolute) time it takes to be executed (e.g. 0.005 s).
A more accurate representation is in terms of timing equations, which specify the time
as a function of the possible number of operations involved in the computation. The kind
of operation depends on the model used for performance analysis. In our model, timing
expressions can be specified using a cputime function which models the time taken by a
sequence (block) of basic instructions (integer and float arithmetic operations, load and
store operations, function call, etc.) including possible cache accesses (cache hit and miss
operations)[10]. The time taken by the computation also depends on the parameters set for
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
the machine model. For example, the call
is translated into two integer-store operations with cache hit, one integer-store operation
with cache miss, three float-store operations with cache hit, one float-store operation with
cache miss, two accesses to an array, one integer addition, one float multiplication, one
exponential operation and one absolute value operation. The different kinds of operations
are ordered by position in calls to the function cputime.
The simulation model contains an execution predictor component (more details in
Section 5), which checks at simulation time, depending on the type of machine the program
is running, whether the cache is overflown or not, then it chooses the set of instructions with
cache hit or miss.
In case of prototyping, the computation can be denoted by the cputime calls at the end
of each block but conditional branches and loops should still be explicitly defined.
4.2. Probability requirements
Another important aspect of the definition of an action is whether it may or must
happen[18]. Most description languages only model the must case. However, in more
realistic models the distinction between the two relations should exist and, as suggested
by [18], one could even assign probabilities to the occurrences of actions. In our model,
we provide the function prob together with an ‘if’ to enable an action depending on the
probability of its occurrence. Other continuous and discrete distributions can be selected –
for example, uniform, exponential and Poisson[21].
4.3. Portability
To provide portability of the architectural description, the simulation model is modular
and parametrised, enabling us to use the same description to analyse performance of the
software architecture assuming different machines and operating systems by only setting
new parameters. In the current version, the environment can handle machines such as
UltraSparc 10, SUN4s, SuperSparcs and Pentium I and II[10,21].
The main tools of the EDPEPPS environment are the graphical design tool (where the
graphical and textual representation of the architecture is created), the simulator (which
accepts a description of the program and the architecture and generates trace and statistics
files) and the visualisation tool (which accepts the trace and statistics files together with
the graphical representation to provide statistical graphs and animation of the software
architecture). Figure 1 illustrates the EDPEPPS graphical editor and Figure 4 presents a
snapshot of the EDPEPPS visualisation tool.
The graphical design tool combines graphical and textual representation but the
designers usually use the graphical representation, as the text associated with the graphical
elements is automatically created. The designer may, however, enter text (C/PVM code)
directly using the text editor. To maintain consistency between the graphical and textual
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Figure 4. The EDPEPPS visualisation tool
representations, the text associated with a graphical symbol is protected and can only
be modified using the window related with that symbol. This enables the graphical
tool to automatically detect errors. To improve reusability the graphical and textual
representations are stored in separate files.
The visualisation tool provides two main types of support – a step-by-step animation
of the design and statistical information about the various layers of the systems –
from the application level (components and communications) to the operating system
(scheduling and communication protocols) and hardware (processor and network) level.
The important aspect is that at any time the designer can have a snapshot of the whole
system performance. Figure 4 illustrates visualisation within EDPEPPS. The main window
of the visualisation tool shows the animation of the graphical representation where on the
right side the various statistical graphics are presented. Note that during the animation
the components and their interconnections are dynamically shown, as the execution
The design and the visualisation tools are incorporated within the same GUI where the
designer can switch between these two possible modes.
Another useful component of our environment is a Reverse Engineering Tool (RET)
which allows the translation of existing parallel programss (currently C/PVM are
implemented) into graphical design representation. This is provided in order to allow
legacy parallel programs to be analysed with the toolset.
CPU characterisation tools are also an important part of the EDPEPPS environment.
These tools allow the simulator to predict the execution time of each block of sequential
statements of a component, depending on which machine the code is executed. The CPU
time characteriser consists of three tools.
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Design Tool
Figure 5. The EDPEPPS environment architecture
The Machine Characteriser (MC) characterises the performance of a particular machine
by benchmarking the set of instructions. Currently we selected 43 instructions based on
the High Level Language Characterisation work of Savaadra[23] but extended to include
instruction and data cache models[10].
The Program Instruction Characteriser (PIC) performs a static analysis by parsing the
source code of the program. The analysis involves counting the number of instructions in
each block and providing some parameters needed by the simulator for the cache models,
which are evaluated at run time as the parser does not know onto which machine the process
is mapped.
The Execution Predictor (EP) is responsible for the decision of whether the instruction
and data caches were used in the execution of each block depending on the size of the
instruction and the data caches of the machine in question.
Finally, a distributed debugging facility, DDBG[24], has been integrated within the
environment. Using DDBG the user is able to set break points at the graphical and textual
levels for debugging purposes. Currently, DDBG supports C/PVM.
5.1. The development process
The process starts with the graphical design tool (GDT) by building a graph representing
a parallel program design. This graph could be obtained from the reverse engineering tool
(RET) if the parallel application already exists. The graph is composed of computational
tasks and communications. The tool provides graphical representation for communication
calls, such as send and receive, which the user can select to build the required design.
The software designer can then generate (by the click of a button) real code (the current
implementation is based on C and PVM) for both simulation and real execution. Figure 5
illustrates the flow of actions when using the environment.
In the simulation path the source code obtained from the graphical design tool is
analysed by PIC to characterise the code and insert cputime calls at the end of each
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
computational block. The instrumented source files are translated, using a tool based on the
Sage++ toolkit[25], into a queueing network representation in SES/Workbench format. The
SES/Workbench translates the graph file into the Workbench object-oriented simulation
language called SES/sim[26] using an SES utility (sestran). The sim file is then used to
generate an executable model using some SES/Workbench utilities, libraries, declarations
and the communication platform model (currently, PVM is implemented). The simulation
is based on discrete-event modelling. SES/Workbench has been used both to develop
and simulate platform models. Thus the SES/Workbench simulation engine is an intrinsic
part of the toolset. All these actions are hidden from the user and are executed from the
graphical design window by a click on the simulation button. The simulation executable is
run using three input files containing parameters concerning the target virtual environment
(e.g. number of hosts, host names, architecture, the UDP communication characteristics
and the timing costs for the set of instructions used by the EP tool[21]). The UDP model
and the instruction costs are obtained by benchmarking (benchmarks are provided off-line)
the host machines in the network.
The simulation outputs are the execution time, a trace file and a statistics file. These files
are then used by the visualisation tool in conjunction with the current loaded application
to animate the design and visualise the performance of the system. The design can be
modified and the same cycle is repeated until a satisfactory performance is achieved.
In the real execution path the C source files are generated automatically and then
compiled and executed to produce the trace file required for the visualisation/animation
process. This step can be used for validation of simulation results but only when the
target machine is accessible. The visualisation tool offers the designer graphical views
(animation) representing the execution of the designed parallel application as well as the
visualisation of its performance.
Four case studies are presented here to show various aspects of using the EDPEPPS
environment. The first, the COMMS1 benchmark, is used to test the accuracy of the
EDPEPPS communication model on its own. The second case study is the Bessel
Equation, which is a purely sequential C code to test the accuracy of the EDPEPPS CPU
characterisation toolset (Chronos). The third case study uses a prototype of the Pipeline
Processor Farm (PPF) model[27] using an image decoding application where the time
delays for the computation parts were extracted from[27]. This case study shows the
usefulness of EDPEPPS for sizing of an application at an early level of development. The
fourth case study selected is a communication intensive application based on two parallel
Givens Linear Solver methods developed in [28]. The EDPEPPS environment here should
help choose the best algorithm and predict the results of a larger network.
6.1. COMMS1 benchmark
The COMMS1 benchmark is taken from the Parkbench[29] suite (version 3.0). COMMS1
is designed to measure the communication performance of a parallel system by exchanging
of messages of various sizes between two processors. COMMS1 is selected here to
highlight the accuracy of the communication model used.
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Real execution
communication time (sec)
PVM message size
Figure 6. Comparison between predictions and measurements for COMMS1
Figure 6 shows the execution time results for the COMMS1 benchmark between the
predictions and the measurements (averages of 1000 iterations) for two of our machines
(a Sparc20 and a Pentium). The Figure shows a good match between the two curves. The
step-like features are caused by fragmentation of the message into 1500-byte segments at
the IP level. Note that only message sizes of up to 4 Kbytes are used as PVM fragments
messages of a larger size.
6.2. Bessel equation
The computationally intensive application chosen here to validate the CPU model is the
Bessel equation, which is a differential equation defined by
d 2y
+ (x 2 − m2 )y = 0
When m is an integer, solutions to the Bessel differential equation are given by the Bessel
function of the second kind[10], also called the Neumann function or Weber function.
The Bessel function of the second kind is translated in C and executed 100,000 times.
The results of the simulation and real execution obtained for four of our machines (A
Sparc20, a Sparc5, a Pentium and a 486) are given in Figure 7.
The average error obtained for all four machines was about 6.13%. The errors for the
486 machine are larger than those of other machines and this is probably due to the fact that
there is only one cache used for both data and instructions in the 486 machine. However,
for the superscalar machines, Pentium and SuperSparc, the results are encouraging, with
errors of only 1.7% and 3.4%, respectively.
6.3. A case study: PPF with CCITT H.261 decoder
The application chosen here is the pipeline processor farm (PPF) model of a standard image
processing algorithm, the H.261 decoder[30], proposed by Downton et al.[27]. The H.261
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Resolution of the Bessel equation of the second kind
Real Execution (sec)
Simulation (sec)
Difference in absolute value (%)
Time (sec)
FreeBSD 2.1
FreeBSD 2.0.5
Figure 7. Comparison between real execution and simulation for the Bessel equation
algorithm decomposes into a three-stage parallel pipeline: frame initialisation (T1); frame
decoder loop (T2) and frame output (T3). The first and last stages are inherently sequential,
whereas the middle stage contains considerable data parallelism. Thus, for example, the
PPF topology for a middle stage farm of five tasks is shown in Figure 8(a). The number
of possible topologies which solve a given problem are clearly very large, even for the
H.261 algorithm. The PPF model thus implicitly provides a rich set of experiments for
validation of the simulator. The same topological variation in the PPF model leads directly
to performance variation in the algorithm, which, typically, is only poorly understood at the
outset of design. One of the main purposes of the simulation tool in this case is to enable a
designer to identify the optimal topology, quickly and easily, without resorting to run-time
Two experiments for one and five images (frames) were carried out. The number of
processors in Stage T2 is varied from 1 to 5 (T1 and T3 were mapped on the same
processor). In every case, the load is evenly balanced between processors.
The target platform for this case study is a heterogeneous network of up to six
workstations (SUN4s. SuperSparcs and PCs). Timings for the three algorithm stages were
extracted from [27] and inserted as time delays. Figure 8(b) shows the simulated and real
experimental results for speed-up. Observe that the sequential time used to calculate the
speed-up was taken as that of the fastest machine.
As expected, the Figure shows that the 5-frame scenario performs better than the 1-frame
scenario, since the pipeline is fuller in the former case. The difference between simulated
and real speed-ups is below 10% even though the PPF simulation results do not include
packing costs.
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
PPF SpeedUp for 5 & 1 Frames
EDPEPPS Simulator 5 Frames
Real Experiments 5 Frames
EDPEPPS Simulator 1 Frame
Real Experiments 1 Frame
Number of Processors in T2
Figure 8. (a) PPF topology for a three-stage pipeline; (b) comparison between predictions and real
experiments for PPF
6.4. The Givens linear solver
The case study selected here falls into the communication and computation intensive types
of algorithms and can be represented in the following form:
Ax = b
where A is a non-singular square matrix, b is the right-hand-side vector and x is a vector
of unknowns. The Givens rotation method selected here is particularly interesting since it
does not require pivoting – a difficult problem to parallelise – is numerically stable and is
inherently more accurate than other methods[28]. The Givens transformation is defined by
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
a 2 × 2 rotation matrix
where c2 + s 2 = 1.
A Givens rotation is used to eliminate the elements of a vector or a matrix as follows:
−s c
where c = a/( a 2 + b2) and s = b/( a 2 + b2 ).
The Givens algorithm for solving a linear system with N equations can be decomposed
into two computational stages. The first is the triangulation of the initial matrix; this stage
is represented by the execution of the elimination block N times. The second stage is the
substitution block which solves the triangular matrix. The triangulation stage, which is the
most time consuming part (with complexity of O(N 3 ) as opposed to O(N 2 ) for the backsubstitution stage) is parallelised with two different techniques: collective and pipeline.
The back-substitution stage is also programmed differently in the two algorithms, as will
be discussed later.
Initially, block-row data decomposition, which divides the matrix horizontally and
assigns adjacent blocks of rows to neighbour processors, is used in both methods. The first
step (A) in the triangulation stage is the same for both methods. All processors eliminate
the required columns for their rows (except the first row for each processor, which will
be eliminated in the next step). Then, in the collective method, the columns of the first
rows (step B) are eliminated using collective communication (rows are collected by the
processor, which is holding the corresponding row of the current eliminated column, say
sender). The back-substitution stage is started by the last processor and its results are
broadcast to other processors, and then in a similar way all the processors solve their
triangulated matrices in a back-pipeline way.
In the pipeline method instead of using collective communications to eliminate the
columns of the first rows (step B in method 1), the row in question is passed from sender
to its neighbour to eliminate its column and then passed in a pipeline fashion through other
neighbours until all the columns are eliminated. The last processor keeps the lines in a full
matrix to be used in the back-substitution stage later on its own.
The two methods were designed using the EDPEPPS environment, and results for both
simulation and real execution (on a real network) were obtained. The network used for this
algorithm consists of three Ultra-Sparc 10 machines with 300 MHz, one Pentium II with
233 MHz, one Pentium 150 MHz, one Super-Sparc 20 with 75 MHz and one Super-Sparc
5 with 60 MHz. The machines were mapped differently in the two algorithms based on a
few simulation tests and the final optimal mapping only is used for the results shown here.
The mapping for the machines was done in increasing power order (but giving priority
to use the fastest machines if the number of processors is less than the total, seven) for
the pipeline method and in decreasing power order for the collective method. The tests
were done for problem sizes of 256 and 512 equations, but for the 256 size we did not get
any significant speedup relative to the fastest machine in the network as the algorithm is
communication intensive and this will affect the performance of small problem sizes.
Figure 9 shows the results for the simulation and real execution measurements (averages
of 10 times taken at night) for the sequential Givens on the various machines in the network
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Real Execution (sec)
Simulation (sec)
Difference in absolute value (%)
% difference
Time (s)
UltraSparc 10
Pentium II
300 Mhz Solaris 2.6 233Mhz FreeBSD
Pentium I
150Mhz FreeBSD
Figure 9. Comparison between predictions and measurements for sequential Givens for matrix size
512 × 512
for problem sizes of 512 × 512 (similar results were obtained but not shown here for
256 × 256). Note that the operating systems here are different from those of the Bessel
equation case study presented above and hence the machines needed to be benchmarked
again to account for that.
The results show that the average error is well below 10% for all the machines. The
Figure also shows that the maximum error was about 12% for the Pentium I, which here
runs a different version of operating system than that presented for the Bessel equation.
Note that the errors for the UltraSparc (1.5%) and Pentium II (5%) are much better than
the other slower machines. In normal cases more testing and benchmarking needed to be
done on the machines with larger errors to make sure that the parameters obtained for the
CPU instructions do not have unexpected high load which could cause some fluctuations
to the results obtained. However, for this case study we are satisfied with the current error
rate and will use the obtained parameters for the parallel algorithms.
Figure 10 shows the results for the simulation and real execution measurements
(averages of 10 times taken at night) for both the collective and the pipeline methods.
Figure 10 shows clearly that the measurements and predictions for both methods are in
good harmony with maximum error well below 10% (except for one case for the collective
algorithm with seven processors, 13%). These errors contain factors for loading on the
machines which were detected from larger times for some of the runs than others and,
although averaged out with the other measurements, still had some effects. However, all
the measurements including the benchmarks have been performed at night at low load, and
the standard deviation for all the measurements did not exceed 5% on any of the runs. As
expected, the Figure also shows the superiority of the pipeline algorithm over the collective
algorithm even for small numbers of processors. Note that the machines are heterogeneous
and adding more processors sometimes results in increasing the execution time rather
than improving it. For the 512 problem size we obtained speedups of 2 and 2.25 for
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Execution time (sec)
No. of Processors
Figure 10. Comparison between predictions and measurements for Parallel Givens: collective and
pipeline (note that the results displayed for one processor are for the UltraSparc 10 as it is one of
the fastest machine)
four and five processors, respectively, for the pipeline algorithm. Increasing the number of
processors above five (four for the collective) in both cases increases the execution time as
the other processors (other than the UltraSparc and Pentium II – 233 MHz) are considerably
slower than the fastest four. However, in heterogeneous networks of workstations the CPU
utilisation is another important factor which should be considered when analysing the
figures for the speedup or the execution time. This is because a network is a multi-tasking
environment and reducing the load on one fast machine, even at the expense of larger
execution time, will give a chance for other tasks queueing on that machine to be executed
faster. Knowing that the pipeline algorithm is superior to the collective one, the user can
then experiment with adding more powerful processors than available to see if better
speedups can be obtained (for example by substituting the Pentium I and less powerful
machines by UltraSparcs or Pentium IIs). However this experiment could not be validated
and was not performed here.
As for the usability of the environment as a whole, all the above mentioned case studies
were performed in tutorials by students at the MSc (Parallel and Distributed Computing)
level at the University of Westminster with little or no knowledge of PVM and no
previous experience of EDPEPPS but with some guidance from the tutor. The simulator is
transparent to the user with the only interaction through PVMGraph (the graphical design
tool) and PVMVis (the visualisation tool). The environment currently supports C and PVM
and the user needs to be familiar with the syntax of these languages. A demo of how to
build the PPF case study (shown before) in PVMGraph can be seen online in [31].
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
The paper addresses the fundamental limitations of current visual notations for parallel
and distributed software architectures, which do not provide enough information to enable
behaviour and performance analysis. A new model based on an extended graphical and
textual description is put forward, the main features of which are the descriptions of
actions and temporal and stochastic relations between them, instead of the usual simple
interfaces denoting only inputs and outputs. We also claim that the idea of representing
each action with a unique and numbered symbol helps designers to have much insight
about the behaviour of the components at the architectural level without having to look at
the internal details of the component.
Our current environment provided by EDPEPPS is based on a performance-oriented
parallel program design method. The environment supports graphical design, performance
prediction through modelling and simulation, and visualisation of predicted program
behaviour. The designer is not required to leave the graphical design environment to view
the program’s behaviour, since the visualisation is an animation of the graphical program
description, and the transition between design and visualisation viewpoints is virtually
seamless. It is intended that this environment will encourage a philosophy of program
design, based on a rapid synthesis-evaluation design cycle, in the emerging breed of parallel
Success of the environment depends critically on the accuracy of the underlying
simulation system. Preliminary validation experiments for the PVM-based platform model
have been very encouraging, demonstrating errors between the simulation and the real
execution of around 10% for the examples presented in this paper.
Regarding future research, we are now looking at the problem of how to represent
patterns and high-level components. At the moment, components and configurations can be
easily reused by referring to their names – that is, when a component (simple or composite)
is created the tool checks if a component with that name already exists and asks if the
designer wishes to reuse it. There is not, however, a notion of constraints and templates
which are necessary for instantiating a style or pattern. We are particularly interested in
describing styles for performance critical architectures, where components must not only
satisfy behavioural constraints but also performance constraints.
Another important direction of our work is to generalise the simulation model and extend
it to support other platforms, such as MPI.
The authors wish to acknowledge F. Spies, J. Bourgeois, R. Bigeard, and F. Schinkmann
for their contribution to the development of the EDPEPPS environment. Also T. Delaitre
wishes to acknowledge his PhD supervisor S. Poslad for advice on the simulation aspects
of this work.
This project has been partially funded by the EPSRC PSTPA programme, under grant
GR/K40468, and EC contracts CIPA-CT93-0251 and CP-93-5383.
1. T. Delaitre, G. R. R. Justo, F. Spies and S. C. Winter, ‘A graphical toolset for simulation
modelling of parallel systems’, Parallel Comput., 22, 1823–1836 (1997).
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
2. C. Pancake, M. Simmons and J. Yan, ‘Performance evaluation tools for parallel and distributed
systems, Computer 28, 16–19 (1995).
3. P. Kacsuk, G. Dozsa and T. Fadgyas, ‘Designing parallel programs by the graphical language
GRAPNEL’, Microprocess. Microprogram., 41, 625–643 (1996).
4. P. Kacsuk, J. C. Cunha, G. Dozsa, J. Lourenco, T. Fadgyas and T. Antao, ‘A graphical
development and debugging environment for parallel programs’, Parallel Comput., 22, 1747–
1770 (1997).
5. SEPP Web Site.∼sepp.
6. V. S. Sunderam, ‘PVM: a framework for parallel distributed computing, Concurrency: Pract.
Exp., 2(4), 315–339 (1990).
7. P. Pouzet, J. Paris and V. Jorrand, ‘Parallel application design: the simulation approach with
HASTE’, in W. Gentzsch and U. Harms, ed., HPCN 2, 1994, pp. 379–393.
8. A. Ferscha and J. Johnson, ‘Performance prototyping of parallel applications in N-MAP’, in
2nd Int. Conf. on Algorithms & Architectures for Parallel Processing, IEEE CS Press, June
1996, pp. 84–91.
9. PEPS Partners, ‘PEPS Bulletin, The Bulletin of the Performance Evaluation of Parallel Systems
Project’, EEC PEPS Esprit 6942, 1993.
10. J. Bourgeois, ‘CPU Modelling in EDPEPPS, EDPEPPS EPSRC Project (GR/K40468) D3.1.6
(EDPEPPS/35)’, Centre for Parallel Computing, University of Westminster, London, June
11. W. Kuhn and H. Burkhart, ‘The ALPSTONE Project: an overview of a performance modelling
environment’, in 2nd Int. Conf. on HiPC’96, McGraw Hill, 1996, pp. 491–496.
12. H. Burkhart, et al., ‘BACS: Basel algorithm classification scheme, version 1.1’, Technical
Report 93-3, Universität Basel, URZ+IFI, 1993.
13. N. Medvidovic and R. Taylor, ‘A framework for classifying and comparing architectures
description languages’, in 6th European Soft. Eng. Conf., ACM Press, Sep. 1997, pp. 60–76.
14. M. Shaw, et al., ‘Abstractions for software architecture and tools to support them’, IEEE Trans.
Softw. Eng., 21(4), 314–335 (1995).
15. R. Allen, ‘A formal approach to software architecture’, PhD thesis, School of Computer
Science, Carnegie Mellon University, May 1997.
16. C. A. R. Hoare, Communicating Sequential Processes, Prentice-Hall, 1985.
17. G. R. R. Justo and P. R. F. Cunha, ‘Framework for developing extensible and reusable parallel
and distributed applications’, in IEEE Int. Conf. on Algorithms & Architectures for Parallel
Processing, IEEE CS Press, 1996, pp. 29–36.
18. L. F. Pires, ‘Architectural notes: a framework for distributed systems development’, PhD thesis,
Centre for Telematics and Inf. Tech., The Netherlands, Sep. 1994.
19. J. Magee, N. Dulay and J. Kramer, ‘A constructive development environment for parallel
and distributed programs’, in 2nd Int. Workshop on Configurable Distributed Systems, SEI,
Carnegie Mellon University, IEEE Computer Society Press, March 1994.
20. M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline,
Prentice Hall, 1996.
21. T. Delaitre, et al., ‘Final model definition’, EDPEPPS EPSRC Project (GR/K40468) D3.1.4
(EDPEPPS/23), Centre for Parallel Computing, University of Westminster, London, March
22. J. K. Hollingsworth and M. Steele, ‘Grindstone: a test suite for parallel performance tools’,
Technical Report UMIACS-TR-96-73, Institute of Advanced Computer Studies and Computer
Science Department, University of Maryland, College Park, 1996.
23. R. H. Saavedra-Barrera and A. J. Smith, ‘Performance characterisation of optimising
compilers’, IEEE Trans. Softw. Eng., 21(7), (1995).
24. J. C. Cunha, J. Lourenco and T. Antao, ‘A debugging engine for parallel and distributed
environment’, in Proc. DAPSYS’96, 1st Austrian-Hungarian Workshop on Distributed and
Parallel Systems, Miskolc, Hungary, 1996, pp. 111–118.
25. F. Bodin, et al., ‘Sage++: an object-oriented toolkit and class library for building Fortran and
C++ restructuring tools’, Proc. 2nd Annual Object-Oriented Numerics Conf., 1994.
26. K. Sheehan and M. Esslinger, ‘The SES/sim modeling language’, Proc. Society for Computer
Simulation, San Diego CA, July 1989, pp. 25–32.
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
27. A. C. Downton, R. W. S. Tregidgo and A. Cuhadar, ‘Top-down structured parallelisation of
embedded image processing applications’, IEE Proc.-Vis. Image Signal Process., 141(6), 431–
437 (1994).
28. J. Papay, M. J. Zemerly and G. R. Nudd, ‘Pipelining the Givens linear solver on distributed
memory machines’, Supercomput. J., XII-3(65), 37–42.
29. R. Hockney and M. Berry, ‘Public International Benchmarks for Parallel Computers Report-1’,
Tech. Rep., Parkbench Committee, 1994.
30. CCITT, ‘Draft Revisions of Recommendation H261: video codec for audiovisual services at p
× 64 Kbit/s’, Signal Process. Image Commun., 2(2), 221–239 (1990).
31. S. Randoux,˜edpepps/demo html/ppf.html
32. A. Beguelin, J. Dongarra, A. Geist and R. Manchek, ‘HeNCE: a heterogeneous network
computing environment’, Sci. Program., 3(1), 49–60 (1994).
33. T. Bemmerl, ‘The TOPSYS architecture’, in H. Burkhart, editor, CONPAR90-VAPPIV
Conference, Zurich, Switzerland, LNCS 457, Springer, Sep. 1995, pp. 732–743.
34. R. Suppi et al., ‘Simulation in parallel software design’, Proc. Euro-PDS’97, Barcelona, 1997,
pp. 51–60.
35. N. Fang and H. Burkhart, ‘PEMPI: from MPI to standard programming environment’, in
Scalable Parallel Libraries Conference II SPLC’94, Mississippi, 1994, pp. 31–38.
36. Ian Foster, Designing and Building Parallel Programs, Addison-Wesley, 1995.
37. T. Ludwig, et al., ‘The TOOL-SET – an integrated tool environment for PVM’, in EuroPVM’95,
Lyon, France, Sep. 1995. Tech. Rep. 95-02, Ecole Normale Superieure de Lyon.
38. P. Newton and J. Dongarra, ‘Overview of VPE: A visual environment for message-passing’,
Heterogeneous Computing Workshop, 1995.
39. J. T. Stasko, ‘The PARADE environment for visualizing parallel program executions’,
Technical Report GITGVU-95-03, Graphics, Visualization and Usability Center, Georgia Inst.
of Tech., 1994.
Copyright  1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 615–634 (1999)
Без категории
Размер файла
212 Кб
visual, using, performance, prediction, prototype, accurate
Пожаловаться на содержимое документа