INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING
Int. J. Numer. Meth. Engng. 44, 641–656 (1999)
A GENERALISED CONJUGATE RESIDUAL METHOD FOR
THE SOLUTION OF NON-SYMMETRIC SYSTEMS OF
EQUATIONS WITH MULTIPLE RIGHT-HAND SIDES
FREDERIK JAN LINGEN ∗
Koiter Institute Delft / Department of Civil Engineering, Delft University of Technology,
P.O. Box 5048, 2600 GA Delft, The Netherlands
SUMMARY
This paper presents an iterative algorithm for solving non-symmetric systems of equations with multiple right-hand sides. The algorithm is an extension of the Generalised Conjugate Residual method (GCR) and combines the advantages of a direct solver with those of an iterative solver: it does not have to restart from scratch for every right-hand side, it tends to require less memory than a direct solver, and it can be implemented efficiently on a parallel computer. We will show that the extended GCR algorithm can be competitive with a direct solver when running on a single processor. We will also show that the algorithm performs well on a Cray T3E parallel computer. Copyright © 1999 John Wiley & Sons, Ltd.
KEY WORDS:
iterative solver; non-symmetric linear system; multiple right-hand sides
1. INTRODUCTION
Most numerical methods for solving time-dependent partial differential equations lead to a system of equations with multiple right-hand sides. Normally, the right-hand sides are not all available at the same time, since each right-hand side depends on the solution for the previous right-hand side. Such a system of equations is usually solved with a direct solution method, such as LU-factorization. Direct solvers have the advantage that the coefficient matrix only has to be factorized once. After that, every right-hand side can be solved with a relatively inexpensive forward and backward substitution. However, direct solvers may require much memory and are difficult to implement on parallel computers. Iterative solvers, on the other hand, require less memory and are relatively easy to implement on parallel computers. Unfortunately, they are not well suited for systems of equations with multiple right-hand sides, since they must restart from scratch for every right-hand side.
To overcome this problem, Farhat and his co-workers1 extended the Conjugate Gradient (CG)
algorithm to systems of equations with multiple right-hand sides. They have shown that the extended CG algorithm combines the advantages of a direct solver with those of an iterative solver:
it does not have to restart from scratch for every right-hand side, it usually requires less memory
∗ Correspondence to: Frederik Jan Lingen, Koiter Institute Delft / Department of Civil Engineering, Delft University of Technology, P.O. Box 5048, 2600 GA Delft, The Netherlands. E-mail: ej.lingen@ct.tudelft.nl
CCC 0029–5981/99/050641–16$17.50
Copyright © 1999 John Wiley & Sons, Ltd.
Received 22 December 1997
Revised 7 May 1998
than a direct solver, and it is easy to implement on a parallel computer. The extended CG algorithm, however, can only be used when the coefficient matrix is symmetric positive definite. Hence, a different approach is required when the coefficient matrix is non-symmetric. A promising approach is to extend the Generalised Conjugate Residual (GCR) method in a way similar to the extension of the CG method.
The aim of this paper is to assess how suitable the extended GCR method is for solving non-symmetric systems of equations with multiple right-hand sides. For this purpose, we used the extended GCR method to solve two convection–diffusion problems. These two test problems were solved both on a single processor of an SGI Origin200 and on multiple processors of a Cray T3E. The single-processor performance of the extended GCR algorithm was also compared with a direct solver based on LDU-factorization. In addition, the algorithm was applied to a third convection–diffusion problem to identify its limitations.
The remainder of this paper is organized as follows. First, Section 2 presents the basic GCR
algorithm for solving a system of equations with one right-hand side. Next, the GCR algorithm is
extended to multiple right-hand sides in Section 3. After that, Section 4 discusses the performance
of the extended GCR method when it is used to solve two convection–diffusion problems. Next,
Section 5 points out the limitations of the extended GCR method. Finally, Section 6 presents the
conclusions.
2. THE BASIC GCR ALGORITHM
GCR2 is a Krylov method for solving a linear system of equations of the form

Ka = f    (1)

The basic GCR algorithm is shown in Figure 1. The algorithm constructs a sequence of approximate solutions a^j, residuals r^j = f − Ka^j, and search vectors s^j and v^j that are used to update the approximate solutions and residuals. The algorithm stops iterating when the scaled residual norm ‖r^j‖/‖f‖ is smaller than a specified tolerance ε. The initial approximation a^0 may be chosen arbitrarily. A common choice is a^0 = 0. Notice that the superscript j is the iteration number and should not be confused with the subscript i that will be used for the time step number. The symbol ⟨·, ·⟩ denotes the scalar product between two vectors.
The search vectors v^j which are generated by the GCR algorithm are orthonormal and span the search space in which the residual vector is minimised. That is,

‖r^j‖ = min_{w ∈ V^j} ‖f − w‖

where V^j = span(v^1, …, v^j). The GCR algorithm shown in Figure 1 relies on the modified Gram–Schmidt procedure to make the search vectors v^j orthogonal. While this procedure is more stable than the classical Gram–Schmidt procedure, it is ill-suited for parallel computing because it requires a large number of global communications. A more efficient orthogonalization procedure for parallel computers is discussed in Appendix I.
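To make the steps in Figure 1 concrete, here is a minimal serial sketch of the basic GCR iteration in Python with NumPy. This is an illustration only, not the paper's implementation: it uses dense matrices, no preconditioner, and the common choice of taking the initial search vector equal to the current residual.

```python
import numpy as np

def gcr(K, f, tol=1e-5, max_iter=200):
    """Basic GCR: minimise ||f - K a|| over an expanding search space.

    The search vectors v^j = K s^j are kept orthonormal with modified
    Gram-Schmidt, as in Figure 1 of the paper.
    """
    a = np.zeros(len(f))          # initial approximation a^0 = 0
    r = f.copy()                  # residual r^0 = f - K a^0
    S, V = [], []                 # search vectors s^j and v^j = K s^j
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break                 # scaled residual norm below tolerance
        s = r.copy()              # initial search vector = residual
        v = K @ s
        for sk, vk in zip(S, V):  # modified Gram-Schmidt against v^k
            beta = v @ vk
            v = v - beta * vk
            s = s - beta * sk     # same update keeps v = K s
        nv = np.linalg.norm(v)    # nv == 0 would be a breakdown
        v = v / nv
        s = s / nv
        alpha = r @ v
        a = a + alpha * s         # update approximate solution
        r = r - alpha * v         # update residual
        S.append(s)
        V.append(v)
    return a, S, V
```

Updating s and v with the same coefficients preserves the invariant v^j = K s^j, so the residual update stays consistent with the solution update. Breakdown (nv = 0) is not handled here.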
When GCR is compared with the better-known GMRES method (see Reference 3), three major differences can be noticed. The first difference is that there is freedom in choosing the initial search vector ŝ^j in the GCR algorithm, which makes it more flexible than GMRES. In this paper we have adopted the common choice ŝ^j = r^{j−1}. The second difference is that GCR constructs
Figure 1. The basic GCR algorithm
two sequences of search vectors, whereas GMRES constructs only one sequence. This means that GCR requires twice as many vector updates as GMRES. Moreover, twice as many vectors must be stored. The third difference between GCR and GMRES is that GCR may break down. This happens when ‖v̂^j‖ = 0 after the orthogonalisation procedure. It can be proven2 that this breakdown does not occur if K is positive definite. Breakdown can be avoided by choosing a different initial search vector ŝ^j.
In practice, an iterative method like GCR is not used to solve the system of equations (1) directly. Instead, this system is first transformed to a preconditioned system of equations:

P⁻¹Ka = P⁻¹f ⇔ K̃a = f̃

By using a good preconditioner P, the number of iterations can be reduced significantly.4
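As a sketch of what this transformation looks like, here is a left preconditioner in Python/NumPy. A simple Jacobi (diagonal) preconditioner is used purely for illustration; the paper itself uses a Neumann–Neumann preconditioner combined with the Schur-complement approach.

```python
import numpy as np

def jacobi_left_precondition(K, f):
    """Left preconditioning with P = diag(K).

    Returns K~ = P^-1 K and f~ = P^-1 f; the preconditioned system
    K~ a = f~ has the same solution as K a = f. Jacobi is chosen only
    for illustration, not because the paper uses it.
    """
    d = np.diag(K)                   # P^-1 is division by the diagonal
    return K / d[:, None], f / d
```

A better-conditioned K̃ means the Krylov solver needs fewer iterations, and here fewer search vectors to store.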
3. EXTENDING THE GCR ALGORITHM TO MULTIPLE RIGHT-HAND SIDES
In this section, the GCR algorithm will be extended to systems of equations with multiple right-hand sides of the form

Ka_i = f_i,    i = 1, 2, …    (2)

We assume that the right-hand sides are not all available at the same time, but that f_{i+1} depends on a_i. Suppose that the first solution a_1 has been obtained with the standard GCR algorithm in n_1 iterations. In addition to the solution, GCR has also generated the search vectors v_1^j and s_1^j, with j = 1, …, n_1. These search vectors can be used to compute an initial guess a_2^0 for the second
Figure 2. By re-using the search vectors s_1^1 and s_1^2, only one additional search vector s_2^1 is required to compute the second solution a_2
solution as follows:

a_2^0 = 0;  r_2^0 = f_2
for j = 1, …, n_1
    α = ⟨r_2^0, v_1^j⟩
    a_2^0 = a_2^0 + α s_1^j
    r_2^0 = r_2^0 − α v_1^j
endfor
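The projection loop above can be sketched in Python/NumPy as follows, assuming lists S and V holding the previously generated search vectors, with V orthonormal and V[k] = K @ S[k] (the names are illustrative, not from the paper):

```python
import numpy as np

def initial_guess(f, S, V):
    """Project a new right-hand side onto the stored search space.

    S, V are the search vectors generated for previous right-hand
    sides, with V orthonormal and V[k] = K @ S[k]. Returns the initial
    guess a0 and the matching residual r0 = f - K a0.
    """
    a0 = np.zeros_like(f)
    r0 = f.copy()
    for s, v in zip(S, V):
        alpha = r0 @ v          # component of the residual along v
        a0 += alpha * s
        r0 -= alpha * v         # residual update consistent with a0
    return a0, r0
```

Because each v is the image of the corresponding s under K, subtracting α·v from the residual is exactly equivalent to adding α·s to the solution.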
Next, we apply the basic GCR algorithm with one modification: the search vectors v_2^j that are generated for the second solution are also orthogonalised with respect to the n_1 search vectors that have been generated for the first solution. In this way, the search space is enlarged by the previously generated search vectors, and fewer iterations will be required to achieve convergence. This is illustrated in Figure 2, which shows two successive solutions of a system of equations with three degrees of freedom. After the first solution a_1 has been computed, the two search vectors s_1^1 and s_1^2 have been constructed. Hence, the computation of the second solution a_2 requires only the construction of one additional search vector s_2^1. Note that the search vectors s_i^j are orthogonal in this figure. In reality this will not be the case, since only the search vectors v_i^j are orthogonalised.
The procedure described above can be easily generalised to an arbitrary number of right-hand sides. First, an initial guess a_i^0 for the ith solution is computed by using the search vectors that have been generated for the previous solutions. Next, the modified GCR algorithm is applied to compute the solution a_i. Since each new search vector v_i^j is also orthogonalized with respect to the search vectors that have been generated for the previous solutions, the size of the search space of the GCR algorithm increases with each right-hand side that is solved. As a consequence, the number of iterations per right-hand side will decrease.
Figure 3 shows the extended GCR algorithm for computing the ith solution of the system of equations (2). To make the algorithm easier to understand, all search vectors v^j and s^j have been numbered from 1 to n_i. That is, the subscript i that denotes the solution number has been dropped. The integer n_i is the total number of search vectors that have been generated up to the ith right-hand side. By definition, n_0 = 0. Notice that the initial search vector ŝ^m has been taken equal to the residual r_i^{j−1}, as in Section 2.
Figure 3. The extended GCR algorithm for computing the ith solution of equation (2)
The amount of work and the amount of memory that are required are proportional to the total number of search vectors that have been generated. To make the extended GCR algorithm competitive with direct solution algorithms, it is essential to keep the total number of generated search vectors as small as possible. This can be achieved by preconditioning the system of equations (2) as shown in Section 2.
Another way to reduce the amount of work and the memory usage is to use the extended GCR algorithm in combination with the Schur-complement, or sub-structuring, approach. In this approach, the computational domain of the partial differential equation is decomposed into a number of sub-domains, and the corresponding system of equations is reduced to a much smaller system of equations which is defined only on the interfaces between the sub-domains. In this way, the search vectors become smaller, so that they require less memory. In addition, the orthogonalisation procedure will require less computational work. For details, the reader is referred to Farhat and his co-workers,1 or Smith et al.5
4. PERFORMANCE OF THE EXTENDED GCR ALGORITHM
The performance of the extended GCR algorithm was assessed by applying it to two time-dependent convection–diffusion problems: one 2-D problem and one 3-D problem. These test problems were solved both on a single-processor computer and on a parallel computer. On the single-processor computer, a direct solver based on LDU-factorization was also used.
The extended GCR algorithm was implemented within the finite element package ScaFiEP.6 ScaFiEP runs both on sequential and parallel computers, and it provides high-level routines for performing sparse matrix–vector products and for orthogonalising a vector with respect to a sequence of other vectors. Hence, implementing the extended GCR algorithm was straightforward.
4.1. The two test problems
Both test problems are based on the following time-dependent (dimensionless) convection–diffusion equation:

∂u(x, t)/∂t = D∇²u(x, t) − ⟨q, ∇u(x, t)⟩ + s(x, t)   for x in Ω
u(x, t) = g(x, t)   for x on Γ_d                                   (3)
⟨n, ∇u(x, t)⟩ = h(x, t)   for x on Γ_n

where D is the diffusivity, ∇ is the gradient operator, q is the flow, and s is a source. Both the diffusivity and the flow are assumed to be constant. Ω denotes the domain of interest, Γ_d is the boundary of Ω where Dirichlet boundary conditions are prescribed, and Γ_n is the boundary where Neumann boundary conditions are prescribed. The vector n is the normal vector on Γ_n. The source s is assumed to be time-independent and is given by

s(x) = s₀ e^{−‖x − x_s‖²/R²}    (4)

Thus, the source has a spherical distribution with magnitude s₀, centre x_s and radius R.
To obtain a numerical solution of the convection–diffusion equation, it is discretized with the finite element method. This yields the following system of equations:

M da(t)/dt = −Ka(t) + f(t)    (5)

where M is the mass matrix and K is the stiffness matrix, which is non-symmetric. The vector a consists of the (approximate) values of the state u within the nodes of the finite element mesh. The vector f contains the Neumann boundary conditions and the contribution of the source s. The components of a which correspond to the nodes on the boundary Γ_d contain the prescribed Dirichlet boundary conditions. When the time derivative in the above expression is replaced by a backward difference equation, the following system of equations is obtained:

((1/Δt) M + K) a_{i+1} = (1/Δt) M a_i + f_{i+1}    (6)

where Δt is the size of a time step, and a_i ≡ a(i·Δt). This equation is of the same form as equation (2).
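The resulting time-stepping loop can be sketched as follows (a dense-matrix Python/NumPy toy; np.linalg.solve stands in for the extended GCR solver, and f_of_t is an illustrative stand-in for assembling f_{i+1}):

```python
import numpy as np

def backward_euler_steps(M, K, f_of_t, a0, dt, n_steps):
    """March equation (6): (M/dt + K) a_{i+1} = (M/dt) a_i + f_{i+1}.

    The coefficient matrix A is the same for every time step, while
    the right-hand side changes; this is exactly the setting in which
    a multiple-right-hand-side solver pays off.
    """
    A = M / dt + K
    a = a0.copy()
    history = [a0.copy()]
    for i in range(1, n_steps + 1):
        rhs = (M / dt) @ a + f_of_t(i * dt)  # new right-hand side f_{i+1}
        a = np.linalg.solve(A, rhs)          # stand-in for the GCR solve
        history.append(a.copy())
    return history
```

Note that each right-hand side depends on the previous solution a_i, which is why the right-hand sides cannot all be formed in advance.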
A 2-D convection–diffusion problem constitutes the first test problem. This problem has a rectangular domain with corner points (0, 0) and (80, 20). At the boundaries x = 0 and x = 80, the state
Figure 4. The finite element mesh of the 2-D test problem
u has the prescribed value zero. At the other two boundaries, y = 0 and y = 20, the y-derivative of u is zero, that is ⟨n, ∇u⟩ = 0. The state is zero within the entire rectangle at time t = 0. The diffusivity D = 0·001, and the flow qᵀ = (1, 0). Thus, the flow is parallel to the x-axis. The magnitude of the source s₀ = 1, its centre x_sᵀ = (20, 14) and its radius R = 5. This test problem is discretized with quadratic triangular elements. A coarse version of the resulting finite element mesh is shown in Figure 4. The actual mesh consists of 4235 nodes and 2038 elements.
The second test problem is a 3-D convection–diffusion problem. In this case, the domain is an elongated box with corner points (0, 0, 0) and (80, 20, 20). At the boundaries x = 0 and x = 80, u = 0, and at the other boundaries ⟨n, ∇u⟩ = 0. Again, u = 0 within the entire box at time t = 0. We adopt qᵀ = (1, 0, 0), which means that the flow is parallel to the x-axis. The diffusivity and the source are the same as in the 2-D problem, except that the centre of the source x_sᵀ = (20, 14, 14). A coarse version of the finite element mesh is shown in Figure 5. The actual mesh consists of 43 841 nodes and 29 550 quadratic tetrahedral elements.
4.2. The single-processor performance of the extended GCR algorithm
The single-processor performance of the extended GCR algorithm was assessed by solving both test problems on one processor of an SGI Origin200. To reduce the number of iterations and increase the efficiency of the GCR algorithm, we used the Neumann–Neumann preconditioner described by Smith5 in combination with the Schur-complement approach. The number of sub-domains was set to 12 for the 2-D test problem, and to 32 for the 3-D test problem. For both test problems, the convergence tolerance of the GCR solver was set to ε = 10⁻⁵. The size of each time step was set to Δt = 1.
The results that were obtained for the 2-D test problem are summarised in Figures 6 and 7. The first figure shows the number of extended GCR iterations per time step. The number of iterations decreases rapidly from 14 to 3 within the initial eight time steps. After that, the number of iterations decreases slowly to two iterations, and in some of the last time steps only one iteration is required. Since the number of search vectors that are generated increases with each time step, the memory requirements of the extended GCR algorithm will also increase. In the first time step, 13 MB of memory were required to store all the data, including the preconditioner. In the last time step, 2 × 115 = 230 search vectors had been generated, and the memory usage was 14 MB. This is slightly more than the 13 MB that was required for a direct solver.
The extended GCR algorithm requires only one or two iterations in later time steps, because the successive solutions are very similar (see Section 5). The small number of iterations is not the result of reaching the steady-state solution, because it takes 60 time steps for a particle to travel from the centre of the source to the outflow boundary. Consequently, it would take at least 60 time steps to reach the steady-state solution.
Figure 5. The finite element mesh of the 3-D test problem
Figure 6. Number of GCR-iterations per time step for the 2-D test problem
Figure 7 shows the time that was required to solve the system of equations within each time step. We observe that the solution time per time step of the extended GCR algorithm decreases by a factor of 2·3 within the first eight time steps. This is less than the relative decrease in the number of iterations per time step, because the orthogonalisation of the search vectors constitutes a significant
Figure 7. The solution times per time step when the 2-D test problem was solved with the extended GCR algorithm, and with a forward and backward substitution
fraction of the solution time. Figure 7 also shows that a direct solver is much faster for the 2-D test problem than the GCR solver: the time required for a forward and backward substitution is only 0·055 s. The total execution time of the direct solver was 3·9 s, while the extended GCR algorithm required 10 s, which is 2·6 times longer. Although the extended GCR algorithm is much slower than a direct solver, it is still markedly faster than a conventional iterative solver: when the 2-D test problem was solved with GMRES, it took 2·3 times as long to solve all 40 time steps, namely 23 s.
The graphs in Figures 8 and 9 show the results that were obtained for the 3-D test problem. The first graph, which represents the number of GCR iterations per time step, is very similar to the graph in Figure 6. Again, the number of iterations decreases rapidly in the first few time steps, and decreases slowly in the remaining time steps. The memory that was required to store all data increased from 220 MB in the first time step, to 290 MB in the last time step, when 2 × 203 = 406 search vectors had been stored. This is still 44 per cent less than the 520 MB that was required by the direct solver.
The second graph shows that the solution time per time step of the extended GCR solver decreases by a factor of 3·4 in the first eight time steps. This factor is in better agreement with the decrease in the number of iterations per time step (a factor of 5·2), because now the orthogonalisation of the search vectors constitutes a smaller fraction of the solution time per time step. The second graph also shows that the solution time per time step of the extended GCR solver is almost the same as the time required for a forward and backward substitution with a direct solver. Overall, the GCR solver was faster than the direct solver, because the factorization of the coefficient matrix required a large amount of time. In total, the extended GCR solver required 12 min to set up the Schur-complement and the preconditioner and to solve all time steps, while the direct solver required 22 min to create the LDU-factorization and solve all 40 time steps with a forward and backward substitution. If the number of time steps increases, however, the factorization of the coefficient matrix will constitute a smaller fraction of the total computation time, and the direct solver will become faster than the GCR solver. Solving the 3-D test problem with GMRES took 37 min, which is 3·0 times longer than it took with the extended GCR algorithm.
Figure 8. Number of GCR-iterations per time step for the 3-D test problem
Figure 9. The solution times per time step when the 3-D test problem was solved with the extended GCR algorithm, and
with a forward and backward substitution
4.3. The parallel performance of the extended GCR algorithm
To assess whether the extended GCR algorithm can be used efficiently on a parallel computer, both test problems were also solved on a Cray T3E. Since our implementation of the extended GCR algorithm is based on the domain decomposition approach, we partitioned the two finite element meshes into a number of sub-domains. The number of sub-domains of each mesh was fixed, so that the total execution time depended only on the number of processors that were used. As a consequence, the number of sub-domains per processor varied. To ensure that the work load did not vary too much between the processors, each processor was assigned the same number of sub-domains. Hence, the number of sub-domains had to be equal to, or a multiple of, the number of
Figure 10. The speedup for the 2-D test problem on a Cray T3E. Both the speedup with input and output (I/O) and the
speedup without I/O are shown
processors that were used. The number of sub-domains for the 2-D test problem was fixed at 12, and for the 3-D test problem it was fixed at 32. Note that these numbers are the same as used on the Origin200 for testing the single-processor performance.
Figure 10 shows the speedup for the 2-D test problem as a function of the number of processors that were used. The speedup is defined as the ratio between the execution time on one processor and the execution time on p processors. In the ideal case, the speedup is equal to the number of processors that are used. In practice, the speedup will be lower because of load imbalance between the processors, inter-processor communication overhead, and input and output (I/O) overhead (the time required to read the input data from disk and write the results to disk).
The figure shows both the speedup when the I/O is included in the execution time, and the speedup when the I/O is excluded from the execution time. The speedup without I/O is significantly higher than the speedup with I/O, since the I/O constitutes a relatively large fraction of the total execution time (up to 43 per cent when 12 processors are used). The speedup without I/O is still lower than the ideal speedup, largely because of inter-processor communication overhead. This is not surprising, since the total execution time without I/O is only 1·7 s on 12 processors.
Figure 11 shows the speedup that was obtained for the 3-D test problem. To solve the 3-D problem, at least four processors were required to store all the data. Hence, the speedup is now defined as the ratio between the execution time on four processors and the execution time on p processors. The speedup without I/O is now much closer to the ideal speedup, because the inter-processor communication overhead constitutes a smaller fraction of the total execution time. It is still lower than the ideal speedup, largely because of load imbalance between the processors. This load imbalance is caused by differences in the geometries of the sub-domains. As a result, the time required to set up and apply the Schur-complement and the preconditioner varies between the processors.
The total execution time (including I/O) of the extended GCR algorithm decreases from 270 s on 4 processors to 53 s on 32 processors. When the 3-D test problem was solved with GMRES on 32 processors, it took 2·5 times longer than the extended GCR algorithm, namely 130 s.
Figure 11. The speedup for the 3-D test problem on a Cray T3E. Both the speedup with input and output (I/O) and the
speedup without I/O are shown
As the above results show, the extended GCR algorithm can perform well on a parallel computer, especially when the system of equations is relatively large. The algorithm is also easy to
implement on a parallel computer, since its main operations are a matrix–vector product and the
orthogonalization of the search vectors. Both these operations are straightforward to parallelize
within the domain decomposition approach.
5. LIMITATIONS OF THE EXTENDED GCR ALGORITHM
The extended GCR algorithm works well for the two test problems because the solutions for two successive right-hand sides are very similar. Consequently, the solution for the new right-hand side will have a large component which lies in the space spanned by the previously generated search vectors. This is sketched in Figure 12, which shows two contour lines of two successive solutions of the 2-D test problem. Because the two solutions largely overlap (indicated by the dark gray area), it is possible to construct a good approximation of the solution a_{i+1} by using the previously generated search vectors.
The extended GCR algorithm does not perform as well when the solution for one right-hand side has a large component which lies outside the space spanned by the previously generated search vectors. To illustrate this, the extended GCR algorithm was used to compute the steady-state solution of the 2-D test problem for different positions x_s of the centre of the source (see equation (4)). The centre of the source was moved in 11 steps along the y-axis, so that in the ith step the centre was positioned at (20, 2(i − 1)). The radius of the source was reduced to R = 1 to ensure that the source at step i did not overlap with the sources at steps i − 1 and i + 1.
The steady-state solutions for different positions of the source were obtained by solving the following systems of equations:

Ka_i = f_i
Figure 12. Schematic representation of the contour lines of two successive solutions of the 2-D test problem. The overlap
between the two solutions is indicated by the dark gray area
Figure 13. Schematic representation of the contour lines of the steady state solutions for two successive positions of the
source. The overlap between the solutions is indicated by the dark gray area
where the matrix K is the same as in equation (5), and the vector f_i contains the contributions of the source. Because the mass matrix M is no longer present, it was necessary to increase the diffusivity to a value of 0·1 to avoid unrealistic oscillations in the solutions.
Figure 13 sketches the contour lines of the steady-state solution for two successive positions of the source. Because the overlap between the two solutions is small (the dark gray area), it is not possible to construct a good approximation of the solution a_{i+1} by using the previously generated search vectors. This is confirmed in Figure 14, which shows the number of extended GCR iterations for each position of the source. The number of iterations decreases only slowly, compared with Figure 6.
6. CONCLUSIONS
The extended GCR algorithm can be an efficient method for solving non-symmetric systems of equations with multiple right-hand sides. On a single processor, whether the extended GCR algorithm is a viable alternative to a direct solver depends on the problem that is solved. For instance, when the extended GCR algorithm was used to solve a 3-D convection–diffusion problem on one processor of an SGI Origin200, it was competitive with a direct solver based on LDU-factorization. To be precise, the extended GCR algorithm was 1·8 times faster and required 44 per cent less
Figure 14. Number of GCR-iterations that are required to obtain the steady-state solution for different positions of the source
memory than the direct solver. However, when the extended GCR algorithm was used to solve a 2-D test problem, it performed worse than the direct solver. The GCR algorithm was 2·6 times slower and required the same amount of memory.
The extended GCR algorithm is especially suited for parallel computers, because it is relatively easy to implement, and it has the potential to scale well. For example, when the extended GCR algorithm was run on a Cray T3E, the solution time of the 3-D test problem decreased by a factor of 6·7 when 8 times as many processors were used. When the extended GCR algorithm was used to solve the 2-D test problem, however, the solution time decreased only by a factor of 8·4 when 12 times as many processors were used. In this case the inter-processor communication constituted a significant fraction of the total solution time.
Although the extended GCR algorithm requires more memory than a conventional iterative solver, such as GMRES, it can be significantly faster. When GMRES was used to solve the two test problems on the Origin200, it took 2·3 times longer for the 2-D test problem, and 3·0 times longer for the 3-D test problem. In some specific cases, however, the extended GCR algorithm will not perform much better than a conventional iterative solver. This was confirmed by applying the extended GCR algorithm to another, specially designed, convection–diffusion problem.
APPENDIX
An efficient orthogonalization procedure for parallel computers
An important part of the GCR algorithm (and of many other iterative solution algorithms) is the orthogonalization of a new search vector v̂^j with respect to the previous, orthonormal, search
vectors v^k, k = 1, …, j − 1. This is usually done with the modified Gram–Schmidt procedure:

for k = 1, …, j − 1
    β = ⟨v̂^j, v^k⟩
    v̂^j = v̂^j − β v^k
endfor
v^j = v̂^j

where v^j denotes the new, orthogonal, search vector. Note that in the GCR algorithm, this vector must also be normalized. For simplicity's sake, we will disregard the normalization step. While the modified Gram–Schmidt procedure is numerically stable, it performs poorly on a parallel computer because the j − 1 scalar products depend on each other; that is, the kth scalar product cannot be performed before the (k − 1)th product has been completed. As a consequence, the modified Gram–Schmidt procedure requires j − 1 global communications between the processors.
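A minimal serial sketch of this step in Python/NumPy (illustration only; in a parallel implementation each of the j − 1 dot products would be one global reduction, executed strictly in sequence):

```python
import numpy as np

def modified_gram_schmidt_step(v_hat, V):
    """Orthogonalize v_hat against the orthonormal vectors in the list V.

    Each dot product uses the partially updated vector, which is what
    makes the procedure stable but also forces the j-1 scalar products
    (global communications, in parallel) to run one after another.
    """
    v = v_hat.copy()
    for vk in V:
        beta = v @ vk        # must wait for the previous update of v
        v = v - beta * vk
    return v
```

Normalization is omitted here, as in the pseudocode above.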
An alternative procedure is classical Gram–Schmidt:

for k = 1, …, j − 1
    β_k = ⟨v̂^j, v^k⟩
endfor
for k = 1, …, j − 1
    v̂^j = v̂^j − β_k v^k
endfor
v^j = v̂^j

Since the scalar products in the first part of the algorithm are independent of each other, this procedure requires only one global communication. However, classical Gram–Schmidt is not numerically stable. To improve its stability, it is possible to apply classical Gram–Schmidt more than once, as was shown by Bjorck.7
In this research, we adopted another orthogonalization procedure that is at least as stable as modified Gram–Schmidt and requires only two, maybe three, global communications. This procedure, which is also briefly mentioned by Bjorck, makes use of the Conjugate Gradient method. To explain the procedure, we write the new, orthogonalized, search vector as

v^j = v̂^j − V^j q^j

where V^j is a rectangular matrix whose j − 1 columns are equal to the previous, orthonormal, search vectors v^k, k = 1, …, j − 1. The vector q^j has length j − 1 and is the solution of the following least-squares problem:

min_{q^j} ‖v̂^j − V^j q^j‖

The vector q^j can be computed efficiently by applying the Conjugate Gradient (CG) method to the system of normal equations

(V^j)ᵀ V^j q^j = (V^j)ᵀ v̂^j

Since the columns of the matrix V^j are (nearly) orthonormal, the coefficient matrix (V^j)ᵀ V^j is very close to the identity matrix. Consequently, only a very few CG iterations are required to compute a sufficiently accurate solution.
Within each iteration, the CG method requires a matrix–vector product of the form y = (V^j)ᵀ V^j x, which can be performed efficiently in two steps. First, compute z = V^j x, then compute y = (V^j)ᵀ z. A practical implementation of these two steps is given below:

z = 0
for k = 1, …, j − 1
    z = z + x_k v^k
endfor
for k = 1, …, j − 1
    y_k = ⟨v^k, z⟩
endfor

where x_k and y_k denote the kth component of x and y, respectively. The first step requires no communication between the processors. The second step requires only one global communication, because all the scalar products are independent of each other.
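Putting the pieces together, the CG-based orthogonalization might be sketched as follows in Python/NumPy (an illustrative serial version; the fixed iteration count n_cg and the break tolerance are assumptions for this sketch, not values from the paper):

```python
import numpy as np

def cg_orthogonalize(v_hat, V, n_cg=3):
    """Orthogonalize v_hat against the near-orthonormal columns of V.

    Solves (V^T V) q = V^T v_hat with a few CG iterations. Because
    V^T V is close to the identity, very few iterations are needed;
    each iteration costs one V @ p product and one V.T @ z product
    (one global communication on a parallel machine).
    """
    b = V.T @ v_hat                  # first global communication
    if np.linalg.norm(b) == 0.0:
        return v_hat.copy()          # already orthogonal to all columns
    q = np.zeros_like(b)
    r = b.copy()                     # residual of the normal equations
    p = r.copy()
    for _ in range(n_cg):
        Ap = V.T @ (V @ p)           # the two-step matrix-vector product
        alpha = (r @ r) / (p @ Ap)
        q = q + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < 1e-14 * np.linalg.norm(b):
            break                    # converged early (exactly orthonormal V)
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return v_hat - V @ q             # v^j = v_hat - V^j q^j
```

When the columns of V are exactly orthonormal, the normal-equation matrix is the identity and the loop converges in a single iteration.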
ACKNOWLEDGEMENTS
I would like to thank Rene de Borst for reading an early version of this paper, and for giving
valuable feedback. I would also like to thank the centre for High Performance Applied Computing
at the Delft University of Technology for providing computation time on their Cray T3E.
REFERENCES
1. C. Farhat, L. Crivelli and F. X. Roux, ‘Extending substructure based iterative solvers to multiple load and repeated analyses’, Comput. Meth. Appl. Mech. Engng., 117, 195–209 (1994).
2. S. C. Eisenstat, H. C. Elman and M. H. Schultz, ‘Variational iterative methods for nonsymmetric systems of linear equations’, SIAM J. Numer. Anal., 20(2), 345–357 (1983).
3. Y. Saad and M. H. Schultz, ‘GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems’, SIAM J. Sci. Statist. Comput., 7(3), 856–869 (1986).
4. Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company, Boston, 1996.
5. B. Smith, P. Bjørstad and W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, Cambridge, 1996.
6. E. J. Lingen, ‘Design and implementation of a finite element package for parallel computers’, in Proc. 2nd Int. DIANA Conf. Computational Mechanics, Amsterdam, The Netherlands, June 1997.
7. A. Bjorck, ‘Numerics of Gram–Schmidt orthogonalization’, Linear Algebra Appl., 197–198, 279–316 (1994).