INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING
Int. J. Numer. Meth. Engng. 44, 641–656 (1999)

A GENERALISED CONJUGATE RESIDUAL METHOD FOR THE SOLUTION OF NON-SYMMETRIC SYSTEMS OF EQUATIONS WITH MULTIPLE RIGHT-HAND SIDES

FREDERIK JAN LINGEN*
Koiter Institute Delft / Department of Civil Engineering, Delft University of Technology, P.O. Box 5048, 2600 GA Delft, The Netherlands

SUMMARY

This paper presents an iterative algorithm for solving non-symmetric systems of equations with multiple right-hand sides. The algorithm is an extension of the Generalised Conjugate Residual (GCR) method and combines the advantages of a direct solver with those of an iterative solver: it does not have to restart from scratch for every right-hand side, it tends to require less memory than a direct solver, and it can be implemented efficiently on a parallel computer. We will show that the extended GCR algorithm can be competitive with a direct solver when running on a single processor. We will also show that the algorithm performs well on a Cray T3E parallel computer. Copyright © 1999 John Wiley & Sons, Ltd.

KEY WORDS: iterative solver; non-symmetric linear system; multiple right-hand sides

1. INTRODUCTION

Most numerical methods for solving time-dependent partial differential equations lead to a system of equations with multiple right-hand sides. Normally, the right-hand sides are not all available at the same time, since each right-hand side depends on the solution for the previous right-hand side. Such a system of equations is usually solved with a direct solution method, such as LU-factorization. Direct solvers have the advantage that the coefficient matrix only has to be factorized once. After that, every right-hand side can be solved with a relatively inexpensive forward and backward substitution. However, direct solvers may require much memory and are difficult to implement on parallel computers.
Iterative solvers, on the other hand, require less memory and are relatively easy to implement on parallel computers. Unfortunately, they are not well suited for systems of equations with multiple right-hand sides, since they must restart from scratch for every right-hand side. To overcome this problem, Farhat and his co-workers1 extended the Conjugate Gradient (CG) algorithm to systems of equations with multiple right-hand sides. They have shown that the extended CG algorithm combines the advantages of a direct solver with those of an iterative solver: it does not have to restart from scratch for every right-hand side, it usually requires less memory than a direct solver, and it is easy to implement on a parallel computer. The extended CG algorithm, however, can only be used when the coefficient matrix is symmetric positive definite. Hence, a different approach is required when the coefficient matrix is non-symmetric.

A promising approach is to extend the Generalised Conjugate Residual (GCR) method in a way similar to the extension of the CG method. The aim of this paper is to assess how suitable the extended GCR method is for solving non-symmetric systems of equations with multiple right-hand sides. For this purpose, we used the extended GCR method to solve two convection–diffusion problems. These two test problems were solved both on a single processor of an SGI Origin200 and on multiple processors of a Cray T3E. The single-processor performance of the extended GCR algorithm was also compared with a direct solver based on LDU-factorization.

* Correspondence to: Frederik Jan Lingen, Koiter Institute Delft / Department of Civil Engineering, Delft University of Technology, P.O. Box 5048, 2600 GA Delft, The Netherlands. E-mail: ej.lingen@ct.tudelft.nl

Received 22 December 1997
Revised 7 May 1998
In addition, the algorithm was applied to a third convection–diffusion problem to identify its limitations.

The remainder of this paper is organized as follows. First, Section 2 presents the basic GCR algorithm for solving a system of equations with one right-hand side. Next, the GCR algorithm is extended to multiple right-hand sides in Section 3. After that, Section 4 discusses the performance of the extended GCR method when it is used to solve two convection–diffusion problems. Next, Section 5 points out the limitations of the extended GCR method. Finally, Section 6 presents the conclusions.

2. THE BASIC GCR ALGORITHM

GCR2 is a Krylov method for solving a linear system of equations of the form

    Ka = f    (1)

The basic GCR algorithm is shown in Figure 1. The algorithm constructs a sequence of approximate solutions a^j, residuals r^j = f − Ka^j, and search vectors s^j and v^j that are used to update the approximate solutions and residuals. The algorithm stops iterating when the scaled residual norm ‖r^j‖/‖f‖ is smaller than a specified tolerance ε. The initial approximation a^0 may be chosen arbitrarily. A common choice is a^0 = 0. Notice that the superscript j is the iteration number and should not be confused with the subscript i that will be used for the time step number. The symbol ⟨·,·⟩ denotes the scalar product between two vectors.

The search vectors v^j which are generated by the GCR algorithm are orthonormal and span the search space in which the residual vector is minimised. That is

    ‖r^j‖ = min_{w ∈ V^j} ‖f − w‖

where V^j = span(v^1, …, v^j). The GCR algorithm shown in Figure 1 relies on the modified Gram–Schmidt procedure to make the search vectors v^j orthogonal. While this procedure is more stable than the classical Gram–Schmidt procedure, it is ill-suited for parallel computing because it requires a large number of global communications. A more efficient orthogonalization procedure for parallel computers is discussed in the Appendix.
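The iteration of Figure 1 can be sketched in NumPy as follows. This is a hypothetical dense-matrix sketch for illustration only, not the paper's implementation; the function name `gcr` and its arguments are illustrative:

```python
import numpy as np

def gcr(K, f, tol=1e-8, max_iter=None):
    """Basic GCR for the system Ka = f (sketch of the algorithm in Figure 1)."""
    n = len(f)
    max_iter = max_iter or n
    a = np.zeros(n)              # initial approximation a^0 = 0
    r = f.astype(float).copy()   # residual r^0 = f - K a^0 = f
    S, V = [], []                # stored search vector pairs s^j, v^j
    for _ in range(max_iter):
        if np.linalg.norm(r) / np.linalg.norm(f) < tol:
            break                # scaled residual norm below tolerance
        s = r.copy()             # common choice: initial search vector = residual
        v = K @ s
        # modified Gram-Schmidt: orthogonalise v against the previous v^k,
        # applying the same updates to s so that v = K s is preserved
        for sk, vk in zip(S, V):
            beta = vk @ v
            v -= beta * vk
            s -= beta * sk
        nv = np.linalg.norm(v)   # nv = 0 here would signal breakdown
        v /= nv
        s /= nv
        alpha = v @ r            # minimise the residual over the new direction
        a += alpha * s
        r -= alpha * v
        S.append(s)
        V.append(v)
    return a
```

Because each v^j equals K s^j, updating a with s^j and r with v^j keeps the invariant r = f − Ka at every iteration.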
Figure 1. The basic GCR algorithm

When GCR is compared with the better-known GMRES method (see Reference 3), three major differences can be noticed. The first difference is that there is freedom in choosing the initial search vector ŝ^j in the GCR algorithm, which makes it more flexible than GMRES. In this paper we have adopted the common choice ŝ^j = r^{j−1}. The second difference is that GCR constructs two sequences of search vectors, whereas GMRES constructs only one sequence. This means that GCR requires twice as many vector updates as GMRES. Moreover, twice as many vectors must be stored. The third difference between GCR and GMRES is that GCR may break down. This happens when ‖v̂^j‖ = 0 after the orthogonalisation procedure. It can be proven2 that this breakdown does not occur if K is positive definite. Breakdown can be avoided by choosing a different initial search vector ŝ^j.

In practice, an iterative method like GCR is not used to solve the system of equations (1) directly. Instead, this system is first transformed to a preconditioned system of equations:

    P⁻¹Ka = P⁻¹f  ⇔  K̃a = f̃

By using a good preconditioner P, the number of iterations can be reduced significantly.4

3. EXTENDING THE GCR ALGORITHM TO MULTIPLE RIGHT-HAND SIDES

In this section, the GCR algorithm will be extended to systems of equations with multiple right-hand sides of the form

    Ka_i = f_i,  i = 1, 2, …    (2)

We assume that the right-hand sides are not all available at the same time, but that f_{i+1} depends on a_i. Suppose that the first solution a_1 has been obtained with the standard GCR algorithm in n_1 iterations. In addition to the solution, GCR has also generated the search vectors v_1^j and s_1^j, with j = 1, …, n_1. These search vectors can be used to compute an initial guess a_2^0 for the second
solution as follows:

    a_2^0 = 0;  r_2^0 = f_2
    for j = 1, …, n_1
        β = ⟨r_2^0, v_1^j⟩
        a_2^0 = a_2^0 + β s_1^j
        r_2^0 = r_2^0 − β v_1^j
    endfor

Next, we apply the basic GCR algorithm with one modification: the search vectors v_2^j that are generated for the second solution are also orthogonalised with respect to the n_1 search vectors that have been generated for the first solution. In this way, the size of the search space is enlarged with the previously generated search vectors, and fewer iterations will be required to achieve convergence. This is illustrated in Figure 2, which shows two successive solutions of a system of equations with three degrees of freedom. After the first solution a_1 has been computed, the two search vectors s_1^1 and s_1^2 have been constructed. Hence, the computation of the second solution a_2 requires only the construction of one additional search vector s_2^1. Note that the search vectors s_i^j are orthogonal in this figure. In reality this will not be the case, since only the search vectors v_i^j are orthogonalised.

Figure 2. By re-using the search vectors s_1^1 and s_1^2, only one additional search vector s_2^1 is required to compute the second solution a_2

The procedure described above can be easily generalised to an arbitrary number of right-hand sides. First, an initial guess a_i^0 for the i-th solution is computed by using the search vectors that have been generated for the previous solutions. Next, the modified GCR algorithm is applied to compute the solution a_i. Since each new search vector v_i^j is also orthogonalized with respect to the search vectors that have been generated for the previous solutions, the size of the search space of the GCR algorithm increases with each right-hand side that is solved. As a consequence, the number of iterations per right-hand side will decrease. Figure 3 shows the extended GCR algorithm for computing the i-th solution of the system of equations (2).
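The initial-guess loop shown above can be written out in NumPy as follows. This is an illustrative sketch: `initial_guess` is a hypothetical helper name, and `S`, `V` stand for the lists of paired search vectors s^j, v^j from earlier solves, with the v^j orthonormal and v^j = K s^j:

```python
import numpy as np

def initial_guess(S, V, f_new):
    """Seed the next solve by projecting the new right-hand side onto the
    space spanned by the previously generated search vectors."""
    a0 = np.zeros_like(f_new, dtype=float)
    r0 = f_new.astype(float).copy()
    for s, v in zip(S, V):
        beta = r0 @ v        # beta = <r0, v^j>
        a0 += beta * s       # build up the initial guess
        r0 -= beta * v       # remove the component already covered
    return a0, r0            # r0 is orthogonal to every stored v^j
```

Since v^j = K s^j, the invariant K a0 + r0 = f_new holds on return, so the modified GCR iteration can start from a0 with residual r0.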
To make the algorithm easier to understand, all search vectors v^j and s^j have been numbered from 1 to n_i. That is, the subscript i that denotes the solution number has been dropped. The integer n_i is the total number of search vectors that have been generated up to the i-th right-hand side. By definition, n_0 = 0. Notice that the initial search vector ŝ^m has been taken equal to the residual r_i^{j−1}, as in Section 2.

Figure 3. The extended GCR algorithm for computing the ith solution of equation (2)

The amount of work and the amount of memory that are required are proportional to the total number of search vectors that have been generated. To make the extended GCR algorithm competitive with direct solution algorithms, it is essential to keep the total number of generated search vectors as small as possible. This can be achieved by preconditioning the system of equations (2) as shown in Section 2.

Another way to reduce the amount of work and the memory usage is to use the extended GCR algorithm in combination with the Schur-complement, or sub-structuring, approach. In this approach, the computational domain of the partial differential equation is decomposed into a number of sub-domains, and the corresponding system of equations is reduced to a much smaller system of equations which is defined only on the interfaces between the sub-domains. In this way, the search vectors become smaller, so that they require less memory. In addition, the orthogonalisation procedure will require less computational work. For details, the reader is referred to Farhat and his co-workers,1 or Smith et al.5
4. PERFORMANCE OF THE EXTENDED GCR ALGORITHM

The performance of the extended GCR algorithm was assessed by applying it to two time-dependent convection–diffusion problems: one 2-D problem and one 3-D problem. These test problems were solved both on a single-processor computer and on a parallel computer. On the single-processor computer, a direct solver based on LDU-factorization was also used.

The extended GCR algorithm was implemented within the finite element package ScaFiEP.6 ScaFiEP runs both on sequential and parallel computers, and it provides high-level routines for performing sparse matrix–vector products and for orthogonalising a vector with respect to a sequence of other vectors. Hence, implementing the extended GCR algorithm was straightforward.

4.1. The two test problems

Both test problems are based on the following time-dependent (dimensionless) convection–diffusion equation:

    ∂u(x, t)/∂t = D∇²u(x, t) − ⟨q, ∇u(x, t)⟩ + s(x, t)   for x in Ω    (3)
    u(x, t) = g(x, t)                                     for x on Γ_d
    ⟨n, ∇u(x, t)⟩ = h(x, t)                               for x on Γ_n

where D is the diffusivity, ∇ is the gradient operator, q is the flow, and s is a source. Both the diffusivity and the flow are assumed to be constant. Ω denotes the domain of interest, Γ_d is the boundary of Ω where Dirichlet boundary conditions are prescribed, and Γ_n is the boundary where Neumann boundary conditions are prescribed. The vector n is the normal vector on Γ_n.

The source s is assumed to be time-independent and is given by

    s(x) = s_0 exp(−‖x − x_s‖² / R²)    (4)

Thus, the source has a spherical distribution with magnitude s_0, centre x_s and radius R.

To obtain a numerical solution of the convection–diffusion equation, it is discretized with the finite element method. This yields the following system of equations:

    M da(t)/dt = −Ka(t) + f(t)    (5)

where M is the mass matrix and K is the stiffness matrix, which is non-symmetric. The vector a consists of the (approximate) values of the state u within the nodes of the finite element mesh.
The vector f contains the Neumann boundary conditions and the contribution of the source s. The components of a which correspond with the nodes on the boundary Γ_d contain the prescribed Dirichlet boundary conditions. When the time derivative in the above expression is replaced by a backward difference approximation, the following system of equations is obtained:

    (1/Δt M + K) a_{i+1} = (1/Δt) M a_i + f_{i+1}    (6)

where Δt is the size of a time step, and a_i ≡ a(i · Δt). This equation is of the same form as equation (2).

A 2-D convection–diffusion problem constitutes the first test problem. This problem has a rectangular domain with corner points (0, 0) and (80, 20). At the boundaries x = 0 and x = 80, the state u has the prescribed value zero. At the other two boundaries y = 0 and y = 20, the y-derivative of u is zero, that is ⟨n, ∇u⟩ = 0. The state is zero within the entire rectangle at time t = 0. The diffusivity D = 0·001, and the flow qᵀ = (1, 0). Thus, the flow is parallel to the x-axis. The magnitude of the source s_0 = 1, its centre x_sᵀ = (20, 14) and its radius R = 5. This test problem is discretized with quadratic triangular elements. A coarse version of the resulting finite element mesh is shown in Figure 4. The actual mesh consists of 4235 nodes and 2038 elements.

Figure 4. The finite element mesh of the 2-D test problem

The second test problem is a 3-D convection–diffusion problem. In this case, the domain is an elongated box with corner points (0, 0, 0) and (80, 20, 20). At the boundaries x = 0 and x = 80, u = 0, and at the other boundaries ⟨n, ∇u⟩ = 0. Again, u = 0 within the entire box at time t = 0. We adopt qᵀ = (1, 0, 0), which means that the flow is parallel to the x-axis. The diffusivity and the source are the same as in the 2-D problem, except that the centre of the source is x_sᵀ = (20, 14, 14). A coarse version of the finite element mesh is shown in Figure 5.
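As an illustration of the time-stepping scheme of equation (6), the following sketch solves a 1-D analogue with upwind finite differences instead of the paper's finite elements. All sizes and coefficients here are illustrative, not those of the test problems:

```python
import numpy as np

# 1-D analogue of equation (6): (M/dt + K) a_{i+1} = (M/dt) a_i + f_{i+1},
# on the interval (0, 1) with u = 0 at both ends.
n = 50                        # number of interior grid points
h = 1.0 / (n + 1)             # grid spacing
D, q, dt = 0.001, 1.0, 0.1    # diffusivity, flow, time step

# K combines central diffusion and upwind convection (hence non-symmetric)
K = (np.diag(np.full(n, 2 * D / h**2 + q / h))
     + np.diag(np.full(n - 1, -D / h**2), 1)
     + np.diag(np.full(n - 1, -D / h**2 - q / h), -1))
M = np.eye(n)                 # lumped (unit) mass matrix
f = np.ones(n)                # constant source term

A = M / dt + K                # the same coefficient matrix in every time step
a = np.zeros(n)               # zero initial state
for _ in range(20):
    a = np.linalg.solve(A, M @ a / dt + f)
```

Because A does not change between steps, every time step supplies a new right-hand side for the same system of equations, which is exactly the situation that the extended GCR algorithm targets; a direct solver would instead factorize A once and reuse the factors.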
The actual mesh consists of 43 841 nodes and 29 550 quadratic tetrahedral elements.

4.2. The single-processor performance of the extended GCR algorithm

The single-processor performance of the extended GCR algorithm was assessed by solving both test problems on one processor of an SGI Origin200. To reduce the number of iterations and increase the efficiency of the GCR algorithm, we used the Neumann–Neumann preconditioner described by Smith et al.5 in combination with the Schur-complement approach. The number of sub-domains was set to 12 for the 2-D test problem, and to 32 for the 3-D test problem. For both test problems, the convergence tolerance of the GCR solver was set to ε = 10⁻⁵. The size of each time step was set to Δt = 1.

The results that were obtained for the 2-D test problem are summarised in Figures 6 and 7. The first figure shows the number of extended GCR iterations per time step. Apparently, the number of iterations decreases rapidly from 14 to 3 within the initial eight time steps. After that, the number of iterations decreases slowly to two iterations, and in some of the last time steps only one iteration is required.

Since the number of search vectors that are generated increases with each time step, the memory requirements of the extended GCR algorithm will also increase. In the first time step, 13 MB of memory were required to store all the data, including the preconditioner. In the last time step, 2 × 115 = 230 search vectors had been generated, and the memory usage was 14 MB. This is slightly more than the 13 MB that was required for a direct solver.

The extended GCR algorithm requires only one or two iterations in later time steps, because the successive solutions are very similar (see Section 5). The small number of iterations is not the result of reaching the steady-state solution, because it takes 60 time steps for a particle to travel from the centre of the source to the outflow boundary.
Consequently, it would take at least 60 time steps to reach the steady-state solution.

Figure 5. The finite element mesh of the 3-D test problem

Figure 6. Number of GCR iterations per time step for the 2-D test problem

Figure 7 shows the time that was required to solve the system of equations within each time step. We observe that the solution time per time step of the extended GCR algorithm decreases by a factor 2·3 within the first eight time steps. This is less than the relative decrease of the number of iterations per time step, because the orthogonalisation of the search vectors constitutes a significant fraction of the solution time.

Figure 7. The solution times per time step when the 2-D test problem was solved with the extended GCR algorithm, and with a forward and backward substitution

Figure 7 also shows that a direct solver is much faster for the 2-D test problem than the GCR solver: the time required for a forward and backward substitution is only 0·055 s. The total execution time of the direct solver was 3·9 s, while the extended GCR algorithm required 10 s, which is 2·6 times longer. Although the extended GCR algorithm is much slower than a direct solver, it is still markedly faster than a conventional iterative solver: when the 2-D test problem was solved with GMRES, it took 2·3 times as long to solve all 40 time steps, namely 23 s.

The graphs in Figures 8 and 9 show the results that were obtained for the 3-D test problem. The first graph, which represents the number of GCR iterations per time step, is very similar to the graph in Figure 6. Again, the number of iterations decreases rapidly in the first few time steps, and decreases slowly in the remaining time steps.
The memory that was required to store all data increased from 220 MB in the first time step to 290 MB in the last time step, when 2 × 203 = 406 search vectors had been stored. This is still 44 per cent less than the 520 MB that was required by the direct solver.

The second graph shows that the solution time per time step of the extended GCR solver decreases by a factor 3·4 in the first eight time steps. This factor is in better agreement with the decrease in the number of iterations per time step (a factor 5·2), because now the orthogonalisation of the search vectors constitutes a smaller fraction of the solution time per time step. The second graph also shows that the solution time per time step of the extended GCR solver is almost the same as the time required for a forward and backward substitution with a direct solver.

Overall, the GCR solver was faster than the direct solver, because the factorization of the coefficient matrix required a large amount of time. In total, the extended GCR solver required 12 min to set up the Schur-complement and the preconditioner and to solve all time steps, while the direct solver required 22 min to create the LDU-factorization and solve all 40 time steps with a forward and backward substitution. If the number of time steps increases, however, the factorization of the coefficient matrix will constitute a smaller fraction of the total computation time, and the direct solver will become faster than the GCR solver. Solving the 3-D test problem with GMRES took 37 min, which is 3·0 times longer than it took with the extended GCR algorithm.

Figure 8. Number of GCR iterations per time step for the 3-D test problem

Figure 9. The solution times per time step when the 3-D test problem was solved with the extended GCR algorithm, and with a forward and backward substitution
4.3. The parallel performance of the extended GCR algorithm

To assess whether the extended GCR algorithm can be used efficiently on a parallel computer, both test problems were also solved on a Cray T3E. Since our implementation of the extended GCR algorithm is based on the domain decomposition approach, we partitioned the two finite element meshes into a number of sub-domains. The number of sub-domains of each mesh was fixed, so that the total execution time depended only on the number of processors that were used. As a consequence, the number of sub-domains per processor varied. To ensure that the work load did not vary too much between the processors, each processor was assigned the same number of sub-domains. Hence, the number of sub-domains had to be equal to, or a multiple of, the number of processors that were used. The number of sub-domains for the 2-D test problem was fixed at 12, and for the 3-D test problem it was fixed at 32. Note that these numbers are the same as used on the Origin200 for testing the single-processor performance.

Figure 10 shows the speedup for the 2-D test problem as a function of the number of processors that were used. The speedup is defined as the ratio between the execution time on one processor and the execution time on p processors. In the ideal case, the speedup is equal to the number of processors that are used. In practice, the speedup will be lower because of load imbalance between the processors, inter-processor communication overhead, and input and output (I/O) overhead (the time required to read the input data from disk and write the results to disk).

Figure 10. The speedup for the 2-D test problem on a Cray T3E. Both the speedup with input and output (I/O) and the speedup without I/O are shown
The figure shows both the speedup when the I/O is included in the execution time, and the speedup when the I/O is excluded from the execution time. The speedup without I/O is significantly higher than the speedup with I/O, since the I/O constitutes a relatively large fraction of the total execution time (up to 43 per cent when 12 processors are used). The speedup without I/O is still lower than the ideal speedup, largely because of inter-processor communication overhead. This is not surprising, since the total execution time without I/O is only 1·7 s on 12 processors.

Figure 11 shows the speedup that was obtained for the 3-D test problem. To solve the 3-D problem, at least four processors were required to store all the data. Hence, the speedup is now defined as the ratio between the execution time on four processors and the execution time on p processors. The speedup without I/O is now much closer to the ideal speedup, because the inter-processor communication overhead constitutes a smaller fraction of the total execution time. It is still lower than the ideal speedup, largely because of load imbalance between the processors. This load imbalance is caused by differences in the geometries of the sub-domains. As a result, the time required to set up and apply the Schur-complement and the preconditioner varies between the processors. The total execution time (including I/O) of the extended GCR algorithm decreases from 270 s on 4 processors to 53 s on 32 processors. When the 3-D test problem was solved with GMRES on 32 processors, it took 2·5 times longer than the extended GCR algorithm, namely 130 s.

Figure 11. The speedup for the 3-D test problem on a Cray T3E.
Both the speedup with input and output (I/O) and the speedup without I/O are shown

As the above results show, the extended GCR algorithm can perform well on a parallel computer, especially when the system of equations is relatively large. The algorithm is also easy to implement on a parallel computer, since its main operations are a matrix–vector product and the orthogonalization of the search vectors. Both these operations are straightforward to parallelize within the domain decomposition approach.

5. LIMITATIONS OF THE EXTENDED GCR ALGORITHM

The extended GCR algorithm works well for the two test problems because the solutions for two successive right-hand sides are very similar. Consequently, the solution for the new right-hand side will have a large component which lies in the space spanned by the previously generated search vectors. This is sketched in Figure 12, which shows two contour lines of two successive solutions of the 2-D test problem. Because the two solutions largely overlap (indicated by the dark gray area), it is possible to construct a good approximation of the solution a_{i+1} by using the previously generated search vectors.

The extended GCR algorithm does not perform as well when the solution for one right-hand side has a large component which lies outside the space spanned by the previously generated search vectors. To illustrate this, the extended GCR algorithm was used to compute the steady-state solution of the 2-D test problem for different positions x_s of the centre of the source (see equation (4)). The centre of the source was moved in 11 steps along the y-axis, so that in the i-th step the centre was positioned at (20, 2(i − 1)). The radius of the source was reduced to R = 1 to ensure that the source at step i did not overlap with the sources at steps i − 1 and i + 1. The steady-state solutions for different positions of the source were obtained by solving the following systems of equations:

    Ka_i = f_i
where the matrix K is the same as in equation (5), and the vector f_i contains the contributions of the source. Because the mass matrix M is no longer present, it was necessary to increase the diffusivity to a value of 0·1 to avoid unrealistic oscillations in the solutions.

Figure 12. Schematic representation of the contour lines of two successive solutions of the 2-D test problem. The overlap between the two solutions is indicated by the dark gray area

Figure 13. Schematic representation of the contour lines of the steady-state solutions for two successive positions of the source. The overlap between the solutions is indicated by the dark gray area

Figure 13 sketches the contour lines of the steady-state solution for two successive positions of the source. Because the overlap between the two solutions is small (the dark gray area), it is not possible to construct a good approximation of the solution a_{i+1} by using the previously generated search vectors. This is confirmed in Figure 14, which shows the number of extended GCR iterations for each position of the source. Apparently, the number of iterations decreases slowly compared with Figure 6.

6. CONCLUSIONS

The extended GCR algorithm can be an efficient method for solving non-symmetric systems of equations with multiple right-hand sides. On a single processor, it depends on the problem that is solved whether the extended GCR algorithm is a viable alternative to a direct solver. For instance, when the extended GCR algorithm was used to solve a 3-D convection–diffusion problem on one processor of an SGI Origin200, it was competitive with a direct solver based on LDU-factorization. To be precise, the extended GCR algorithm was 1·8 times faster and required 44 per cent less
memory than the direct solver. However, when the extended GCR algorithm was used to solve the 2-D test problem, it performed worse than the direct solver: the GCR algorithm was 2·6 times slower and required the same amount of memory.

Figure 14. Number of GCR iterations that are required to obtain the steady-state solution for different positions of the source

The extended GCR algorithm is especially suited for parallel computers, because it is relatively easy to implement, and it has the potential to scale well. For example, when the extended GCR algorithm was run on a Cray T3E, the solution time of the 3-D test problem decreased by a factor 6·7 when 8 times as many processors were used. When the extended GCR algorithm was used to solve the 2-D test problem, however, the solution time decreased only by a factor 8·4 when 12 times as many processors were used. In this case the inter-processor communication constituted a significant fraction of the total solution time.

Although the extended GCR algorithm requires more memory than a conventional iterative solver, such as GMRES, it can be significantly faster. When GMRES was used to solve the two test problems on the Origin200, it took 2·3 times longer for the 2-D test problem, and 3·0 times longer for the 3-D test problem. In some specific cases, however, the extended GCR algorithm will not perform much better than a conventional iterative solver. This was confirmed by applying the extended GCR algorithm to another, specially designed, convection–diffusion problem.

APPENDIX

An efficient orthogonalization procedure for parallel computers

An important part of the GCR algorithm (and of many other iterative solution algorithms) is the orthogonalization of a new search vector v̂^j with respect to the previous, orthonormal, search vectors v^k, k = 1, …, j − 1.
This is usually done with the modified Gram–Schmidt procedure:

    for k = 1, …, j − 1
        β = ⟨v̂^j, v^k⟩
        v̂^j = v̂^j − β v^k
    endfor
    v^j = v̂^j

where v^j denotes the new, orthogonal, search vector. Note that in the GCR algorithm, this vector must also be normalized. For simplicity's sake, we will disregard the normalization step.

While the modified Gram–Schmidt procedure is numerically stable, it performs poorly on a parallel computer because the j − 1 scalar products depend on each other; that is, the k-th scalar product cannot be computed before the (k − 1)-th product has been completed. As a consequence, the modified Gram–Schmidt procedure requires j − 1 global communications between the processors. An alternative procedure is classical Gram–Schmidt:

    for k = 1, …, j − 1
        β_k = ⟨v̂^j, v^k⟩
    endfor
    for k = 1, …, j − 1
        v̂^j = v̂^j − β_k v^k
    endfor
    v^j = v̂^j

Since the scalar products in the first part of the algorithm are independent of each other, this procedure requires only one global communication. However, classical Gram–Schmidt is not numerically stable. To improve its stability, it is possible to apply classical Gram–Schmidt more than once, as was shown by Bjorck.7

In this research, we adopted another orthogonalization procedure that is at least as stable as modified Gram–Schmidt and requires only two, maybe three, global communications. This procedure, which is also briefly mentioned by Bjorck, makes use of the Conjugate Gradient method. To explain the procedure, we write the new, orthogonalized, search vector as

    v^j = v̂^j − V^j q^j

where V^j is a rectangular matrix. It has j − 1 columns which are equal to the previous, orthonormal, search vectors v^k, k = 1, …, j − 1.
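The two Gram–Schmidt variants above can be sketched in NumPy as follows (hypothetical function names, for illustration only; on a distributed-memory machine each scalar product implies a global reduction):

```python
import numpy as np

def modified_gram_schmidt(v_hat, V):
    """j-1 dependent scalar products: each product needs the updated vector,
    hence j-1 separate global communications on a parallel machine."""
    v = v_hat.copy()
    for vk in V:
        v -= (v @ vk) * vk
    return v

def classical_gram_schmidt(v_hat, V):
    """All scalar products use the unmodified v_hat, so they are independent
    and can be combined into one global communication (but less stable)."""
    v = v_hat.copy()
    betas = [v_hat @ vk for vk in V]   # computable in a single batched reduction
    for beta, vk in zip(betas, V):
        v -= beta * vk
    return v
```

When the stored vectors are (nearly) orthonormal the two variants agree closely; applying classical Gram–Schmidt a second time (reorthogonalization) recovers the stability of the modified variant, as noted in the text.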
The vector q^j has length j − 1 and is the solution of the following least-squares problem:

    min_{q^j} ‖v̂^j − V^j q^j‖

The vector q^j can be computed efficiently by applying the Conjugate Gradient (CG) method to the system of normal equations

    (V^j)ᵀ V^j q^j = (V^j)ᵀ v̂^j

Since the columns of the matrix V^j are (nearly) orthonormal, the coefficient matrix (V^j)ᵀV^j is very close to the identity matrix. Consequently, only a very few CG iterations are required to compute a sufficiently accurate solution.

Within each iteration, the CG method requires a matrix–vector product of the form y = (V^j)ᵀV^j x, which can be performed efficiently in two steps. First, compute z = V^j x, then compute y = (V^j)ᵀ z. A practical implementation of these two steps is given below:

    z = 0
    for k = 1, …, j − 1
        z = z + x_k v^k
    endfor
    for k = 1, …, j − 1
        y_k = ⟨v^k, z⟩
    endfor

where x_k and y_k denote the k-th component of x and y, respectively. The first step requires no communication between the processors. The second step requires only one global communication, because all the scalar products are independent of each other.

ACKNOWLEDGEMENTS

I would like to thank Rene de Borst for reading an early version of this paper, and for giving valuable feedback. I would also like to thank the centre for High Performance Applied Computing at the Delft University of Technology for providing computation time on their Cray T3E.

REFERENCES

1. C. Farhat, L. Crivelli and F. X. Roux, 'Extending substructure based iterative solvers to multiple load and repeated analyses', Comput. Meth. Appl. Mech. Engng., 117, 195–209 (1994).
2. S. C. Eisenstat, H. C. Elman and M. H. Schultz, 'Variational iterative methods for nonsymmetric systems of linear equations', SIAM J. Numer. Anal., 20(2), 345–357 (1983).
3. Y. Saad and M. H.
Schultz, 'GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems', SIAM J. Sci. Statist. Comput., 7(3), 856–869 (1986).
4. Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company, Boston, 1996.
5. B. Smith, P. Bjørstad and W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, Cambridge, 1996.
6. E. J. Lingen, 'Design and implementation of a finite element package for parallel computers', in Proc. 2nd Int. DIANA Conf. Computational Mechanics, Amsterdam, The Netherlands, June 1997.
7. A. Bjorck, 'Numerics of Gram-Schmidt orthogonalization', Linear Algebra Appl., 197–198, 279–316 (1994).
