# Extensions to pedigree analysis. IV. Covariance components models for multivariate traits

American Journal of Medical Genetics 14:513-524 (1983)

Kenneth Lange and Michael Boehnke

Department of Biomathematics, University of California, Los Angeles

Often more than one quantitative trait is measured on each person in a set of pedigrees. The present paper suggests a class of covariance components models that will allow investigators to explore the genetic and environmental relationships between two quantitative traits. The theoretical framework for the models is given and criticized. We also discuss specific maximum likelihood methods for parameter estimation and hypothesis testing.

Key words: pedigree analysis, variance components, maximum likelihood

## INTRODUCTION

Human geneticists often gather pedigree data on multivariate quantitative traits. Such data present novel problems of analysis that do not arise for univariate traits. For instance, if one looks at systolic and diastolic blood pressure levels, it is natural to ask whether these two traits are under the same genetic and environmental control. The purpose of this paper is to present a framework for answering such questions. Other authors have approached problems of multivariate data by discriminant analysis [Namboodiri et al, 1975; Elston et al, 1976], factor analysis [Martin and Eaves, 1977], and path analysis [Moll et al, 1978; Colletto et al, 1981]. There is also ample precedent in the animal-breeding literature for covariance analysis of multivariate traits [Henderson, 1953; Schaeffer et al, 1978]. Our aim is to discuss a class of models that is amenable to maximum likelihood methods for pedigrees of moderate size. In our view, competing methods of analysis are not sufficiently flexible to handle simultaneously the lack of balance and the rich correlational architecture of pedigree data.
To implement maximum likelihood we advocate the scoring algorithm, although other methods are clearly feasible.

Received for publication February 19, 1982; revision received May 2, 1982. Address reprint requests to Kenneth Lange, Department of Biomathematics, School of Medicine, University of California, Los Angeles, CA 90024. © 1983 Alan R. Liss, Inc.

Our presentation will be partly expository. To fix definitions we explain in a concise way the origin of the covariance components that enter into our models. This material is classical and is summarized in a more general setting by Kempthorne [1969]; see also the historical references [Hazel, 1943; Kempthorne, 1954, 1955; Reeve, 1955; Mode and Robinson, 1959; Robertson, 1959]. For the most part we will illustrate our arguments by referring to bivariate traits. This eases the notational burden somewhat without sacrificing any real generality. Also, what we present is apt to have the greatest value for the analysis of bivariate traits. To avoid any misunderstanding, we do not think these models can do justice to the complexities of human behavior data. We have in mind the analysis of biological traits that are unlikely to involve assortative mating or cultural inheritance.

## METHOD

### Covariances Between Relatives

Consider two different quantitative traits $X_i$ and $Y_i$ defined for each person $i$ in a pedigree. Assume that $X_i$ and $Y_i$ are both single locus traits unaffected by the surrounding environment. Now let $j$ be a second person in the pedigree. We do not exclude the case $i = j$. If the underlying loci are distinct and are in Hardy-Weinberg and linkage equilibrium, then $\mathrm{Cov}(X_i, Y_j) = 0$. If the loci coincide, it is still possible to specify this covariance. Suppose that the common locus is autosomal and that neither $i$ nor $j$ is inbred.
Then it is well known that

$$\mathrm{Cov}(X_i, X_j) = 2\Phi_{ij}\,\sigma_{aXX} + \Delta_{ij}\,\sigma_{dXX}, \tag{1}$$

where $\Phi_{ij}$ is the kinship coefficient of $i$ and $j$, $\Delta_{ij}$ is Jacquard's [1974] condensed coefficient of identity $\Delta_7$, $\sigma_{aXX}$ is the additive genetic variance of the X trait, and $\sigma_{dXX}$ is the dominance genetic variance of the X trait [Kempthorne, 1969; Jacquard, 1974; Crow and Kimura, 1970]. To compute $\mathrm{Cov}(X_i, Y_j)$ consider the artificial trait $Z_i = X_i + Y_i$. Since

$$\mathrm{Cov}(Z_i, Z_j) = \mathrm{Cov}(X_i, X_j) + \mathrm{Cov}(X_i, Y_j) + \mathrm{Cov}(Y_i, X_j) + \mathrm{Cov}(Y_i, Y_j), \tag{2}$$

and since by symmetry $\mathrm{Cov}(X_i, Y_j) = \mathrm{Cov}(Y_i, X_j)$, rearrangement of (2) with substitution from (1) yields

$$\mathrm{Cov}(X_i, Y_j) = 2\Phi_{ij}\,\tfrac{1}{2}(\sigma_{aZZ} - \sigma_{aXX} - \sigma_{aYY}) + \Delta_{ij}\,\tfrac{1}{2}(\sigma_{dZZ} - \sigma_{dXX} - \sigma_{dYY}).$$

It is natural to define the additive cross covariance $\sigma_{aXY} = \tfrac{1}{2}(\sigma_{aZZ} - \sigma_{aXX} - \sigma_{aYY})$ and the dominance cross covariance $\sigma_{dXY} = \tfrac{1}{2}(\sigma_{dZZ} - \sigma_{dXX} - \sigma_{dYY})$. With these new covariance components, $\mathrm{Cov}(X_i, Y_j)$ has the same form as (1).

For each pedigree it is convenient to collect the coefficients $\Phi_{ij}$ and $\Delta_{ij}$ into matrices $\Phi$ and $\Delta$. Kempthorne [1969] discusses an algorithm for computing the entries of $\Phi$ that is superior in the present context to the path tracing algorithm of Wright [1922]. From $\Phi$ the matrix $\Delta$ can be computed by a formula of Cotterman [Crow and Kimura, 1970].

Next consider a pedigree of $p$ members. Assume the X and Y traits are determined by the same locus and define the random column vector $W = (X_1, \ldots, X_p, Y_1, \ldots, Y_p)^t$, where the superscript $t$ denotes vector or matrix transpose. We can now write an expression for the covariance matrix $\mathrm{Cov}(W)$ using partitioned matrices [Rao, 1973]:

$$\mathrm{Cov}(W) = \sigma_{aXX}\begin{pmatrix} 2\Phi & 0 \\ 0 & 0 \end{pmatrix} + \sigma_{aXY}\begin{pmatrix} 0 & 2\Phi \\ 2\Phi & 0 \end{pmatrix} + \sigma_{aYY}\begin{pmatrix} 0 & 0 \\ 0 & 2\Phi \end{pmatrix} + \sigma_{dXX}\begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix} + \sigma_{dXY}\begin{pmatrix} 0 & \Delta \\ \Delta & 0 \end{pmatrix} + \sigma_{dYY}\begin{pmatrix} 0 & 0 \\ 0 & \Delta \end{pmatrix}. \tag{3}$$

The "0" blocks in (3) represent $p$ by $p$ matrices whose entries are all zero. The expression (3) extends to the summed contributions $S = \sum_i W_i$ from several different loci, provided the depicted covariance components are also the appropriate summed components.

### Environmental Decompositions

Each environmental decomposition will correspond to a unique partition of the pedigree into "spheres of environmental influence." Naturally, there are several ways of partitioning any pedigree.
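As a concrete numerical illustration of the partitioned genetic covariance matrix (3), consider a non-inbred parent-offspring pair, for which the kinship coefficients are $\Phi_{ii} = 1/2$ and $\Phi_{ij} = 1/4$ while $\Delta_7$ reduces to the identity matrix. The sketch below is ours, not the authors'; it assumes NumPy, and the covariance component values are hypothetical.

```python
import numpy as np

# Kinship matrix Phi and Jacquard Delta7 matrix for a non-inbred
# parent-offspring pair: Phi_ii = 1/2, Phi_ij = 1/4; Delta7_ii = 1,
# Delta7_ij = 0 (a parent and child never share both genes IBD).
Phi = np.array([[0.5, 0.25],
                [0.25, 0.5]])
Delta7 = np.eye(2)

# Hypothetical covariance components for the X and Y traits.
s_axx, s_ayy, s_axy = 1.0, 2.0, 0.5   # additive
s_dxx, s_dyy, s_dxy = 0.3, 0.4, 0.1   # dominance

def genetic_cov(Phi, Delta7, s_axx, s_axy, s_ayy, s_dxx, s_dxy, s_dyy):
    """Partitioned genetic covariance matrix (3) for W = (X, Y)^t."""
    G = lambda sa, sd: 2 * Phi * sa + Delta7 * sd
    return np.block([[G(s_axx, s_dxx), G(s_axy, s_dxy)],
                     [G(s_axy, s_dxy), G(s_ayy, s_dyy)]])

CovW = genetic_cov(Phi, Delta7, s_axx, s_axy, s_ayy, s_dxx, s_dxy, s_dyy)
```

Each diagonal block is the univariate formula (1) applied matrix-wise, and the off-diagonal blocks carry the cross covariances, so `CovW` is symmetric by construction.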
One possibility is the partition whose blocks all contain just one individual. Another possibility is the partition whose blocks consist of households within the pedigree. A subpartition of this second partition can be realized by dividing the household blocks into separate adult and child blocks. To model maternal effects one can form blocks by lumping all children with the same mother, regardless of whether they are full sibs or belong to the same household.

Suppose now that a partition of the pedigree is given. Let all the people within a block of this partition be exposed to the same environment, and let the environmental contributions among blocks be independent and identically distributed. If $V$ is the random vector of contributions and $B$ is the $p$ by $p$ symmetric matrix whose entry $b_{ij}$ equals 1 if $i$ and $j$ belong to the same block and equals 0 otherwise, then

$$\mathrm{Cov}(V) = \sigma_{eXX}\begin{pmatrix} B & 0 \\ 0 & 0 \end{pmatrix} + \sigma_{eXY}\begin{pmatrix} 0 & B \\ B & 0 \end{pmatrix} + \sigma_{eYY}\begin{pmatrix} 0 & 0 \\ 0 & B \end{pmatrix}, \tag{4}$$

where $\sigma_{eXX}$ and $\sigma_{eYY}$ represent the variances of the environmental contributions to the X and Y traits of each person in the pedigree and $\sigma_{eXY}$ represents the corresponding covariance.

Under a completely additive model, the trait values $Z$ from a pedigree can be represented as $Z = S + V_1 + \cdots + V_d$, where $S$ is the summed genetic contribution from many loci and $V_k$, $1 \le k \le d$, is the environmental contribution from the $k$th of $d$ environmental decompositions. If $S$ and the $V_k$ are independent, then the covariance $\Omega$ of $Z$ satisfies $\Omega = \mathrm{Cov}(S) + \mathrm{Cov}(V_1) + \cdots + \mathrm{Cov}(V_d)$. With appropriate relabeling, the representations (3) and (4) then imply

$$\Omega = \sum_k \sigma_k \Omega_k,$$

where the $\sigma_k$ are covariance components and the $\Omega_k$ are symmetric matrices. In the bivariate case, there are three additive components, three dominance components, and three environmental components for each environmental decomposition.

### Mean Components

One may choose to ignore the issue of mean components in the sense of doing all regression prior to covariance components analysis.
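The environmental contribution (4) and the assembly $\Omega = \sum_k \sigma_k \Omega_k$ can be sketched in the same style. This is again our own illustration, assuming NumPy; the four-person pedigree split into two households and the component values are hypothetical.

```python
import numpy as np

# A hypothetical pedigree of p = 4 people partitioned into households
# {1, 2} and {3, 4}; b_ij = 1 exactly when i and j share a block.
blocks = [0, 0, 1, 1]                       # block label for each person
B = np.equal.outer(blocks, blocks).astype(float)

# The three environmental pieces of (4) for the stacked vector (X, Y)^t.
Z4 = np.zeros((4, 4))
Omega_exx = np.block([[B, Z4], [Z4, Z4]])
Omega_exy = np.block([[Z4, B], [B, Z4]])
Omega_eyy = np.block([[Z4, Z4], [Z4, B]])

# Omega = sum_k sigma_k Omega_k with hypothetical component values.
sigma = {'exx': 1.0, 'exy': 0.2, 'eyy': 0.8}
Omega = (sigma['exx'] * Omega_exx
         + sigma['exy'] * Omega_exy
         + sigma['eyy'] * Omega_eyy)
```

In a full model the genetic pieces of (3) would simply be appended to the same sum, one symmetric matrix $\Omega_k$ per covariance component $\sigma_k$.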
In other words, regression can be performed assuming the values of different individuals are uncorrelated. In many practical problems this two-stage approach should have little effect on the estimates of either mean or covariance components. But for the sake of completeness, let us show how estimation of mean components fits into our proposed model.

It is convenient to express the vector of trait means within a pedigree as $A\mu$. The design matrix $A$ depends on information about pedigree members other than their trait values, and the vector $\mu$ of mean components consists of $r$ parameters to be estimated. For instance, $\mu$ might be $(\mu_{fx}, \mu_{mx}, \alpha_x, \mu_{fy}, \mu_{my}, \alpha_y)^t$, where $\mu_{fx}$ is the female mean of the X trait, $\mu_{mx}$ is the male mean of the X trait, $\alpha_x$ is the regression coefficient of the X trait on age, and $\mu_{fy}$, $\mu_{my}$, and $\alpha_y$ are defined similarly for the Y trait. A sample row of $A$ then might be $(0, 0, 0, 1, 0, \mathrm{age})$. This row would correspond to the Y trait of a female with the given age.

### Maximum Likelihood Estimation

So far we have shown how the trait values over a pedigree can be summarized in a covariance components model. In many situations it is plausible to assume that the vector of trait values $Z$ is normally distributed. The appendix contains a terse discussion of a multivariate central limit theorem for a purely genetic trait over a pedigree. Under the further assumption of normality, it is possible to give a brief outline of the scoring algorithm [Rao, 1973] for the maximum likelihood estimation of the mean components $\mu = (\mu_1, \ldots, \mu_r)^t$ and the covariance components $\sigma = (\sigma_1, \ldots, \sigma_s)^t$. Lange et al [1976] and Jennrich and Sampson [1976] provide a detailed derivation. Let $L$ be the log likelihood of the observed trait values from a pedigree. Then, ignoring an irrelevant constant,

$$L = -\tfrac{1}{2}\ln|\Omega| - \tfrac{1}{2}(Z - A\mu)^t\,\Omega^{-1}(Z - A\mu),$$

where $|\Omega|$ is the determinant of $\Omega$.
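The log likelihood above is straightforward to evaluate numerically. The sketch below is ours, assuming NumPy; the design matrix, mean components, and covariance matrix are hypothetical. It uses a log-determinant and a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: two people, bivariate trait, mean components
# mu = (mu_x, alpha_x, mu_y)^t with an age regression on X only.
A = np.array([[1.0, 30.0, 0.0],    # X trait of a 30-year-old
              [1.0, 45.0, 0.0],    # X trait of a 45-year-old
              [0.0,  0.0, 1.0],    # Y trait, person 1
              [0.0,  0.0, 1.0]])   # Y trait, person 2
mu = np.array([100.0, 0.5, 60.0])
Omega = np.diag([4.0, 4.0, 2.0, 2.0])        # assumed covariance matrix
Z = A @ mu + rng.multivariate_normal(np.zeros(4), Omega)

def log_lik(Z, A, mu, Omega):
    """L = -1/2 ln|Omega| - 1/2 (Z - A mu)^t Omega^{-1} (Z - A mu)."""
    r = Z - A @ mu
    sign, logdet = np.linalg.slogdet(Omega)
    return -0.5 * logdet - 0.5 * r @ np.linalg.solve(Omega, r)

L_hat = log_lik(Z, A, mu, Omega)
```

When the residual vector vanishes, the quadratic form drops out and only the $-\tfrac{1}{2}\ln|\Omega|$ term remains, a convenient sanity check.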
If $\mathrm{tr}$ denotes matrix trace, the score vector and information matrix entries are

$$\frac{\partial L}{\partial \mu_k} = \left[A^t\Omega^{-1}(Z - A\mu)\right]_k,$$

$$\frac{\partial L}{\partial \sigma_k} = -\tfrac{1}{2}\,\mathrm{tr}(\Omega^{-1}\Omega_k) + \tfrac{1}{2}(Z - A\mu)^t\,\Omega^{-1}\Omega_k\Omega^{-1}(Z - A\mu),$$

$$E\left(-\frac{\partial^2 L}{\partial \mu_k\,\partial \mu_l}\right) = \left[A^t\Omega^{-1}A\right]_{kl},$$

$$E\left(-\frac{\partial^2 L}{\partial \sigma_k\,\partial \sigma_l}\right) = \tfrac{1}{2}\,\mathrm{tr}(\Omega^{-1}\Omega_k\Omega^{-1}\Omega_l).$$

Let $\partial L/\partial \mu$ denote the column vector with entries $\partial L/\partial \mu_k$, and similarly for $\partial L/\partial \sigma$. Since $\partial(A\mu)/\partial \mu_k = A(\delta_{k1}, \ldots, \delta_{kr})^t$, with $\delta_{kj}$ the Kronecker delta, $E(\partial^2 L/\partial \mu_k\,\partial \sigma_l) = 0$, and the information matrix is block diagonal with blocks $E(-\partial^2 L/\partial \mu\,\partial \mu^t)$ and $E(-\partial^2 L/\partial \sigma\,\partial \sigma^t)$.

If there are $n$ independent pedigrees, let $L_k$ be the log likelihood for the $k$th pedigree. Then the scoring algorithm updates $\mu$ and $\sigma$ by adding the increments

$$\Delta\mu = \left[\sum_{k=1}^{n} E\left(-\frac{\partial^2 L_k}{\partial \mu\,\partial \mu^t}\right)\right]^{-1} \sum_{k=1}^{n} \frac{\partial L_k}{\partial \mu}, \qquad \Delta\sigma = \left[\sum_{k=1}^{n} E\left(-\frac{\partial^2 L_k}{\partial \sigma\,\partial \sigma^t}\right)\right]^{-1} \sum_{k=1}^{n} \frac{\partial L_k}{\partial \sigma}.$$

Thus, the scoring algorithm is straightforward to implement. The major drawbacks are the matrix inversion $\Omega^{-1}$ and the matrix multiplications $\Omega^{-1}\Omega_k$ that must be done for each pedigree at each iteration.

The scoring algorithm enjoys two clear advantages over alternatives such as the Newton-Raphson method of maximizing the likelihood. First, if combined with step halving, scoring always leads to an increase in the likelihood [Spence et al, 1977]. Second, there are in most cases good starting values that produce least squares estimates of the covariance components in just one iteration [Spence et al, 1977]. The second advantage is possible whenever the initial covariance matrix $\Omega$ for each pedigree can be taken as the identity matrix. (The proof in the appendix of Spence et al [1977] continues to hold.) For instance, to start with the identity matrix, assume all trait variation is determined by the environmental partition whose blocks are individuals. In the bivariate case take the two trait contributions to have unit variance and zero covariance.

### Hypothesis Testing

Once maximum likelihood estimates are available, one can test hypotheses of interest by the likelihood ratio criterion. This procedure and genetic counseling procedures are discussed in detail in Lange et al [1976]. Let us add that for bivariate traits some new hypotheses arise. Thus, one might wish to test $\sigma_{aXY} = 0$ or $\sigma_{dXY} = 0$.
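A single scoring iteration, built from the score vectors and expected information blocks derived above, can be sketched as follows. This is our own minimal illustration, assuming NumPy; the toy data at the end are hypothetical.

```python
import numpy as np

def scoring_step(Z, A, mu, sigma, Omegas):
    """One scoring update for (mu, sigma) in Omega = sum_k sigma_k Omega_k."""
    Omega = sum(s * Ok for s, Ok in zip(sigma, Omegas))
    Oinv = np.linalg.inv(Omega)
    r = Z - A @ mu
    # Score vector and expected information for the mean components.
    score_mu = A.T @ Oinv @ r
    info_mu = A.T @ Oinv @ A
    # Score vector and expected information for the covariance components.
    M = [Oinv @ Ok for Ok in Omegas]
    score_s = np.array([-0.5 * np.trace(Mk) + 0.5 * r @ Oinv @ Ok @ Oinv @ r
                        for Mk, Ok in zip(M,Omegas)])
    info_s = np.array([[0.5 * np.trace(Mk @ Ml) for Ml in M] for Mk in M])
    return (mu + np.linalg.solve(info_mu, score_mu),
            np.array(sigma) + np.linalg.solve(info_s, score_s))

# Toy check with a single variance component, Omega_1 = I: one scoring
# step from mu = 0, sigma = 1 reproduces the sample mean for mu.
A = np.array([[1.0], [1.0], [1.0]])
Z = np.array([1.0, 2.0, 3.0])
mu1, sigma1 = scoring_step(Z, A, np.array([0.0]), (1.0,), [np.eye(3)])
```

With several pedigrees, the score vectors and information blocks would be summed over pedigrees before the two linear solves, exactly as in the increment formulas above.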
If both $\sigma_{aXY} = 0$ and $\sigma_{dXY} = 0$ are accepted, then one might tentatively conclude that the X and Y traits are under the control of different genes. (Note that $\sigma_{aXY} = \sum_k \sigma_{kaXY} = 0$ does not necessarily imply that each $\sigma_{kaXY} = 0$, where $k$ ranges over different loci.)

To test the overall goodness of fit of a model one can proceed in two ways. In the univariate case, Spence et al [1977] suggest forming standardized residuals. This is done for a given individual by predicting his trait mean and variance conditional on the trait values of the remaining members of his pedigree. The standardized residual for the individual is then arrived at by subtracting the predicted mean from his actual trait value and dividing the result by the predicted standard deviation. Hopper and Mathews [1982] recommend computing standardized residuals as a way of spotting outlier individuals. Since the standardized residuals within a pedigree will not be independent, they do not offer an exact method of testing the overall goodness of fit of a model. Nonetheless, a normal probability plot of all the standardized residuals would be a valuable visual aid. For bivariate traits, there would naturally be two standardized residuals per individual.

Ott [1979] recommends computing the random vector $\Omega^{-1/2}(Z - A\mu)$ using the estimated mean and covariance components. With the true values for these parameters, the elements of $\Omega^{-1/2}(Z - A\mu)$ would be independent, standard normal random variables. This fact permits rigorous testing of a model. Hopper and Mathews [1982] further note that the statistic

$$(Z - A\mu)^t\,\Omega^{-1}(Z - A\mu) \tag{6}$$

should have approximately a chi-square distribution. It therefore affords a method of spotting outlier pedigrees. The statistic (6) is obviously directly available from the final iteration of the maximum likelihood search. Unfortunately, Hopper and Mathews [1982] also point out that (6) summed over all pedigrees should be equal to the total number of observations in the whole sample.
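Ott's transformation and the pedigree statistic (6) can be sketched numerically. The illustration is ours, assuming NumPy; the fitted means and covariance matrix are hypothetical, and the inverse square root is formed from a symmetric eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fitted values for one pedigree of three observations.
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])
mu_vec = np.array([10.0, 10.0, 10.0])      # fitted means A mu
L = np.linalg.cholesky(Omega)
Z = mu_vec + L @ rng.standard_normal(3)    # simulate Z ~ N(A mu, Omega)

# Ott's transformation: with the true parameter values the entries of
# Omega^{-1/2}(Z - A mu) are independent standard normals.
w, V = np.linalg.eigh(Omega)
Oinv_sqrt = V @ np.diag(w ** -0.5) @ V.T
resid = Oinv_sqrt @ (Z - mu_vec)

# The pedigree statistic (6): (Z - A mu)^t Omega^{-1} (Z - A mu),
# approximately chi-square distributed; large values flag outlier pedigrees.
chi2 = (Z - mu_vec) @ np.linalg.solve(Omega, Z - mu_vec)
```

Note that the statistic (6) equals the squared length of the transformed residual vector, so the two diagnostics are two views of the same quantity.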
The summed statistic, therefore, cannot serve as an overall goodness-of-fit criterion.

### Linear Versus Nonlinear Models

The models discussed so far are linear in both the mean and the covariance components. In path analysis and other branches of statistics, nonlinear models routinely arise. Robert Jennrich has pointed out to us that the scoring algorithm requires only trivial modifications to accommodate nonlinear effects. Suppose the mean vector $A(\mu)$ and the covariance matrix $\Omega(\sigma)$ of a pedigree depend nonlinearly on the mean components $\mu$ and the covariance components $\sigma$. Then in the scoring algorithm the $k$th column of $A$ should be replaced by the vector of partial derivatives $\partial A(\mu)/\partial \mu_k$, and $\Omega_k$ by the matrix of partial derivatives $\partial \Omega(\sigma)/\partial \sigma_k$. The rest of the details of the scoring algorithm remain the same, except that the first iteration does not necessarily yield least-squares estimates of the covariance components.

There is no reason that nonlinear path coefficient models cannot be handled within the present framework. Summarizing pedigree data by a sequence of correlations between various types of relatives can force awkward compromises in statistical analysis. For instance, in computing parent-offspring correlations, larger sibships must somehow be given more weight than smaller ones. It is probably better to view path analysis solely as a theoretical method for generating covariance or correlation matrices. Finally, it would also appear that the maximum likelihood approach outlined here could be used to advantage in purely cultural inheritance models [Feldman and Cavalli-Sforza, 1975].

## DISCUSSION

The models discussed here date back to Fisher [1918] and have undergone modification by two generations of geneticists. The main defect is the blanket assumption of additivity. This assumption excludes interactions among loci and between the genetic and environmental contributions. Also, we have added a normality assumption.
These two assumptions can be at best an approximation to the biological reality of most traits. We like the additivity assumption for three reasons: First, it makes mathematical and statistical analysis possible. Without additivity even the two locus case becomes very complicated [Cockerham, 1956; Denniston, 1975]. Second, the assumption does possess a certain ring of plausibility. Finally, both the additivity and normality assumptions can be checked empirically by assessing the overall goodness of fit of a model.

As mentioned earlier, these models do not allow for assortative mating and cultural inheritance. Our "environmental spheres of influence" do not really mimic cultural inheritance. Cultural inheritance should result in correlations between relatives that gradually diminish as the cultural distance between the relatives increases. The environmental influences that we postulate die off abruptly. Despite these caveats, we think data analysis can proceed productively along the lines sketched. Future publications will illustrate some of the possibilities. In our opinion, there is still a host of medically interesting traits that have received little attention. A covariance components analysis can be an important first step in their understanding.

## ACKNOWLEDGMENTS

We are indebted to Robert Elston, Charles Sing, and Patricia Moll for their useful criticisms of the text. We also wish to thank Robert Jennrich for his help with the scoring algorithms and John Hopper for explaining how to deal with outliers. In part, this research was supported by the University of California, Los Angeles; NIH Research Career Development award K04 HD00307; and USPHS-National Research Service awards GM 07104 and GM 07191.

## REFERENCES

Cockerham CC (1956): Effects of linkage on the covariances between relatives. Genetics 41:138-141.
Colletto GMDD, Krieger H, Magalhães JR (1981): Estimates of the genetic and environmental determinants of serum lipid and lipoprotein concentrations in Brazilian twins. Hum Hered 31:232-237.

Crow JF, Kimura M (1970): "An Introduction to Population Genetics Theory." New York: Harper and Row, pp 115-141.

Denniston C (1975): Probability and genetic relationships: Two loci. Ann Hum Genet 39:89-104.

Elston RC, Graham JB, Miller CH, Reisner HM, Bouma BN (1976): Probabilistic classification of hemophilia A carriers by discriminant analysis. Thromb Res 8:683-695.

Feldman MW, Cavalli-Sforza LL (1975): Models for cultural inheritance: A general linear model. Ann Hum Biol 2:215-226.

Feller W (1971): "An Introduction to Probability Theory and Its Applications." 2nd Ed. New York: John Wiley and Sons, Vol 2, pp 518-521.

Fisher RA (1918): The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinburgh 52:399-433.

Hazel LN (1943): The genetic basis for constructing selection indexes. Genetics 28:476-490.

Henderson CR (1953): Estimation of variance and covariance components. Biometrics 9:226-252.

Hopper JL, Mathews JD (1982): Extensions to multivariate normal models for pedigree analysis. Ann Hum Genet 46:373-383.

Jacquard A (1974): "The Genetic Structure of Populations." New York: Springer-Verlag.

Jennrich RI, Sampson PF (1976): Newton-Raphson and related algorithms for maximum likelihood variance component estimation. Technometrics 18:11-17.

Kempthorne O (1954): The correlation between relatives in a random mating population. Proc R Soc London Ser B 143:103-113.

Kempthorne O (1955): The theoretical values of correlations between relatives in random mating populations. Genetics 40:153-167.

Kempthorne O (1969): "An Introduction to Genetic Statistics." Ames, Iowa: Iowa State University Press, pp 74-77, 264-267.

Lange K (1978): Central limit theorems for pedigrees. J Math Biol 6:59-66.
Lange K, Westlake J, Spence MA (1976): Extensions to pedigree analysis III. Variance components by the scoring method. Ann Hum Genet 39:485-491.

Martin NG, Eaves LJ (1977): The genetical analysis of covariance structure. Heredity 38:79-95.

Mode CJ, Robinson HF (1959): Pleiotropism and the genetic variance and covariance. Biometrics 15:518-537.

Moll PP, Sing CF, Brewer GJ, Gilroy TE (1978): Multivariate analysis of the genetic effects of red cell glycolysis. In Brewer GJ (ed): "The Red Cell." New York: Alan R. Liss, Inc.

Namboodiri KK, Elston RC, Glueck CJ, Fallat R, Buncher CR, Tsang R (1975): Bivariate analysis of cholesterol and triglyceride levels in families in which probands have type IIb lipoprotein phenotype. Am J Hum Genet 27:454-471.

Orey S (1958): A central limit theorem for m-dependent random variables. Duke Math J 25:543-546.

Ott J (1979): Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees. Am J Hum Genet 31:161-175.

Rao CR (1973): "Linear Statistical Inference and Its Applications." 2nd Ed. New York: John Wiley and Sons, pp 17, 128-129, 366-374.

Reeve ECR (1955): The variance of the genetic correlation coefficient. Biometrics 11:357-374.

Robertson A (1959): The sampling variance of the genetic correlation coefficient. Biometrics 15:469-485.

Schaeffer LR, Wilton JW, Thompson R (1978): Simultaneous estimation of variance and covariance components from multitrait mixed model equations. Biometrics 34:199-208.

Spence MA, Westlake J, Lange K (1977): Estimation of the variance components for dermal ridge count. Ann Hum Genet 41:111-115.

Wright S (1922): Coefficients of inbreeding and relationship. Am Nat 56:330-338.

## APPENDIX

Let us imagine a potentially infinite number of pairs of homologous autosomes and consider a finite number of loci on each pair. In fact, to prevent clustering of loci, we impose an upper bound $m$ on the number of loci possible for each pair.
Now number the loci starting with those on the first pair, then proceeding with those on the second pair, and so forth. As before, assume all loci are in Hardy-Weinberg and linkage equilibrium and let $W_k$ represent the contribution of the $k$th locus to a trait defined over a pedigree. The trait may be univariate or multivariate. Let $W_k$ have $r$ components. Next, define for each $n \ge 1$ two random vectors $S_n$ and $\tilde{S}_n$: $S_n = W_1 + \cdots + W_n$, and $\tilde{S}_n$ is the same as $S_n$ except that each component of $\tilde{S}_n$ is standardized to have mean 0 and variance 1. Under certain conditions $\tilde{S}_n$ will follow an approximate multivariate normal distribution for $n$ large. The relevant requirements for asymptotic normality are: 1) $W_1, W_2, \ldots$ is an $m$-dependent sequence, ie, $W_k$ is independent of $W_j$ whenever $|k - j| > m$; 2) each component sequence of the vector sequence $W_1, W_2, \ldots$ satisfies Lindeberg's condition [Feller, 1971]; and 3) the correlation matrices $\mathrm{Cov}(\tilde{S}_n)$ tend to some limit $\Gamma$ as $n \to \infty$. Lange [1978] proves a central limit theorem under less general assumptions.

Let us discuss these conditions in order. $W_1, W_2, \ldots$ is $m$-dependent because loci $k$ and $j$ are on different chromosomes when $|k - j| > m$. To discuss the second and third conditions let us again assume that our trait is bivariate. For the $k$th locus define the covariance components $\sigma_{kaXX}$, $\sigma_{kaXY}$, and so forth. Lindeberg's condition then implies

$$\lim_{n \to \infty} \max_{1 \le k \le n} \frac{\sigma_{kaXX} + \sigma_{kdXX}}{\sum_{j=1}^{n} (\sigma_{jaXX} + \sigma_{jdXX})} = 0,$$

and likewise for the Y trait. Conversely, Lindeberg's condition is satisfied if the $W_k$ are uniformly bounded and

$$\lim_{n \to \infty} \sum_{k=1}^{n} (\sigma_{kaXX} + \sigma_{kdXX}) = \infty,$$

and likewise for the Y trait.

If we examine (3), it is clear that a sufficient condition for the existence of $\lim_{n \to \infty} \mathrm{Cov}(\tilde{S}_n)$ is the existence of the following limits:

$$\lim_{n \to \infty} \frac{\sum_{k=1}^{n} \sigma_{kaXX}}{\sum_{k=1}^{n} (\sigma_{kaXX} + \sigma_{kdXX})} = \rho_{aXX}, \qquad \lim_{n \to \infty} \frac{\sum_{k=1}^{n} \sigma_{kaYY}}{\sum_{k=1}^{n} (\sigma_{kaYY} + \sigma_{kdYY})} = \rho_{aYY},$$

$$\lim_{n \to \infty} \frac{\sum_{k=1}^{n} \sigma_{kaXY}}{\left[\sum_{k=1}^{n} (\sigma_{kaXX} + \sigma_{kdXX})\right]^{1/2} \left[\sum_{k=1}^{n} (\sigma_{kaYY} + \sigma_{kdYY})\right]^{1/2}} = \rho_{aXY},$$

together with the corresponding limit $\rho_{dXY}$ for the dominance cross covariances. The first two limits insure the existence of the limits $\rho_{dXX} = 1 - \rho_{aXX}$ and $\rho_{dYY} = 1 - \rho_{aYY}$. With this notation, the limiting correlation matrix for $\mathrm{Cov}(\tilde{S}_n)$ is given by (3) provided one substitutes the above $\rho$'s for the $\sigma$'s.
To prove approximate multivariate normality we can assume without loss of generality that each $W_k$ has mean 0. It then suffices to show that for every constant vector $u$, $u^t\tilde{S}_n$ tends in distribution to a univariate normal distribution with mean zero and variance $u^t\Gamma u$ [Rao, 1973]. Suppose first that $u^t\Gamma u = 0$. Our assertion is then obvious because convergence in mean square to zero implies convergence in distribution to zero. Next suppose $u^t\Gamma u > 0$. Let $W_{kj}$ and $S_{nj}$ be the $j$th components of $W_k$ and $S_n$, respectively. Define the random variables

$$U_{nkj} = \frac{W_{kj}}{\mathrm{Var}(S_{nj})^{1/2}}, \qquad T_{nk} = \sum_{j=1}^{r} u_j U_{nkj},$$

for each $n \ge 1$, $1 \le k \le n$, and $1 \le j \le r$. The random variables $T_{nk}$ form a triangular array whose rows have uncorrelated entries with the $m$-dependence property. Furthermore,

$$\sum_{k=1}^{n} T_{nk} = u^t\tilde{S}_n.$$

According to a result of Orey [1958], it suffices to prove that the triangular array $\{T_{nk}\}$, $n \ge 1$, $1 \le k \le n$, satisfies Lindeberg's condition. Thus for each $\epsilon > 0$ we must prove

$$\lim_{n \to \infty} \sum_{k=1}^{n} \int_{\{|T_{nk}| > \epsilon\}} T_{nk}^2 \, dP = 0.$$

If we apply the Cauchy-Schwarz inequality to each $|T_{nk}|$, we can reduce our problem to showing

$$\lim_{n \to \infty} \sum_{k=1}^{n} \int_{A_{nk}} U_{nkj}^2 \, dP = 0$$

for each $j$, where $A_{nk} = \{\sum_{l=1}^{r} U_{nkl}^2 > \epsilon^*\}$ with $\epsilon^* = \epsilon^2 / \|u\|^2$. Now note that $A_{nk} \subset \bigcup_{l=1}^{r} \{U_{nkl}^2 > \epsilon^*/r\}$, with

$$\int_{\{U_{nkl}^2 > \epsilon^*/r\}} U_{nkj}^2 \, dP \le \int_{\{U_{nkj}^2 > \epsilon^*/r\}} U_{nkj}^2 \, dP + \int_{\{U_{nkl}^2 > \epsilon^*/r\}} U_{nkl}^2 \, dP.$$

Since each integral on the right is of Lindeberg type, we need only invoke the Lindeberg hypothesis for each component sequence to complete the proof.

Edited by John M. Opitz
