American Journal of Medical Genetics 14:513-524 (1983)
Extensions to Pedigree Analysis.
IV. Covariance Components Models for
Multivariate Traits
Kenneth Lange and Michael Boehnke
Department of Biomathematics, University of California, Los Angeles
Often more than one quantitative trait is measured on each person in a set of
pedigrees. The present paper suggests a class of covariance components models
that will allow investigators to explore the genetic and environmental relationships
between two quantitative traits. The theoretical framework for the models is given
and criticized. We also discuss specific maximum likelihood methods for parameter estimation and hypothesis testing.
Key words: pedigree analysis, variance components, maximum likelihood
Human geneticists often gather pedigree data on multivariate quantitative traits.
Such data present novel problems of analysis that do not arise for univariate traits.
For instance, if one looks at systolic and diastolic blood pressure levels, it is natural
to ask whether these two traits are under the same genetic and environmental control.
The purpose of this paper is to present a framework for answering such questions.
Other authors have approached problems of multivariate data by discriminant analysis
[Namboodiri et al, 1975; Elston et al, 1976], factor analysis [Martin and Eaves,
1977], and path analysis [Moll et al, 1978; Colletto et al, 1981]. There is also ample
precedent in the animal-breeding literature for covariance analysis of multivariate
traits [Henderson, 1953; Schaeffer et al, 1978].
Our aim is to discuss a class of models that is amenable to maximum likelihood
methods for pedigrees of moderate size. In our view, competing methods of analysis
are not sufficiently flexible to handle simultaneously the lack of balance and the rich
correlational architecture of pedigree data. To implement maximum likelihood we
advocate the scoring algorithm, although other methods are clearly feasible.
Received for publication February 19, 1982; revision received May 2, 1982.
Address reprint requests to Kenneth Lange, Department of Biomathematics, School of Medicine,
University of California, Los Angeles, CA 90024.
0148-7299/83/1403-0513$03.50 © 1983 Alan R. Liss, Inc.
Our presentation will be partly expository. To fix definitions we explain in a
concise way the origin of the covariance components that enter into our models. This
material is classical and is summarized in a more general setting by Kempthorne
[1969]; see also the historical references [Hazel, 1943; Kempthorne, 1954, 1955;
Reeve, 1955; Mode and Robinson, 1959; Robertson, 1959]. For the most part we
will illustrate our arguments by referring to bivariate traits. This eases the notational
burden somewhat without sacrificing any real generality. Also, what we present is
apt to have the greatest value for the analysis of bivariate traits.
To avoid any misunderstanding, we do not think these models can do justice to
the complexities of human behavior data. We have in mind the analysis of biological
traits that are unlikely to involve assortative mating or cultural inheritance.
Covariances Between Relatives
Consider two different quantitative traits Xi and Yi defined for each person i in
a pedigree. Assume that Xi and Yi are both single locus traits unaffected by the
surrounding environment. Now let j be a second person in the pedigree. We do not
exclude the case i = j. If the underlying loci are distinct and are in Hardy-Weinberg
and linkage equilibrium, then the covariance is

Cov(X_i, Y_j) = 0.
If the loci coincide, it is still possible to specify this covariance. Suppose that
the common locus is autosomal and that neither i nor j is inbred. Then it is well
known that

Cov(X_i, X_j) = 2Φ_ij σ_axx + Δ_ij σ_dxx,  (1)

where Φ_ij is the kinship coefficient of i and j, Δ_ij is Jacquard's [1974] condensed
coefficient of identity Δ7, σ_axx is the additive genetic variance of the X trait, and σ_dxx
is the dominance genetic variance of the X trait [Kempthorne, 1969; Jacquard, 1974;
Crow and Kimura, 1970]. To compute Cov(X_i, Y_j) consider the artificial trait Z_i =
X_i + Y_i. Since

Cov(Z_i, Z_j) = Cov(X_i, X_j) + Cov(X_i, Y_j) + Cov(Y_i, X_j) + Cov(Y_i, Y_j),  (2)

and since by symmetry

Cov(X_i, Y_j) = Cov(Y_i, X_j),

rearrangement of (2) with substitution from (1) yields

Cov(X_i, Y_j) = 2Φ_ij · ½(σ_azz - σ_axx - σ_ayy) + Δ_ij · ½(σ_dzz - σ_dxx - σ_dyy).
It is natural to define the additive cross covariance

σ_axy = ½(σ_azz - σ_axx - σ_ayy)

and the dominance cross covariance

σ_dxy = ½(σ_dzz - σ_dxx - σ_dyy).

With these new covariance components, Cov(X_i, Y_j) has the same form as (1).
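The identity behind these definitions is easy to check numerically. The sketch below is our own illustration (plain numpy, simulated data, all names ours): it recovers the cross covariance of two correlated traits from the three variances of X, Y, and the artificial trait Z = X + Y.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = 0.5 * x + rng.normal(size=10000)  # two correlated traits
z = x + y                             # artificial trait Z = X + Y

# Cross covariance recovered from three variances:
# Cov(X, Y) = (Var(Z) - Var(X) - Var(Y)) / 2
direct = np.cov(x, y)[0, 1]
recovered = (np.var(z, ddof=1) - np.var(x, ddof=1) - np.var(y, ddof=1)) / 2
assert abs(direct - recovered) < 1e-10
```

The identity holds exactly for sample moments as well, which is why the assertion uses a tight tolerance.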
For each pedigree it is convenient to collect the coefficients Φ_ij and Δ_ij into
matrices Φ and Δ. Kempthorne [1969] discusses an algorithm for computing the
entries of Φ that is superior in the present context to the path tracing algorithm of
Wright [1922]. From Φ the matrix Δ can be computed by a formula of Cotterman
[Crow and Kimura, 1970].
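Kempthorne's algorithm is not reproduced here, but the standard recursion for kinship coefficients, which serves the same purpose, is short enough to sketch. In this hypothetical helper, individuals are assumed to be ordered so parents precede their children, founders are assumed non-inbred and mutually unrelated, and either both parents are known or neither is; Δ would then follow from Φ by Cotterman's formula.

```python
def kinship_matrix(parents):
    """Kinship coefficients Phi for a pedigree.

    parents: list of (father, mother) index pairs, (None, None) for
    founders; members must be ordered so parents precede children.
    """
    n = len(parents)
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        f, m = parents[i]
        # Self-kinship: 1/2 for a non-inbred founder, otherwise
        # (1 + kinship of the two parents) / 2.
        phi[i][i] = 0.5 if f is None else 0.5 * (1.0 + phi[f][m])
        for j in range(i):
            # Kinship with an earlier member: average over i's parents.
            phi[i][j] = 0.0 if f is None else 0.5 * (phi[f][j] + phi[m][j])
            phi[j][i] = phi[i][j]
    return phi

# Two unrelated founders (0, 1) and their child (2):
phi = kinship_matrix([(None, None), (None, None), (0, 1)])
# phi[2][0] = 1/4 (parent-offspring), phi[2][2] = 1/2 (non-inbred self)
```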
Next consider a pedigree of p members. Assume the X and Y traits are determined
by the same locus and define the random column vector W = (X_1, ..., X_p, Y_1, ..., Y_p)^t,
where the superscript t denotes vector or matrix transpose. We can now write an
expression for the covariance matrix Cov(W) using partitioned matrices [Rao, 1973]:

Cov(W) = σ_axx (2Φ 0; 0 0) + σ_axy (0 2Φ; 2Φ 0) + σ_ayy (0 0; 0 2Φ)
       + σ_dxx (Δ 0; 0 0) + σ_dxy (0 Δ; Δ 0) + σ_dyy (0 0; 0 Δ),  (3)

where (A B; C D) denotes the partitioned matrix with blocks A and B in its first row
of blocks and C and D in its second. The "0" blocks in (3) represent p by p matrices
whose entries are all zero. The expression (3) extends to the summed contributions
from several different loci, provided the depicted covariance components are also the
appropriate summed components.
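Expression (3) can be assembled mechanically with Kronecker products. The sketch below is our own illustration, not code from the paper: `phi2` and `delta` stand for 2Φ and Δ, while `sa` and `sd` collect the additive and dominance components into 2 x 2 trait-by-trait matrices.

```python
import numpy as np

def genetic_covariance(phi2, delta, sa, sd):
    """Cov(W) of expression (3) for a bivariate trait.

    phi2  : 2*Phi, the doubled kinship matrix (p x p)
    delta : Jacquard Delta7 matrix (p x p)
    sa    : [[s_axx, s_axy], [s_axy, s_ayy]] additive components
    sd    : [[s_dxx, s_dxy], [s_dxy, s_dyy]] dominance components
    """
    # np.kron expands each trait-by-trait entry into a p x p block,
    # reproducing the partitioned form of (3).
    return np.kron(np.asarray(sa), phi2) + np.kron(np.asarray(sd), delta)
```

For a single individual (p = 1, 2Φ = Δ = 1) the result is simply the sum of the component matrices, which is a convenient sanity check.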
Environmental Decompositions
Each environmental decomposition will correspond to a unique partition of the
pedigree into "spheres of environmental influence." Naturally, there are several ways
of partitioning any pedigree. One possibility is the partition whose blocks all contain
just one individual. Another possibility is the partition whose blocks consist of
households within the pedigree. A subpartition of this second partition can be realized
by dividing the household blocks into separate adult and child blocks. To model
maternal effects one can form blocks by lumping all children with the same mother,
regardless of whether they are full sibs or belong to the same household.
Suppose now that a partition of the pedigree is given. Let all the people within
a block of this partition be exposed to the same environment, and let the environmental
contributions among blocks be independent and identically distributed. If V is the
random vector of contributions and B is the p by p symmetric matrix whose entry b_ij
equals 1 if i and j belong to the same block and equals 0 otherwise, then

Cov(V) = σ_exx (B 0; 0 0) + σ_exy (0 B; B 0) + σ_eyy (0 0; 0 B),

with the blocks partitioned as in (3), where σ_exx and σ_eyy
represent the variances of the environmental contributions to the
X and Y traits of each person in the pedigree and σ_exy represents the corresponding
covariance.
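Building B from block labels is a one-liner; the helper below is a hypothetical numpy illustration. Cov(V) then follows by the same Kronecker construction used for (3), with B in place of 2Φ and Δ.

```python
import numpy as np

def block_matrix(blocks):
    """Indicator matrix B for an environmental partition.

    blocks: the block label of each pedigree member; b_ij = 1 when
    i and j share a block, 0 otherwise (so the diagonal is all 1s).
    """
    b = np.asarray(blocks)
    return (b[:, None] == b[None, :]).astype(float)

# A household containing members 0 and 1, and a singleton member 2:
B = block_matrix([0, 0, 1])
```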
Under a completely additive model, the trait values Z from a pedigree can be
represented as

Z = S + V_1 + ··· + V_d,

where S is the summed genetic contribution from many loci and V_k, 1 ≤ k ≤ d, is
the environmental contribution from the kth of d environmental decompositions. If S
and the V_k are independent, then the covariance Ω of Z satisfies

Ω = Cov(S) + Cov(V_1) + ··· + Cov(V_d).  (4)

With appropriate relabeling, the representations (3) and (4) then imply

Ω = Σ_k σ_k Ω_k,  (5)

where the σ_k are covariance components and the Ω_k are symmetric matrices. In the
bivariate case, there are three additive components, three dominance components,
and three environmental components for each environmental decomposition.
Mean Components
One may choose to ignore the issue of mean components in the sense of doing all
regression prior to covariance components analysis. In other words, regression can
be performed assuming the values of different individuals are uncorrelated. In many
practical problems this two-stage approach should have little effect on the estimates
of either mean or covariance components. But for the sake of completeness, let us
show how estimation of mean components fits into our proposed model.
It is convenient to express the vector of trait means within a pedigree as Aμ.
The design matrix A depends on information about pedigree members other than their
trait values, and the vector μ of mean components consists of r parameters to be
estimated. For instance, μ might be (μ_fx, μ_mx, α_x, μ_fy, μ_my, α_y)^t, where μ_fx is the
female mean of the X trait, μ_mx is the male mean of the X trait, α_x is the regression
coefficient of the X trait on age, and μ_fy, μ_my, and α_y are defined similarly for the Y
trait. A sample row of A then might be (0, 0, 0, 1, 0, age). This row would correspond
to the Y trait of a female with the given age.
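As an illustration only, using the six-parameter mean vector described above, one row of A can be generated from a person's trait, sex, and age:

```python
def design_row(trait, sex, age):
    """One row of the design matrix A for the sample mean vector
    mu = (mu_fx, mu_mx, alpha_x, mu_fy, mu_my, alpha_y)^t."""
    row = [0.0] * 6
    base = 0 if trait == "X" else 3             # X components come first
    row[base + (0 if sex == "F" else 1)] = 1.0  # select the sex mean
    row[base + 2] = float(age)                  # age regression slope
    return row

# The Y trait of a female aged 30 gives the row (0, 0, 0, 1, 0, 30).
```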
Maximum Likelihood Estimation
So far we have shown how the trait values over a pedigree can be summarized
in a covariance components model. In many situations it is plausible to assume that
the vector of trait values Z is normally distributed. The appendix contains a terse
discussion of a multivariate central limit theorem for a purely genetic trait over a
pedigree. Under the further assumption of normality, it is possible to give a brief
outline of the scoring algorithm [Rao, 1973] for the maximum likelihood estimation
of the mean components μ = (μ_1, ..., μ_r)^t and the covariance components σ =
(σ_1, ..., σ_s)^t. Lange et al [1976] and Jennrich and Sampson [1976] provide a detailed treatment.
Let L be the log likelihood of the observed trait values from a pedigree. Then
ignoring an irrelevant constant,

L = -½ ln |Ω| - ½ (Z - Aμ)^t Ω^{-1} (Z - Aμ),

where |Ω| is the determinant of Ω. If tr denotes matrix trace, the score vector and
information matrix entries are

∂L/∂μ = A^t Ω^{-1} (Z - Aμ),
∂L/∂σ_k = -½ tr(Ω^{-1} Ω_k) + ½ (Z - Aμ)^t Ω^{-1} Ω_k Ω^{-1} (Z - Aμ),
E(-∂²L/∂μ ∂μ^t) = A^t Ω^{-1} A,
E(-∂²L/∂σ_k ∂σ_l) = ½ tr(Ω^{-1} Ω_k Ω^{-1} Ω_l).

Since ∂Ω/∂σ_k = Ω_k because of (5), and since E(-∂²L/∂μ ∂σ_k) = 0, the information
matrix is block diagonal with blocks E(-∂²L/∂μ ∂μ^t) and E(-∂²L/∂σ ∂σ^t).
If there are n independent pedigrees, let L_k be the log likelihood for the kth
pedigree. Then the scoring algorithm updates μ and σ by adding the increments

[Σ_{k=1}^n E(-∂²L_k/∂μ ∂μ^t)]^{-1} Σ_{k=1}^n ∂L_k/∂μ

and

[Σ_{k=1}^n E(-∂²L_k/∂σ ∂σ^t)]^{-1} Σ_{k=1}^n ∂L_k/∂σ.

Thus, the scoring algorithm is straightforward to implement. The major drawbacks
are the matrix inversion Ω^{-1} and the matrix multiplications Ω^{-1}Ω_k that must be done for
each pedigree at each iteration.
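A single scoring iteration can be sketched directly from these formulas. The function below is a minimal numpy illustration under our own naming, written for one pedigree; a real implementation would sum scores and information matrices over all pedigrees before solving, as in the update above.

```python
import numpy as np

def scoring_step(Z, A, omegas, mu, sigma):
    """One scoring update for a single pedigree (sketch only).

    Z      : observed trait vector
    A      : design matrix
    omegas : list of symmetric matrices Omega_k
    mu, sigma : current mean and covariance component estimates (arrays)
    """
    omega = sum(s * Ok for s, Ok in zip(sigma, omegas))
    oinv = np.linalg.inv(omega)            # the costly inversion
    r = Z - A @ mu                         # residual vector
    # Mean components: score A^t O^{-1} r, information A^t O^{-1} A.
    mu_new = mu + np.linalg.solve(A.T @ oinv @ A, A.T @ oinv @ r)
    # Covariance components: score and information via trace formulas.
    prods = [oinv @ Ok for Ok in omegas]   # the O^{-1} Omega_k products
    score = np.array([-0.5 * np.trace(P) + 0.5 * r @ P @ oinv @ r
                      for P in prods])
    info = np.array([[0.5 * np.trace(P @ Q) for Q in prods] for P in prods])
    return mu_new, sigma + np.linalg.solve(info, score)
```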
The scoring algorithm enjoys two clear advantages over alternatives such as the
Newton-Raphson method of maximizing the likelihood. First, if combined with step
halving, scoring always leads to an increase in the likelihood [Spence et al, 1977].
Second, there are in most cases good starting values that produce least squares
estimates of the covariance components in just one iteration [Spence et al, 1977]. The
second advantage is possible whenever the initial covariance matrix Ω for each
pedigree can be taken as the identity matrix. (The proof in the appendix of Spence et
al [1977] continues to hold.) For instance, to start with the identity matrix assume all
trait variation is determined by the environmental partition whose blocks are individuals. In the bivariate case take the two trait contributions to have unit variance and
zero covariance.
Hypothesis Testing
Once maximum likelihood estimates are available, one can test hypotheses of
interest by the likelihood ratio criterion. This procedure and genetic counseling
procedures are discussed in detail in Lange et al [1976]. Let us add that for bivariate
traits some new hypotheses arise. Thus, one might wish to test σ_axy = 0 or σ_dxy = 0.
If both σ_axy = 0 and σ_dxy = 0 are accepted, then one might tentatively conclude that
the X and Y traits are under the control of different genes. (Note that σ_axy = Σ_k σ_kaxy =
0 does not necessarily imply that each σ_kaxy = 0, where k ranges over different loci.)
To test the overall goodness of fit of a model one can proceed in two ways. In
the univariate case, Spence et al [1977] suggest forming standardized residuals. This
is done for a given individual by predicting his trait mean and variance conditional on
the trait values of the remaining members of his pedigree. The standardized residual
for the individual is then arrived at by subtracting the predicted mean from his actual
trait value and dividing the result by the predicted standard deviation. Hopper and
Mathews [1982] recommend computing standardized residuals as a way of spotting
outlier individuals. Since the standardized residuals within a pedigree will not be
independent, they do not offer an exact method of testing the overall goodness of fit
of a model. Nonetheless, a normal probability plot of all the standardized residuals
would be a valuable visual aid. For bivariate traits, there would naturally be two
standardized residuals per individual.
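For a multivariate normal vector, the conditional mean and variance of one coordinate given all the others can be read off the precision matrix Ω^{-1}, so the standardized residuals can be computed without a separate regression per individual. A sketch, in our own numpy code, assuming the fitted mean vector and Ω are available:

```python
import numpy as np

def standardized_residuals(Z, mean, omega):
    """Standardized residual for each individual: trait value minus its
    mean conditional on everyone else, over the conditional sd."""
    Z = np.asarray(Z, dtype=float)
    mean = np.asarray(mean, dtype=float)
    oinv = np.linalg.inv(omega)
    res = np.empty(len(Z))
    for i in range(len(Z)):
        # Conditioning a multivariate normal on the other coordinates:
        # conditional variance 1 / oinv[i, i]; conditional mean shifts
        # by -cvar * sum_{j != i} oinv[i, j] * (Z[j] - mean[j]).
        cvar = 1.0 / oinv[i, i]
        others = oinv[i] @ (Z - mean) - oinv[i, i] * (Z[i] - mean[i])
        cmean = mean[i] - cvar * others
        res[i] = (Z[i] - cmean) / np.sqrt(cvar)
    return res
```

When Ω is the identity the conditional quantities reduce to the marginal ones, so the residuals are just the centered trait values, a convenient check.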
Ott [1979] recommends computing the random vector Ω^{-½}(Z - Aμ) using the
estimated mean and variance components. With the true values for these parameters,
the elements of Ω^{-½}(Z - Aμ) would be independent, standard normal random
variables. This fact permits rigorous testing of a model. Hopper and Mathews [1982]
further note that the statistic

(Z - Aμ)^t Ω^{-1} (Z - Aμ)  (6)

should have approximately a chi-square distribution. It therefore affords a method of
spotting outlier pedigrees.
The statistic (6) is obviously directly available from the final iteration of the
maximum likelihood search. Unfortunately, Hopper and Mathews [1982] also point
out that (6) summed over all pedigrees should be equal to the total number of
observations in the whole sample. The summed statistic, therefore, cannot serve as
an overall goodness-of-fit criterion.
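Computing (6) for a fitted pedigree is a one-line quadratic form. The sketch below (hypothetical names, numpy only) would be compared against a chi-square distribution with as many degrees of freedom as there are observations in the pedigree.

```python
import numpy as np

def pedigree_statistic(Z, mean, omega):
    """Statistic (6): the quadratic form (Z - mean)^t Omega^{-1} (Z - mean),
    approximately chi-square with len(Z) degrees of freedom under the model."""
    r = np.asarray(Z, dtype=float) - np.asarray(mean, dtype=float)
    return float(r @ np.linalg.solve(np.asarray(omega, dtype=float), r))

# With omega = I the statistic is just the squared residual norm:
# pedigree_statistic([3.0, 4.0], [0.0, 0.0], np.eye(2)) -> 25.0
```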
Linear Versus Nonlinear Models
The models discussed so far are linear in both the mean and the covariance
components. In path analysis and other branches of statistics, nonlinear models
routinely arise. Robert Jennrich has pointed out to us that the scoring algorithm
requires only trivial modifications to accommodate nonlinear effects. Suppose the mean
vector A(μ) and the covariance matrix Ω(σ) of a pedigree depend nonlinearly on the
mean components μ and the covariance components σ. Then in the scoring algorithm
the design matrix A should be replaced by the matrix of partial derivatives whose jth
column is the vector

∂A(μ)/∂μ_j,

and Ω_k by the matrix of partial derivatives

∂Ω(σ)/∂σ_k.

The rest of the details of the scoring algorithm remain the same except that the first
iteration does not necessarily yield least-squares estimates of the covariance components.
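When analytic derivatives of A(μ) or Ω(σ) are inconvenient, a finite-difference Jacobian is a serviceable stand-in for these partial derivative matrices. This generic helper is our own sketch, not part of the authors' method:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Forward-difference Jacobian of a vector- or matrix-valued map f,
    a numerical stand-in for dA(mu)/dmu_j or dOmega(sigma)/dsigma_k."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    cols = []
    for j in range(len(x)):
        xj = x.copy()
        xj[j] += h                       # perturb one component at a time
        cols.append((np.asarray(f(xj)) - f0) / h)
    return np.stack(cols, axis=-1)       # last axis indexes the parameters
```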
There is no reason that nonlinear path coefficient models cannot be handled
within the present framework. Summarizing pedigree data by a sequence of correlations between various types of relatives can force awkward compromises in statistical
analysis. For instance, in computing parent-offspring correlations, larger sibships
must somehow be given more weight than smaller ones. It is probably better to view
path analysis solely as a theoretical method for generating covariance or correlation
matrices. Finally, it would also appear that the maximum likelihood approach outlined
here could be used to advantage in purely cultural inheritance models [Feldman and
Cavalli-Sforza, 19751.
The models discussed here date back to Fisher [1918] and have undergone
modification by two generations of geneticists. The main defect is the blanket assumption of additivity. This assumption excludes interactions among loci and between the
genetic and environmental contributions. Also, we have added a normality assumption. These two assumptions can be at best an approximation to the biological reality
of most traits. We like the additivity assumption for three reasons: First, it makes
mathematical and statistical analysis possible. Without additivity even the two locus
case becomes very complicated [Cockerham, 1956; Denniston, 19751. Second, the
assumption does possess a certain ring of plausibility. Finally, both the additivity and
normality assumptions can be checked empirically by assessing the overall goodness
of fit of a model.
As mentioned earlier, these models do not allow for assortative mating and
cultural inheritance. Our “environmental spheres of influence” do not really mimic
cultural inheritance. Cultural inheritance should result in correlations between relatives that gradually diminish as the cultural distance between the relatives increases.
The environmental influences that we postulate die off abruptly.
Despite these caveats, we think data analysis can proceed productively along
the lines sketched. Future publications will illustrate some of the possibilities. In our
opinion, there are still a host of medically interesting traits that have received little
attention. A covariance components analysis can be an important first step in their
study.
Acknowledgments

We are indebted to Robert Elston, Charles Sing, and Patricia Moll for their
useful criticisms of the text. We also wish to thank Robert Jennrich for his help with
the scoring algorithms and John Hopper for explaining how to deal with outliers. In
part, this research was supported by University of California, Los Angeles; NIH
Research Career Development award K04 HD00307; and USPHS National Research Service awards GM 07104 and GM 07191.
References

Cockerham CC (1956): Effects of linkage on the covariances between relatives. Genetics 41:138-141.
Colletto GMDD, Krieger H, Magalhães JR (1981): Estimates of the genetic and environmental determinants of serum lipid and lipoprotein concentrations in Brazilian twins. Hum Hered 31:232-237.
Crow JF, Kimura M (1970): "An Introduction to Population Genetics Theory." New York: Harper and
Row, pp 115-141.
Denniston C (1975): Probability and genetic relationships: Two loci. Ann Hum Genet 39:89-104.
Elston RC, Graham JB, Miller CH, Reisner HM, Bouma BN (1976): Probabilistic classification of
hemophilia A carriers by discriminant analysis. Thromb Res 8:683-695.
Feldman MW, Cavalli-Sforza LL (1975): Models for cultural influence: A general linear model. Ann
Hum Biol 2:215-226.
Feller W (1971): “An Introduction to Probability Theory and its Applications.” 2nd Ed. New York:
John Wiley and Sons, Vol. 2, pp 518-521.
Fisher RA (1918): The correlation between relatives on the supposition of Mendelian inheritance. Trans
R Soc Edinburgh 52:399-433.
Hazel LN (1943): The genetic basis for constructing selection indexes. Genetics 28:476-490.
Henderson CR (1953): Estimation of variance and covariance components. Biometrics 9:226-252.
Hopper JL, Mathews JD (1982): Extensions to multivariate normal models for pedigree analysis. Ann
Hum Genet 46:373-383.
Jacquard A (1974): "The Genetic Structure of Populations." New York: Springer-Verlag.
Jennrich RI, Sampson PF (1976): Newton-Raphson and related algorithms for maximum likelihood
variance component estimation. Technometrics 18:11-17.
Kempthorne 0 (1954): The correlation between relatives in a random mating population. Proc R Soc
London Ser B 143:103-113.
Kempthorne 0 (1955): The theoretical values of correlations between relatives in random mating
populations. Genetics 40:153-167.
Kempthorne 0 (1969): “An Introduction to Genetic Statistics.” Ames, Iowa: Iowa State University
Press, pp 74-77, 264-267.
Lange K (1978): Central limit theorems for pedigrees. J Math Biol 6:59-66.
Lange K, Westlake J, Spence MA (1976): Extensions to pedigree analysis. III. Variance components by
the scoring method. Ann Hum Genet 39:485-491.
Martin NG, Eaves LJ (1977): The genetical analysis of covariance structure. Heredity 38:79-95.
Mode CJ, Robinson HF (1959): Pleiotropism and the genetic variance and covariance. Biometrics
Moll PP, Sing CF, Brewer GJ, Gilroy TE (1978): Multivariate analysis of the genetic effects of red cell
glycolysis. In Brewer GJ (ed): “The Red Cell.” New York: Alan R. Liss, Inc.
Namboodiri KK, Elston RC, Glueck CJ, Fallat R, Buncher CR, Tsang R (1975): Bivariate analysis of
cholesterol and triglyceride levels in families in which probands have type IIb lipoprotein
phenotype. Am J Hum Genet 27:454-471.
Orey S (1958): A central limit theorem for m-dependent random variables. Duke Math J 25:543-546.
Ott J (1979): Maximum likelihood estimation by counting methods under polygenic and mixed models
in human pedigrees. Am J Hum Genet 31:161-175.
Rao CR (1973): “Linear Statistical Inference and its Applications.” 2nd Ed. New York: John Wiley and
Sons, pp 17, 128-129, 366-374.
Reeve ECR (1955): The variance of the genetic correlation coefficient. Biometrics 11:357-374.
Robertson A (1959): The sampling variance of the genetic correlation coefficient. Biometrics 15:469-485.
Schaeffer LR, Wilton JW, Thompson R (1978): Simultaneous estimation of variance and covariance
components from multitrait mixed model equations. Biometrics 34:199-208.
Spence MA, Westlake J, Lange K (1977): Estimation of the variance components for dermal ridge count.
Ann Hum Genet 41:111-115.
Wright S (1922): Coefficients of inbreeding and relationship. Am Nat 56:330-338.
Appendix

Let us imagine a potentially infinite number of pairs of homologous autosomes
and consider a finite number of loci on each pair. In fact, to prevent clustering of
loci, we impose an upper bound m on the number of loci possible for each pair. Now
number the loci starting with those on the first pair, then proceeding with those on
the second pair, and so forth. As before, assume all loci are in Hardy-Weinberg and
linkage equilibrium and let W_k represent the contribution of the kth locus to a trait
defined over a pedigree. The trait may be univariate or multivariate. Let W_k have r
components. Next, define for each n ≥ 1 two random vectors

S_n = W_1 + ··· + W_n

and S̃_n, where S̃_n is the same as S_n except that each component of S̃_n is standardized to have
mean 0 and variance 1. Under certain conditions S̃_n will follow an approximate
multivariate normal distribution for n large. The relevant requirements for asymptotic
normality are
1) W_1, W_2, ... is an m-dependent sequence, ie, W_k is independent of W_j
whenever |k - j| > m;
2) each component sequence of the vector sequence W_1, W_2, ... satisfies Lindeberg's condition [Feller, 1971]; and
3) the correlation matrices Cov(S̃_n) tend to some limit C as n → ∞.
Lange [1978] proves a central limit theorem under less general assumptions.
Let us discuss these conditions in order. W_1, W_2, ... is m-dependent because
loci k and j are on different chromosomes when |k - j| > m. To discuss the second
and third conditions let us again assume that our trait is bivariate. For the kth locus
define the covariance components σ_kaxx, σ_kaxy, and so forth. Lindeberg's condition
then implies

lim_{n→∞} max_{1≤k≤n} (σ_kaxx + σ_kdxx) / Σ_{j=1}^n (σ_jaxx + σ_jdxx) = 0,

and likewise for the Y trait. Conversely, Lindeberg's condition is satisfied if the W_k
are uniformly bounded and

Σ_{k=1}^n (σ_kaxx + σ_kdxx) → ∞ as n → ∞,

and likewise for the Y trait.
If we examine (3), it is clear that a sufficient condition for the existence of
lim Cov(S̃_n) is the existence of the following limits:

Σ_{k=1}^n σ_kaxx / Σ_{k=1}^n (σ_kaxx + σ_kdxx) → ρ_axx,
Σ_{k=1}^n σ_kayy / Σ_{k=1}^n (σ_kayy + σ_kdyy) → ρ_ayy,
Σ_{k=1}^n σ_kaxy / [Σ_{k=1}^n (σ_kaxx + σ_kdxx)]^½ [Σ_{k=1}^n (σ_kayy + σ_kdyy)]^½ → ρ_axy,
Σ_{k=1}^n σ_kdxy / [Σ_{k=1}^n (σ_kaxx + σ_kdxx)]^½ [Σ_{k=1}^n (σ_kayy + σ_kdyy)]^½ → ρ_dxy.

The first two limits insure the existence of the limits ρ_dxx = 1 - ρ_axx and ρ_dyy = 1 - ρ_ayy.
With this notation, the limiting correlation matrix for Cov(S̃_n) is given by (3) provided
one substitutes the above ρ's for the σ's.
To prove approximate multivariate normality we can assume without loss of
generality that each W_k has mean 0. It then suffices to show that for every constant
vector u, u^t S̃_n tends in distribution to a univariate normal distribution with mean zero
and variance u^t C u [Rao, 1973]. Suppose first that u^t C u = 0. Our assertion is then
obvious because convergence in mean square to zero implies convergence in distribution to zero.
Next suppose u^t C u > 0. Let W_kj and S_nj be the jth components of W_k and S_n,
respectively. Define the random variables

U_nkj = W_kj / Var(S_nj)^½,
T_nk = Σ_{j=1}^r u_j U_nkj,

for each n ≥ 1, 1 ≤ k ≤ n, and 1 ≤ j ≤ r. The random variables T_nk form a
triangular array whose rows have uncorrelated entries with the m-dependence property. Furthermore,

Σ_{k=1}^n T_nk = u^t S̃_n.
According to a result of Orey [1958], it suffices to prove that the triangular array
{T_nk}, n ≥ 1, 1 ≤ k ≤ n, satisfies Lindeberg's condition.
Thus for each ε > 0 we must prove

lim_{n→∞} Σ_{k=1}^n ∫_{|T_nk| > ε} T_nk² dP = 0.

If we apply the Cauchy-Schwarz inequality to each |T_nk|, we can reduce our problem
to showing

lim_{n→∞} Σ_{k=1}^n ∫_{A_nk} U_nkj² dP = 0

for each j, where A_nk = {U_nkj² > ε*} with ε* = ε² / ‖u‖². Now note that

∫_{A_nk} U_nkj² dP = Var(S_nj)^{-1} ∫_{W_kj² > ε* Var(S_nj)} W_kj² dP;

we need only invoke the Lindeberg hypothesis for each component sequence to
complete the proof.
Edited by John M. Opitz