American Journal of Medical Genetics Part B (Neuropsychiatric Genetics) 147B:637–644 (2008)
Review Article
Controlling False Discoveries in Genetic Studies
Edwin J.C.G. van den Oord¹,²*
¹Center for Biomarker Research and Personalized Medicine, Medical College of Virginia, Virginia Commonwealth University, Richmond, Virginia
²Virginia Institute for Psychiatric and Behavioral Genetics, Richmond, Virginia
A false discovery occurs when a researcher concludes that a marker is involved in the etiology of
the disease whereas in reality it is not. In genetic
studies the risk of false discoveries is very high
because only few among the many markers that
can be tested will have an effect on the disease. In
this article, we argue that it may be best to use
methods for controlling false discoveries that
would introduce the same ratio of false discoveries divided by all rejected tests into the literature regardless of systematic differences between
studies. After a brief discussion of traditional
‘‘multiple testing’’ methods, we show that methods
that control the false discovery rate (FDR) may be
more suitable to achieve this goal. These FDR
methods are therefore discussed in more detail.
Instead of merely testing for main effects, it may
be important to search for gene–environment/
covariate interactions, gene–gene interactions or
genetic variants affecting disease subtypes. In
the second section, we point out the challenges
involved in controlling false discoveries in such
searches. The final section discusses the role of
replication studies for eliminating false discoveries and the complexities associated with the
definition of what constitutes a replication and
the design of these studies.
© 2007 Wiley-Liss, Inc.
Key words: false discoveries; genome-wide association studies; multiple hypothesis testing; FDR; data mining; multistage designs
Please cite this article as follows: van den Oord EJCG.
2008. Controlling False Discoveries in Genetic Studies.
Am J Med Genet Part B 147B:637–644.
A false discovery occurs when a researcher concludes that a
marker is involved in the etiology of the disease whereas in
reality it is not. In genetic studies the risk of a false discovery is
very high because only few among all markers that can be
tested will have an effect on the disease. Indeed, it has been
speculated that 19 out of every 20 marker-disease associations
currently reported in the literature are false [Colhoun et al.,
2003]. Phenomena such as population stratification play a role
but failure to exclude chance is the main cause of all these false discoveries.
Proper methods for controlling false discoveries are important because they can prevent a great deal of time and resources being spent on leads that will eventually prove irrelevant and
avoid a loss of confidence in research when many publicized
‘‘discoveries’’ are followed by non-replication. These methods
may become even more important considering it has recently
become possible to screen hundreds of thousands to a million
single nucleotide polymorphisms (SNPs) across the whole
genome for their association with a disease. Without proper
control, the number of false discoveries will be proportional to
the number of markers tested and the literature would be
flooded with false discoveries. The question of how to best
control false discoveries is therefore appropriate and timely.
In the first section of this article we focus on significance
testing. We argue that it may be best to use a method that
would produce the same ratio of false discoveries divided by all
rejected tests regardless of systematic differences between
studies. This would ensure that in the long run, we obtain a
desired ratio of false discoveries to all reported discoveries in
the literature. After a brief discussion of traditional ‘‘multiple
testing’’ methods, we show that methods that control the false
discovery rate (FDR) may be more suitable to achieve this goal.
These FDR methods are therefore discussed in more detail.
Instead of merely testing for main effects, it may be important
to search for gene–environment/covariate interactions, gene–
gene interactions or genetic variants affecting disease subtypes. In the second section, we point out the challenges
involved in controlling false discoveries in such searches. The
control of false discoveries is not solely a data analysis problem
and in the final section we argue that (the theory of) adaptive
multistage designs may present advantages in the search for
genetic variants affecting complex diseases.
This article contains supplementary material, which may be
viewed at the American Journal of Medical Genetics website
Grant sponsor: US National Institute of Mental Health; Grant
number: R01 MH065320.
*Correspondence to: Edwin J.C.G. van den Oord, Medical
College of Virginia, Virginia Commonwealth University, P.O. Box
980533, Richmond, VA 23298-0533.
Received 19 June 2007; Accepted 18 September 2007
DOI 10.1002/ajmg.b.30650
Significance testing typically starts with calculating P-values for each marker. If the calculated P-value is smaller than a threshold P-value, the null hypothesis, which assumes that the marker has no effect, is rejected and the test is called significant. A Type I error is the error of rejecting the null hypothesis when it is true. This results in a false discovery or
false positive.
Controlling the Family Wise Error Rate
Traditional approaches for controlling false discoveries
attempt to maintain a desired probability that a study
produces one or more false discoveries (see supplemental
material for precise definitions of error rates discussed in this
article). This probability depends on the number of markers
tested. For instance, if a single test is performed using a
threshold P-value of 0.05, the probability of a false discovery is
5% if the marker has no effect. However, if 100,000 markers
without effects are tested using the same 0.05 threshold, the
probability of one or more false discoveries is close to one and
the study may produce about 100,000 × 0.05 = 5,000 false
discoveries. To counteract this effect of performing multiple
tests, the threshold P-value needs to be adjusted. In the
Bonferroni correction, for example, the corrected P-value
threshold equals the desired probability of producing one or
more false discoveries divided by the number of tests carried
out. Regardless of the number of tests, such a correction would ensure that fewer than one out of every 20 studies produces one
or more false discoveries.
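The arithmetic behind this correction can be sketched in a few lines of Python, using the figures from the running example (100,000 markers, a 0.05 level):

```python
# Multiple-testing arithmetic for m independent markers with no true effect,
# each tested at significance level alpha.
m = 100_000
alpha = 0.05

# Probability of at least one false discovery without any correction:
p_any_false = 1 - (1 - alpha) ** m   # essentially 1 for large m
# Expected number of false discoveries without correction:
expected_false = m * alpha           # 100,000 x 0.05 = 5,000
# Bonferroni-corrected threshold that keeps the FWE at alpha:
bonferroni = alpha / m               # 5e-7

print(p_any_false, expected_false, bonferroni)
```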
Technically speaking, traditional methods control the so-called family-wise error rate (FWE; these methods control the error rate for the whole set or ''family'' of tests). In the context of genome-wide scans this has been labeled the genome-wise error rate [Lander and Kruglyak, 1995; Risch and Merikangas, 1996]. Although the single-step Bonferroni correction discussed above is probably the best known procedure in this class, it controls the FWE too conservatively at a level smaller than α, thereby sacrificing statistical power. The Šidák
correction [Šidák, 1967] gives exact control of the FWE when
none of the markers has an effect and the tests are independent. If
some of the markers have an effect, step-wise procedures are
generally preferable. The idea is that once one of the null
hypotheses is rejected it cannot any longer be considered true.
We can therefore continue with correcting by a factor (m − 1)
rather than m. Holm’s step-down procedure [Holm, 1979] was
one of the first step-wise procedures, but more powerful
variants now exist [Hochberg and Benjamini, 1990; Dunnet
and Tamhane, 1992]. In GWAS, where the number of tests is very large and the number of true effects relatively small, the use of the Šidák correction or a step-wise procedure is unlikely to have a substantial impact on the number of tests that are declared significant. The fact that control of the FWE is sensitive to correlated tests may present the biggest challenge.
To illustrate the impact of correlated tests, assume 100 perfectly correlated tests. Whereas no correction would be needed
because essentially only one independent test is performed, the
Bonferroni correction would still divide the significance level
by the number of tests carried out. Step-wise methods that
account for such correlated tests are the most powerful. In these
instances, re-sampling [Westfall and Young, 1993] or alternative methods [Dudbridge and Koeleman, 2004; Lin, 2004;
Jung et al., 2005] for drawing repeated samples from the given
data or population suggested by the data may help to avoid
making assumptions about the joint distribution of the test
statistics under the null hypothesis and produce more accurate
control of the FWE.
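As an illustration (not from the article), a minimal sketch of the Šidák threshold and Holm's step-down procedure in plain Python:

```python
def sidak_threshold(alpha, m):
    """Per-test threshold giving exact FWE control for m independent null tests."""
    return 1 - (1 - alpha) ** (1 / m)

def holm_rejections(pvalues, alpha):
    """Indices rejected by Holm's step-down procedure: compare the k-th
    smallest P-value against alpha / (m - k + 1) and stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = []
    for k, i in enumerate(order):          # k = 0, 1, ...
        if pvalues[i] <= alpha / (m - k):  # divide by m, then m - 1, ...
            rejected.append(i)
        else:
            break
    return rejected

# Holm rejects all three tests here; single-step Bonferroni (0.05 / 3 ~ 0.0167)
# would reject only the first.
print(holm_rejections([0.001, 0.02, 0.04], 0.05))   # [0, 1, 2]
```

Note that the Šidák threshold is slightly more liberal than the Bonferroni threshold α/m, reflecting its exact rather than conservative control under independence.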
The False Discovery Rate
Rather than controlling the probability that a study
produces one or more false discoveries, it can be argued that
it may be better to use a method that would produce the same
ratio of false discoveries divided by all rejected tests regardless
of systematic differences between studies. This would ensure
that in the long run, we obtain a desired ratio of false
discoveries to all reported discoveries in the literature. This
ratio is called the marginal FDR [Tsai et al., 2003]. The
marginal FDR can also be interpreted as the probability that a
randomly selected discovery from the literature is false. In
these cases, it is labeled as the false positive report probability
[Thomas and Clayton, 2004; Wacholder et al., 2004] or,
following Morton [1955], the proportion of false positives
[Fernando et al., 2004]. Although in this article, I confine
myself to a more frequentist approach to the false discovery
issue, it should also be noted that, to a certain extent, the FDR allows you to be a frequentist and a Bayesian at the same time
[Efron and Tibshirani, 2002].
The marginal FDR is closely related to indices such as
Benjamini and Hochberg’s [1995] FDR and Storey’s [2002]
positive false discovery rate (pFDR). Loosely speaking, the
marginal FDR is a theoretical construct that represents an
ideal goal. The FDR or pFDR can be viewed as tools that
researchers can use in practice to make decisions about which
tests to call significant to achieve that goal. For sake of
simplicity, however, we will use the term (marginal) FDR for
now and explain some of the differences between these
measures below.
The FDR is not merely another statistical technique and
differs in fundamental ways from the FWE. First, the FWE
focuses exclusively on the risk of false discoveries. Because this
risk is high in a genome-wide association study (GWAS) with
say 500k markers, large studies will be heavily penalized via
very small threshold P-values. However, large studies will not
only produce more false discoveries, they are also likely to
discover more true positives. The FDR ‘‘rewards’’ large studies
for finding more true discoveries by focusing on the proportion
of false discoveries divided by all rejected tests (including false
but also true discoveries). Considering true positives may
make sense in the context of finding genetic variants for
complex diseases. That is, due to small effect sizes, the power to
detect genes is already modest. Instead of further sacrificing
power, it may be better to allow an occasional false discovery to
improve the chances of finding effects. Furthermore, because
there will be multiple genes with small effects, the consequences of a false discovery are not that severe. This would, for
example, be different for single gene disorders where a
discovery implies the strong claim that one has found the
cause, which has important scientific and clinical implications.
A second difference is that in contrast to the FWE, the
number of tests that are performed is not important for the
control of the (marginal) FDR. Instead, an important parameter is p0, which can either be interpreted as the proportion of
markers without effect on the disease or as the probability that
a randomly selected marker has no effect. The higher the
proportion of markers without effect, the more likely it is that a
significant result is a false discovery. This makes intuitive
sense: if p0 is one all discoveries are false, whereas if p0 is zero
none of the discoveries are false. A higher p0 therefore implies a
lower threshold P-value for obtaining the same FDR.
From a theoretical perspective, it seems more sensible to
base the control of false discoveries on p0 rather than the
number of tests carried out. For example, assume that 100,000
researchers each test a single marker. Because each researcher
performs only one test, from his or her perspective no
correction for multiple testing is necessary. However, if all
significant results were published the researchers together
would introduce 100,000 × 0.05 = 5,000 false discoveries into
the literature. Furthermore, assume that one of the researchers would have had the budget to type all 100,000 markers. In this case, s/he would then have applied a correction for multiple testing and instead of 5,000 there would probably not be a single false discovery. The basic problem is that the number of tests is
arbitrary depending on factors such as budget, publication
strategy, and genotyping capacity. It can therefore not be used
to control the accumulation of false discoveries in the literature
at a desired level. In contrast, parameter p0 is not arbitrary and
provides a better basis for applying similar standards to
different studies, which is needed for controlling this accumulation. Thus, it is the fact that p0 is close to one in genetic
studies rather than the number of tests that creates the high
risk for false discoveries.
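The dependence of the marginal FDR on p0 can be made concrete with a short sketch (the helper name `marginal_fdr` and the 80% average power are illustrative assumptions, not values from the article):

```python
def marginal_fdr(p0, t, avg_power):
    """Expected ratio of false to all discoveries at threshold P-value t."""
    false_rate = p0 * t               # rate at which null markers are rejected
    true_rate = (1 - p0) * avg_power  # rate at which true effects are detected
    return false_rate / (false_rate + true_rate)

# At the conventional 0.05 threshold, almost every discovery is false
# when p0 is close to one:
print(round(marginal_fdr(0.99995, 0.05, 0.8), 3))   # 0.999
print(round(marginal_fdr(0.995, 0.05, 0.8), 3))     # 0.926
```

The number of tests does not appear anywhere in this calculation; only p0, the threshold, and the power do.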
A numerical example. In Figure 1 we demonstrate the
relation between the FDR and the FWE numerically, assuming p0 = 0.99995. This value for p0 would mean that if we genotype 500k SNPs, 25 of them have effects. This would include possible redundancy, such as two markers that tag the same high-risk haplotype. This value of p0 corresponds reasonably well with an educated guess for a whole-genome association study [Wacholder et al., 2004] and should be fairly robust, because doubling (p0 = 0.9999) or halving (p0 = 0.999975) the number of disease variants will only have a marginal effect.
Figure 1 shows the obtained FDR when we control the FWE
with the number of tests carried out indicated on the x-axis. We
assumed the goal of one false discovery every ten claims. More
conservative levels may result in a sharp increase in required
sample size [Van den Oord and Sullivan, 2003b] so that
FDR ¼ 0.1 seems to provide a reasonable balance between
controlling false discoveries and the sample size needed to
achieve that goal. We assumed the ideal technique for
controlling the FDR and FWE meaning that the proportion of
false positives introduced into the literature is on average 10%
and that the probability of producing one or more false
discoveries in a study is controlled exactly at the chosen
significance level of 0.05. The figure shows that controlling the
FWE results in a very low FDR when many tests are performed
and a very high FDR when few tests are performed. The low
FDR in Figure 1 obtained when the FWE is controlled in studies where
many tests are performed is the main concern in large scale
association studies because many true effects may be missed. It
is sometimes suggested that for large scale genetic studies
controlling the FDR is very similar to FWE control. In general
this is incorrect because as shown in Figure 1, controlling the
FDR at say 0.1 results in an FDR of at most 0.1, but controlling the FWE results in an FDR that is much smaller than 0.1. The
only exception is when p0 ¼ 1, which in the context of
association studies implies that the heritability of the disease
due to common variants is zero.
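The shape of the curve in Figure 1 can be reproduced approximately with a small sketch (exact FWE control via the Šidák threshold; the 80% average power is an illustrative assumption):

```python
# Marginal FDR implied by exact FWE control at 0.05, as a function of the
# number of tests m, with p0 = 0.99995.
p0, alpha, avg_power = 0.99995, 0.05, 0.8

for m in (1, 100, 500_000):
    t = 1 - (1 - alpha) ** (1 / m)                   # Sidak per-test threshold
    fdr = p0 * t / (p0 * t + (1 - p0) * avg_power)   # implied marginal FDR
    print(f"m = {m:>7}: FDR = {fdr:.4f}")
```

With a single test the implied FDR is close to one; with 500k tests it drops far below a 0.1 target, illustrating both extremes discussed in the text.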
The high FDR in Figure 1 obtained when the FWE is
controlled in studies where few tests are performed is a
problem in (candidate gene) studies focusing on few markers.
This problem will be somewhat mitigated by the better p0 in
these studies. For example, if we assume that this ''prior'' information makes it 100 times more likely that the selected SNPs are associated with the disease, we have p0 = 1 − 100 × (1 − 0.99995) = 0.995. On the other hand, Figure 1 still assumes
FWE control that applies a correction for the number of tests
carried out. However, in candidate genes studies threshold
P-values are rarely adjusted. Consequently, discoveries from
such studies are most likely to be false and together these
studies will introduce a large number of false discoveries into
the literature [Colhoun et al., 2003; Van den Oord and
Sullivan, 2003b; Freimer and Sabatti, 2004; Wacholder et al.,
2004]. Realizing that the risk of false discoveries depends on p0 rather than the number of tests, this argues for adjusting threshold P-values in these studies as well.
Practical Issues Related to FDR Control
Calculating threshold P-values that control the FDR at a desired level. For linkage scans [Lander and Kruglyak,
1995] pre-specified threshold P-values pk are often used to
declare significance and pre-specified thresholds have also been
proposed for GWAS [Risch and Merikangas, 1996; Dahlman
et al., 2002; Blangero, 2004]. In principle, such pre-specified
thresholds can also be calculated in the context of the FDR:

p_k = (1 − p_0) × AP / (p_0/FDR − p_0)   (1)

where p_0 is the proportion of tests without effects, AP the desired Average Power (i.e., the proportion of markers with effects one would like to detect, which depends on effect sizes and sample sizes), and FDR the desired ratio of false to all discoveries. These
calculations require assumptions (e.g., p0 and effect sizes). If
these assumptions are incorrect the desired FDR may not be
achieved. In general, it may therefore be better to use the
empirical FDR methods discussed below. The formula may,
however, still be helpful to design a study that is adequately
powered to control the FDR at a chosen level while detecting a desired
proportion of markers with effect. In addition, if only few tests
are performed, empirical FDR methods may not work very well
and the use of pre-specified threshold P-values may be the only
option [Van den Oord and Sullivan, 2003b].
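The threshold formula above is straightforward to evaluate; the sketch below reproduces two of the Table I entries (50% average power, FDR controlled at 0.1):

```python
def fdr_threshold(p0, avg_power, fdr):
    """Pre-specified threshold p_k = (1 - p0) * AP / (p0/FDR - p0)."""
    return (1 - p0) * avg_power / (p0 / fdr - p0)

# GWAS scenario (p0 = 0.99995) and candidate-gene scenario (p0 = 0.995):
print(f"{fdr_threshold(0.99995, 0.5, 0.1):.2e}")   # 2.78e-06
print(f"{fdr_threshold(0.995, 0.5, 0.1):.2e}")     # 2.79e-04
```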
Fig. 1. The marginal FDR (y-axis) obtained after controlling the family-wise error rate with a different number of tests (x-axis).
TABLE I. Threshold P-Values Required for Controlling the FDR at the 0.1 Level

Average power    p0 = 0.99995 (GWAS)    p0 = 0.995 (candidate genes)
0.5              2.78 × 10⁻⁶             2.79 × 10⁻⁴
0.8              4.44 × 10⁻⁶             4.47 × 10⁻⁴
0.9              5.00 × 10⁻⁶             5.03 × 10⁻⁴
Table I reports threshold P-values that control the marginal
FDR at a level of 0.1 for scenarios that might be of practical
interest. Thus, threshold P-values of 5 × 10⁻⁶ would be needed in a GWAS (p0 = 0.99995) that would have good power to detect effects with that threshold. This would generally be more liberal compared to controlling the FWE in a GWAS with 500k to 1 million SNPs, which would require P-value thresholds in the range of 10⁻⁷ to 10⁻⁸. Furthermore, the table shows
that in a very good candidate gene study, assuming that the prior probability of selecting a marker with effect is increased 100 times, p0 = 1 − 100 × (1 − 0.99995) = 0.995, threshold P-values of 5 × 10⁻⁴ (i.e., 0.0005) may be needed to control the FDR at an acceptable level. This threshold is, for
instance, considerably lower than the 0.05 commonly used in
practice to declare significance in candidate gene studies.
Q-values and sequential P-value methods. FDRs can
be estimated in multiple ways and many standard computer
packages (e.g., R, SAS) have such estimation procedures
implemented. The first approach is to estimate the FDR for a
chosen threshold P-value t. If the m P-values are denoted p_i, i = 1…m, this can be done using the formula:

FDR(t) = (m × t) / #{p_i ≤ t}
Thus, the FDR is estimated by dividing the estimated
number of false discoveries (i.e., the number of tests times the
probability t of rejecting a marker without effect) by the total
number of significant markers (i.e., total number of P-values
smaller than t) that includes the false and true positives. To
avoid arbitrary choices, each of the observed P-values can be
used as a threshold P-value t. The resulting FDR statistics are
then called q-values [Storey, 2003; Storey and Tibshirani, 2003].
For other methods a researcher needs to choose the level at
which to control the FDR statistic. The method then estimates
the threshold P-value. For example, the so-called sequential
P-value method proposed by Benjamini and Hochberg [1995]
first sorts the P-values and then applies a simple rule to decide
which tests are significant [Benjamini and Hochberg, 1995]. It
may not be immediately transparent why this controls the
FDR, and we therefore provide supplementary material
showing that these sequential P-value methods perform the
same calculation as the FDR estimate given above. In addition to being very similar from a theoretical perspective [Black, 2004; Storey et al., 2004],
q-value methods will probably also operate similarly in
practice because researchers will only report markers with
q-values below a certain FDR cut-off as discoveries.
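Both approaches can be sketched in a few lines (the P-values are invented for illustration; the conservative choice p0 = 1 is built in):

```python
def fdr_estimate(pvalues, t):
    """Estimated FDR at threshold t: expected false discoveries (m * t,
    taking p0 = 1) divided by the number of P-values at or below t."""
    m = len(pvalues)
    n_sig = sum(p <= t for p in pvalues)
    return m * t / n_sig if n_sig else 0.0

def bh_rejections(pvalues, q):
    """Benjamini-Hochberg sequential P-value method: sort the P-values and
    find the largest k with p_(k) <= q * k / m; reject those k hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvalues[i] <= q * k / m:
            k_max = k
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.57]
print(bh_rejections(pvals, 0.05))   # [0, 1]
print(fdr_estimate(pvals, 0.01))    # 0.025
```

The equivalence is visible in the code: the Benjamini–Hochberg condition p_(k) ≤ q·k/m is exactly the requirement that the estimated FDR at threshold p_(k) be at most q.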
FDR, pFDR, and local FDR. Controlling Benjamini and
Hochberg’s [1995] FDR at level q in studies where few tests are
performed may result in a proportion of false to all discoveries
in the literature that is much higher than q [Zaykin et al.,
2000]. The positive FDR (pFDR) attempts to correct for this
[Storey, 2002]. The disadvantage of the pFDR is that it requires
additional information to be estimated from the data. This may
in some situations offset its clearer interpretation and
theoretical appeal. As the number of tests carried out
increases, the different FDR indices will become equivalent
to each other and the marginal FDR [Storey, 2003; Tsai et al.,
2003]. How fast this happens depends on p0, average power AP,
and the level at which the FDR is controlled. In general, for
adequately powered GWAS involving hundreds of thousands of
markers there should be little difference. For studies testing a
marker set with a better p0 (e.g., candidate genes, replication
studies), 100–200 markers could suffice.
For a proper interpretation it is important to note that the
above FDRs average the probabilities of being a false
discovery across all significant markers [Finner and Roters,
2001; Glonek and Soloman, 2003]. For instance, a marker may
have a 90% probability of being a false discovery but still be
significant at an FDR level of 0.1 because it was tested
simultaneously with unrelated markers having very low
probabilities. This also applies to q-values that at first glance
may seem to provide marker specific evidence. One consequence is that it is not possible to combine the FDRs from
different markers. For example, to examine whether a certain
biological pathway is involved, a researcher may want to
combine the evidence from all the genes in that pathway
[Aubert et al., 2004]. To quantify the probability that a specific
marker is a false discovery, we need to estimate so-called local
FDRs. These local FDRs, however, generally require data from
a large number of tests to be estimated reliably [Liao et al., 2004].
Estimating p0 and the effect size. Parameter p0 is
unknown and it is not uncommon to assume p0 ¼ 1 in empirical
research. This will control the FDR conservatively as higher
values of p0 will result in smaller values of threshold P-value pk
(see Eq. 1). To avoid this conservative bias, p0 can also be
estimated from the data [Schweder and Spjøtvoll, 1982;
Benjamini and Hochberg, 2000; Mosig et al., 2001; Turkheimer
et al., 2001; Allison et al., 2002; Storey, 2002; Hsueh et al., 2003;
Pounds and Morris, 2003; Pounds and Cheng, 2004; Dalmasso
et al., 2005; Meinshausen and Rice, 2006; Efron et al., 2001].
For genetic studies the best estimators seem to be those that
take advantage of the knowledge that p0 has to be close to one
[Meinshausen and Rice, 2006; Kuo et al., 2007]. However, for
most standard situations where the FDR is controlled at a low
(say 0.1) level, the use of an estimate should not make too much
of a difference because an accurate estimate will be close to one.
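A common way to estimate p0 from the data, in the spirit of Storey [2002], uses the fact that P-values of markers without effect are uniformly distributed; a minimal sketch (the cutoff lam = 0.5 is a conventional but arbitrary choice):

```python
def estimate_p0(pvalues, lam=0.5):
    """Estimate p0 from the density of P-values above the cutoff lam.
    Null P-values are uniform, so about p0 * (1 - lam) of all P-values
    should land above lam; markers with real effects rarely do."""
    m = len(pvalues)
    return min(1.0, sum(p > lam for p in pvalues) / (m * (1 - lam)))

# Toy data: 48 of 100 P-values exceed 0.5, so p0_hat = 48 / (100 * 0.5) = 0.96.
pvals = [0.6] * 48 + [0.2] * 52
print(estimate_p0(pvals))   # 0.96
```

In genetic studies the estimate would be truncated near one, in line with the estimators cited above that exploit p0 being close to one.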
In genetic studies there will typically be a large range of
effect sizes. In most cases, such as GWAS, we are (necessarily) focusing on markers with effects above a certain threshold
rather than the substantial number of markers showing effects
that are real but too small to be reliably detected. It has even
been argued that these very small effects should perhaps better
be viewed as ‘‘null-markers’’ and that the FDR should be
controlled for effects above a certain threshold. For example,
FDR can be low because some of the markers out of the
potentially large pool of markers with very small effects are
significant. In an independent replication study these markers
are, however, unlikely to replicate due to low power. When the
FDR is controlled at say the 0.1 level, it may therefore be better
that this ratio pertains to the markers with effects above a
certain threshold only. In the context of expression arrays,
Efron [2004a] proposed a re-sampling technique for this
purpose that draws repeated samples from the data to
determine an ‘‘empirical’’ null distribution comprising both
true null markers plus the markers with very small effects.
However, in the context of genetic association studies where we
often have very good approximations to the tests statistic
distribution, a potentially more precise parametric approach to
control the FDR for effects above a certain threshold is also
conceivable [Bukszár and Van den Oord, 2007a,b].
Correlated tests due to linkage disequilibrium. In
genetics correlated tests can be expected because of linkage
disequilibrium between markers. Compared to methods to
control the FWE, FDR methods appear relatively robust
against the effects of correlated tests. This has been shown
theoretically for certain forms of dependence [Benjamini and
Hochberg, 1995; Storey, 2003; Tsai et al., 2003; Van den Oord
and Sullivan, 2003a; Fernando et al., 2004] and through
simulations [Brown and Russell, 1997; Korn et al., 2004]. As
the nature and size of the correlations play a role [Efron, 2006],
it is important to note that this robustness seems to generalize
to the context of genetic studies [Sabatti et al., 2003; Van den
Oord and Sullivan, 2003a; Van den Oord, 2005]. An intuitive
explanation for this robustness of FDR methods is that FDR
methods estimate the ratio of false to total discoveries in a
study. Correlated tests mainly increase the variance of these
estimates. However, the FDR indices themselves that are the
means of these estimates tend to be robust.
Instead of merely testing for main effects, it may be
important to search for gene–environment/covariate interactions [Collins, 2004], gene–gene interactions [Carlborg and
Haley, 2004] or genetic variants affecting disease subtypes
[Kennedy et al., 2003]. As long as these searches are (1)
performed systematically for a certain model and (2) test
results can be summarized by a P-value, the above methods can
be used to control false discoveries.
However, very often such searches are not done systematically. For example, after failing to find main effects or replicate
a previously reported association, researchers may start
exploring interactions between genes and environmental
factors, test for effects in subgroups of patients, perform
multimarker and haplotype analyses, test for effects in subsets
of the whole sample, etc. The more extensive these searches,
the more likely that a ‘‘significant’’ finding will eventually be
obtained. The problem is that these models may fit or have
individual components (e.g., an interaction term) that seem
significant because they capitalize on random fluctuations in
the data. The ‘‘significant’’ findings are, however, deceptive
because they may be unlikely to replicate in independent data
sets. In this context, one could wonder how much of the pattern
of all the different ‘‘significant’’ findings for genes such as
Dysbindin is the result of the exploratory nature of some of
the replication studies. To properly assess whether results of
such exploratory searches will ‘‘replicate’’ in independent data
sets, we need to correct for the complexity of the search process
[Shen et al., 2004]. In addition, the complexity of the model
plays a role and needs to be taken into account as well. For
example, (regression) models that have many parameters by
comparison to the amount of data available may still explain a
substantial proportion of variance in the outcome
(i.e., show a good fit). Finally, the form of the model is relevant
because, even if the number of parameters is the same, models
differ in their ability to fit random data [Rissanen, 1978]. The
control of false discoveries after such (model) searches is an
active area of statistical research and in many instances
replication may currently be the safest approach to validate
results obtained from such searches.
Rather than doing such searches manually, the options to
control false discoveries increase by using computer algorithms. However, even with computers, exhaustive searches
may not be possible. For example, for two-locus, two-allele,
fully penetrant models with disease simply classified as absent
versus present, even with the strictest definition of what is
essentially the same pattern there are already 58 nonredundant two-locus models [Li and Reich, 2000]. When either
the covariate or outcome variable are continuous the number of
models will increase dramatically because of possible nonlinear relations. One approach is to confine the analysis to a
specific model and then test this model systematically for all
markers. However, this may miss effects that involve
the non-tested models. Alternatively, machine learning
approaches (e.g., data mining) can be used that attempt to
find other models. These methods typically perform searches in
an intelligent way and avoid considering all possible alternative models [Hastie et al., 2001; Hahn et al., 2003].
When searches are performed by computers, covariance
penalties can be used to account for the complexity of the model
and search process. Covariance penalties are related to the
degrees of freedom (of a test). Loosely speaking, covariance
penalties reflect the extent to which a model can fit random
data or, alternatively, the extent to which model fit depends
(i.e., covaries) on the random features of the data used to derive
the model. A fit index that is corrected by a covariance penalty
essentially estimates the fit of that model in an independent
replication data set. The most simple covariance penalties
merely penalize models for the number of parameters they
estimate. However, covariance penalties exist that also try to
capture the complexity of the form of the model and the search
process [Ye, 1998; Shen and Ye, 2002; Efron, 2004b]. For
example, Owen suggested using three degrees of
freedom for testing a single predictor term in a specific kind of
non-parametric regression model. More popular is the use of cross-validation [Stone, 1974]; cross-validation and related techniques can be viewed as non-parametric (i.e., based on few or weak statistical assumptions) estimates of covariance penalties [Efron, 2004b].
However, it should be noted that these non-parametric
approaches can result in imprecise corrections of fit indices
[Efron, 2004b].
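The point can be illustrated with a small simulation (NumPy; pure-noise predictors and the sample sizes are arbitrary assumptions): an uncorrected fit index looks impressive, while leave-one-out cross-validation, acting as a non-parametric covariance penalty, reveals that nothing would replicate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 20                        # 40 observations, 20 noise predictors
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)           # outcome unrelated to the predictors

# In-sample fit: least-squares regression on the full data.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_in = 1 - np.sum((y - X @ beta) ** 2) / ss_tot

# Leave-one-out cross-validation: refit without case i, then predict case i.
press = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press += (y[i] - X[i] @ b) ** 2
r2_cv = 1 - press / ss_tot

print(f"in-sample R2 = {r2_in:.2f}, cross-validated R2 = {r2_cv:.2f}")
```

The in-sample R² is substantial despite the data being pure noise, while the cross-validated R² collapses, which is exactly the gap a covariance penalty is meant to estimate.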
Searching for Models Using Biological Knowledge
Off-the-shelf data-mining and machine-learning techniques
may help to search for models but can produce artificial models
difficult to interpret from a biological perspective. So, it is
important to use available knowledge to constrain searches to
those models that are biologically meaningful. Although
knowledge is too limited to pre-specify explicit models linking
genotypes to phenotypes, we often do have partial information
[Strohman, 2002; Alon, 2003; Vazquez et al., 2004]. For
example, transcription networks comprise smaller substructures called motifs [Lee et al., 2002; Shen-Orr et al., 2002],
metabolic networks are subject to well-established organization principles [Kell, 2004], and genetic effects can be assumed
to be mediated by more or less coherent biological or pathogenic
processes that can be represented in models by latent variables
[Bollen, 2002; Van den Oord and Snieder, 2002; Van den Oord
et al., 2004]. Using specific machine learning techniques it is
possible to search through complex data sets efficiently while
incorporating such biological knowledge [Kell, 2002; Goodacre,
2005], thereby reducing the probability of false positive and
artificial findings.
Replication
Replication is generally perceived as a key step to rule out
false discoveries. The answer to the questions of what constitutes a
replication and how best it can be achieved is, however, not
straightforward [Chanock et al., 2007]. An important issue is
whether it is necessary to require precise replication (the same
phenotype, genetic marker, genotype, statistical test, and
direction of association) or if less precise definitions of replication
suffice (e.g., any significant marker in the same gene) [Sullivan,
2007]. The problem with less precise definitions is that the
‘‘replication’’ study partly becomes an exploratory analysis,
subject to the above-described phenomenon that the more
extensive the search process, the more likely it is that results will
‘‘replicate.’’ Researchers sometimes justify less precise definitions based on the complexity of the genetic effects on the
psychiatric disorder (e.g., locus heterogeneity, different family
history, disease subtypes, or differences by genetic ancestry).
While such mechanisms might sometimes be at work, these
explanations tend to minimize the possibility that a considerably
more parsimonious explanation is responsible for the results
(i.e., a false positive association) [Sullivan, 2007].
Even when a precise definition of replication is used,
declaring significance is not unambiguous. Rules such as
‘‘P-values smaller than 0.05 suggest replication’’ are arbitrary.
That is, depending on factors such as effect sizes, sample sizes,
and the prior probability, this rule may result in significant
findings that have very different probabilities of being a true
replication. More meaningful decision rules will be needed
such as the use of threshold P-values ensuring that a desired
ratio of false discoveries to all reported discoveries in the
literature is achieved. A very simple approach is to calculate
the local FDR (fdr) discussed in the previous section. That is,
each marker has two possible states: (1) it is related to the
disease and (2) it is unrelated to the disease. Given the test
result in the replication study, we can estimate the
(posterior) probability that it is unrelated to the disease:
$$\mathrm{fdr}(i) = \Pr(H_{0i} = \text{true} \mid T = t_i) = \frac{p_{0i}\, f_0(t_i)}{p_{0i}\, f_0(t_i) + (1 - p_{0i})\, f_{c_i}(t_i)}$$
where H0i denotes the null hypothesis that marker i is unrelated
to the disease, ti is the value of test statistic T for marker i in the
replication study, f0 is the density function under the null
distribution, and fci is the density function under the alternative
distribution, where ci is the (effect size) parameter for marker i
that affects the test statistic distribution under the alternative.
To estimate fdr(i) we can use the effect size we observe in the
replication study, as effect sizes in the initial study are typically
biased upward [Goring et al., 2001; Ioannidis et al., 2001].
Although p0i can be estimated, a simple approach would be to
assume a range of values and examine for which values fdr(i)
would be sufficiently small (say 0.1). Although more
sophisticated methods are conceivable, this simple method
would at least provide a better interpretation of what we call a
replication and would begin to standardize the
criterion for declaring significance across studies.
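A minimal sketch of this calculation follows, assuming a normal z-statistic for the replication test and using the observed effect size as the noncentrality under the alternative; the specific numbers (z = 3.3, effect = 3.3) are illustrative only.

```python
import math

def normal_pdf(x, mean=0.0):
    # density of a Normal(mean, 1) variable
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def local_fdr(t, p0, effect):
    """Posterior probability that a marker is null, given its replication
    z-statistic t, an assumed prior probability p0 of being null, and an
    assumed effect size (noncentrality) under the alternative."""
    f0 = normal_pdf(t)                # density under H0
    f1 = normal_pdf(t, mean=effect)   # density under the alternative
    return p0 * f0 / (p0 * f0 + (1 - p0) * f1)

# Examine a range of p0 values, as suggested in the text:
for p0 in (0.9, 0.99, 0.999):
    print(p0, round(local_fdr(t=3.3, p0=p0, effect=3.3), 3))
```

The loop makes the sensitivity to the prior explicit: the same replication z-statistic can imply a small fdr when few nulls are expected but a large one when almost all markers are null.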
Replication studies are often designed in an opportunistic
fashion (e.g., dictated by available sample). However, using the
(theory of) multistage designs, replication studies can be
designed in a cost-effective manner. In the multistage designs
as intended here, all markers are genotyped and tested only
in Stage 1. The most promising markers are then genotyped in
new samples in the second (replication) stage [Saito and Kamatani, 2002; Satagopan et al., 2002;
Aplenc et al., 2003; Satagopan and Elston, 2003; Van den Oord
and Sullivan, 2003a,b; Lowe et al., 2004]. The theory of
multistage designs allows one to calculate optimal sample sizes
for the replication part, taking the sample size of Stage 1 into
account if that is not under the control of the investigator. In
addition, multistage designs offer the possibility to use
information collected at the first stage(s) to design the second
replication stage. In contrast, the design of a single-stage large-scale
association study is based entirely on assumptions
about effect sizes, the proportion of markers with effects, etc. The
problem is that if these assumptions are incorrect, the goals
may not be achieved or could have been achieved at much lower
cost. This idea of adaptive [Bauer and Brannath, 2004b] or self-designing studies [Fisher, 1998], where information from
earlier stages is used to improve the design of later stages is
for instance used in clinical trials. A simple example in the
present context would be to test for population stratification/
ascertainment bias in a Stage 1 case-control sample and then
perform a family-based follow-up study rather than another
case-control study if needed. Another example involves the use
of statistical procedures to ensure adequately powered follow-up
studies or to determine the P-value threshold ensuring
that a sufficiently large proportion of markers with effect
are selected for the next stage. The latter is important because
whereas false discoveries can always be eliminated in future
studies, markers with effects that have been eliminated can
never be recovered. A final example involves the careful
integration of all the findings with those already out in the
literature to maximize the probability that the relevant
markers are selected for the next stage. Particularly for studies
as expensive as GWAS, it may be better to perform interim
analyses and adjust the study design if that turns out to be
necessary to achieve the goals or save costs.
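The point that markers with true effects lost at Stage 1 can never be recovered can be made concrete with a small power calculation. The sketch below uses a normal approximation; the z-scale effect size and the candidate thresholds are illustrative assumptions, not values from the article.

```python
from statistics import NormalDist

std_normal = NormalDist()

def stage1_retention(alpha, effect):
    """Probability that a marker whose test statistic is Normal(effect, 1)
    passes a two-sided Stage 1 threshold alpha and is carried forward
    to the replication stage."""
    z = std_normal.inv_cdf(1 - alpha / 2)
    return (1 - std_normal.cdf(z - effect)) + std_normal.cdf(-z - effect)

# Stricter Stage 1 thresholds discard more markers with true effects,
# and those markers are permanently lost:
for alpha in (0.05, 0.01, 0.001):
    print(alpha, round(stage1_retention(alpha, effect=3.0), 3))
```

Running the threshold choice through such a calculation, rather than fixing it by convention, is one way to ensure a sufficiently large proportion of markers with effects survives to the next stage.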
Rather than analyzing replication data separately, a joint
analysis of the initial study and replication stage will give a
more powerful test. Standard tests cannot be used for such
combined data because only the markers that are significant in
the first stage are selected for the second replication stage and
the test statistics at both stages are dependent as a result of the
partly overlapping data [Satagopan et al., 2002; Van den Oord
and Sullivan, 2003a]. Simulations could be used instead [Lowe
et al., 2004]. However, testing in both stages at a significance
level of say α = 0.001 implies that out of every million simulated
samples only one sample (1 million × 0.001 × 0.001) will be
rejected at Stage 2. Thus, if one would like 1,000 rejected
samples at Stage 2 to estimate the critical values needed for
significance testing, a billion samples need to be simulated. If
available, theoretical approximations may be preferred, such as
a general approximation assuming (bi-variate)
normality of the test statistics at both stages [Satagopan et al.,
2002], an approximation for when the test statistic is the difference between the
allele frequencies in cases and controls [Skol et al., 2006], and
one for when Pearson's Chi-square statistic is used to analyze a
contingency table [Bukszár and Van den Oord, 2006]. However, the correct distribution of the test statistic may not be
known, and combining the raw data may not always be possible
(e.g., if the first stage was done by another group or if the samples
in the two stages differ, such as family-based versus case-control studies). In these instances, researchers could resort to
combining the P-values across stages. Many techniques are
available in the meta-analysis literature for this purpose
[Sutton et al., 2000; Bauer and Brannath, 2004a] although
most of these techniques may need to be slightly modified to
account for the fact that in multistage designs there is a
selection of the most significant P-values in Stage 1.
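As an illustration of combining P-values across stages, the sketch below applies Fisher's classic method. Note the caveat from the text: in a multistage design the Stage 1 P-values are selected for significance, so this off-the-shelf version would need modification (e.g., a truncated-product variant) before it is valid there; the example P-values are illustrative.

```python
import math

def fisher_combined(p_values):
    """Fisher's method: -2 * sum(log p) follows a chi-square distribution
    with 2k degrees of freedom when the k P-values are independent and
    uniform under H0 (NOT the case for P-values selected in Stage 1)."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    k = len(p_values)  # chi-square df = 2k; closed-form survival for even df:
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (stat / 2.0) / i
        total += term
    return math.exp(-stat / 2.0) * total

# Combine a Stage 1 and a Stage 2 P-value into one overall P-value:
print(fisher_combined([0.001, 0.02]))
```

For a single P-value the formula reduces to the P-value itself, which is a convenient sanity check on the chi-square survival computation.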
Proper methods for controlling false discoveries are important to prevent time and resources being spent on leads that
will eventually prove irrelevant. In the context of finding
common variants for complex diseases, an ideal goal may be
to achieve the same ratio of false discoveries divided
by all rejected tests regardless of systematic differences
between studies. This ideal can never be achieved using
(traditional) methods where corrections for ‘‘multiple testing’’
are based on the number of tests carried out. This is because the
number of tests is arbitrary depending on factors such as
budget, publication strategy, and genotyping capacity. Instead
methods that control the FDR may be more suitable.
If the aim is to control the FDR at a low level (say 0.1),
FDR control in GWAS will be relatively straightforward. That
is, a very simple method can be used that is likely to give
results very similar to those of the more complex FDR variants (e.g.,
pFDR, or variants that include p0 estimates). Controlling the FDR in studies
where very few markers are tested (e.g., candidate gene
studies) is difficult. Rather than using an empirical method,
the use of a predetermined P-value threshold such as 5 × 10⁻⁴
may sometimes be the best option.
Control of False Discoveries
Exploratory analyses are potentially an important source of
false discoveries. The problem is to properly account for the
phenomenon that the more extensive the search process, the
more likely it is that a ‘‘significant’’ finding will be obtained.
Rather than performing such searches manually, the ability
to control false discoveries increases when computer
algorithms are used. However, as statistical techniques to account for the complexities involved in controlling
false positives in exploratory searches are still being developed, independent replication/validation may eventually be needed.
The answer to the questions of what constitutes a replication
and how best it can be achieved is not straightforward. A strict
definition of what constitutes a replication may be preferred to
prevent the replication study from becoming an exploratory study
subject to the above-described phenomenon that the more
extensive the search process, the more likely it is that results
will ‘‘replicate.’’ Depending on factors such as effect sizes and
sample sizes, rules such as ‘‘P-values smaller than 0.05 suggest
replication’’ may result in significant findings that have very
different probabilities of being a true replication. Less
arbitrary decision rules, such as rules based on the posterior
probability that the replication is a false positive, will be
needed to help interpret replication findings and start standardizing the criterion across replication studies. Rather than
being designed in an opportunistic fashion (e.g., dictated by
available samples), replication studies can be designed in a cost-effective manner using the (theory of adaptive) multistage designs.
Supplementary information is available on Edwin van den
Oord's web site:
Acknowledgments
I would like to thank Joseph McClay for his comments on an
earlier draft of this article and Rebecca Ortiz for her help with
preparing the article.
References
Allison DB, Gadbury G, Heo M, Fernandez J, Lee C-K, Prolla TA, et al. 2002.
A mixture model approach for the analysis of microarray gene
expression data. Comput Stat Data Anal 39:1–20.
Alon U. 2003. Biological networks: The tinkerer as an engineer. Science
Aplenc R, Zhao H, Rebbeck TR, Propert KJ. 2003. Group sequential methods
and sample size savings in biomarker-disease association studies.
Genetics 163:1215–1219.
Aubert J, Bar-Hen A, Daudin JJ, Robin S. 2004. Determination of the
differentially expressed genes in microarray experiments using local
FDR. BMC Bioinformatics 5:125.
Bauer P, Brannath W. 2004a. The advantages and disadvantages of
adaptive designs for clinical trials. Drug Discov Today 9:351–357.
Brown BW, Russell K. 1997. Methods of correcting for multiple testing:
Operating characteristics. Stat Med 16:2511–2528.
Bukszár J, Van den Oord EJCG. 2006. Optimization of two-stage
genetic designs where data are combined using an accurate and
efficient approximation for Pearson’s statistic. Biometrics 62:1132–
Bukszár J, Van den Oord EJCG. 2007a. Estimating effect sizes in large scale
genetic association studies. Submitted for publication.
Bukszár J, Van den Oord EJCG. 2007b. Estimating the proportion of
markers without effect and average effect size in large scale genetic
association studies. Submitted for publication.
Carlborg O, Haley CS. 2004. Epistasis: Too often neglected in complex trait
studies? Nat Rev Genet 5:618–625.
Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, Thomas G,
et al. 2007. Replicating genotype-phenotype associations. Nature
Colhoun HM, McKeigue PM, Davey SG. 2003. Problems of reporting genetic
associations with complex outcomes. Lancet 361:865–872.
Collins FS. 2004. The case for a US prospective cohort study of genes and
environment. Nature 429:475–477.
Dahlman I, Eaves IA, Kosoy R, Morrison VA, Heward J, Gough SC, et al.
2002. Parameters for reliable results in genetic association studies in
common disease. Nat Genet 30:149–150.
Dalmasso C, Broet P, Moreau T. 2005. A simple procedure for estimating the
false discovery rate. Bioinformatics 21:660–668.
Dudbridge F, Koeleman BP. 2004. Efficient computation of significance
levels for multiple associations in large studies of correlated data,
including genomewide association studies. Am J Hum Genet 75:424–
Dunnet CW, Tamhane AC. 1992. A step up multiple test procedure. J Am
Stat Assoc 87:162–170.
Efron B. 2004a. Large-scale simultaneous hypothesis testing: The choice of a
null hypothesis. J Am Stat Assoc 99:96–104.
Efron B. 2004b. The estimation of prediction error: Covariance penalties and
cross-validation. J Am Stat Assoc 99:619–632.
Efron B. 2006. Correlation and large-scale simultaneous significance testing. Stanford Technical Report.
Efron B, Tibshirani R. 2002. Empirical bayes methods and false discovery
rates for microarrays. Genet Epidemiol 23:70–86.
Efron B, Tibshirani R, Storey JD, Tusher VG. 2001. Empirical Bayes
analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160.
Fernando RL, Nettleton D, Southey BR, Dekkers JC, Rothschild MF, Soller
M. 2004. Controlling the proportion of false positives in multiple
dependent tests. Genetics 166:611–619.
Finner H, Roters M. 2001. On the false discovery rate and expected Type I
errors. Biometrical J 8:985–1005.
Fisher LD. 1998. Self-designing clinical trials. Stat Med 17:1551–1562.
Freimer N, Sabatti C. 2004. The use of pedigree, sib-pair and association
studies of common diseases for genetic mapping and epidemiology. Nat
Genet 36:1045–1051.
Glonek G, Soloman P. 2003. Discussion of resampling-based multiple
testing for microarray data analysis by Ge, Dudoit and Speed. Test 12:
Goodacre R. 2005. Making sense of the metabolome using evolutionary
computation: Seeing the wood with the trees. J Exp Bot 56:245–254.
Bauer P, Brannath W. 2004b. The advantages and disadvantages of adaptive designs for clinical trials. Drug Discov Today 9:351–357.
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B
Benjamini Y, Hochberg Y. 2000. On adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat
Black MA. 2004. A note on the adaptive control of false discovery rates. J R Stat Soc B 66:297–304.
Blangero J. 2004. Localization and identification of human quantitative trait loci: King harvest has surely come. Curr Opin Genet Dev 14:233–
Bollen KA. 2002. Latent variables in psychology and the social sciences. Annu Rev Psychol 53:605–634.
Goring HH, Terwilliger JD, Blangero J. 2001. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 69:1357–1369.
Hahn LW, Ritchie MD, Moore JH. 2003. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 19:376–382.
Hastie T, Tibshirani R, Friedman J. 2001. The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
Hochberg Y, Benjamini Y. 1990. More powerful procedures for multiple significance testing. Stat Med 9:811–818.
Holm S. 1979. A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70.
Hsueh H, Chen J, Kodell R. 2003. Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat
Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG. 2001.
Replication validity of genetic association studies. Nat Genet 29:306–
Jung SH, Bang H, Young S. 2005. Sample size calculation for multiple
testing in microarray data analysis. Biostatistics 6:157–169.
Shen X, Ye J. 2002. Adaptive model selection. J Am Stat Assoc 97:210–221.
Shen X, Huang H, Ye J. 2004. The estimation of prediction error: Covariance
penalties and cross-validation: Comment. J Am Stat Assoc 99:634–
Kell DB. 2002. Genotype-phenotype mapping: Genes as computer programs.
Trends Genet 18:555–559.
Shen-Orr SS, Milo R, Mangan S, Alon U. 2002. Network motifs in the
transcriptional regulation network of Escherichia coli. Nat Genet
Kell DB. 2004. Metabolomics and systems biology: Making sense of the soup.
Curr Opin Microbiol 7:296–307.
Šidák Z. 1967. Rectangular confidence regions for the means of multivariate
distributions. J Am Stat Assoc 62:626–633.
Kennedy JL, Farrer LA, Andreasen NC, Mayeux R, George-Hyslop P. 2003.
The genetics of adult-onset neuropsychiatric disease: Complexities and
conundra? Science 302:822–826.
Skol AD, Scott LJ, Abecasis GR, Boehnke M. 2006. Joint analysis is more
efficient than replication-based analysis for two-stage genome-wide
association studies. Nat Genet 38:209–213.
Korn EL, Troendle J, McShane L, Simon R. 2004. Controlling the number of
false discoveries: Application to high-dimensional genomic data. J Stat
Plann Inference 124:379–398.
Stone M. 1974. Cross-validatory choice and assessment of statistical
predictions. J R Stat Soc B 36:111–147.
Kuo P, Bukszar J, Van den Oord EJCG. 2007. Estimating the number and
size of the main effects in genome-wide case-control association studies.
BMC Proc.
Lander E, Kruglyak L. 1995. Genetic dissection of complex traits: Guidelines
for interpreting and reporting linkage results. Nat Genet 11:241–247.
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, et al.
2002. Transcriptional regulatory networks in Saccharomyces cerevisiae.
Science 298:799–804.
Li W, Reich J. 2000. A complete enumeration and classification of two-locus
disease models. Hum Hered 50:334–349.
Liao JG, Lin Y, Selvanayagam ZE, Shih WJ. 2004. A mixture model for
estimating the local false discovery rate in DNA microarray analysis.
Bioinformatics 20:2694–2701.
Lin DY. 2004. An efficient Monte Carlo approach to assessing statistical
significance in genomic studies. Bioinformatics 21:781–787.
Lowe CE, Cooper JD, Chapman JM, Barratt BJ, Twells RC, Green EA, et al.
2004. Cost-effective analysis of candidate genes using htSNPs: A staged
approach. Genes Immun 5:301–305.
Meinshausen N, Rice J. 2006. Estimating the proportion of false null
hypotheses among a large number of independently tested hypotheses.
Ann Stat 34:373–393.
Morton NE. 1955. Sequential tests for the detection of linkage. Am J Hum
Genet 7:277–318.
Mosig MO, Lipkin E, Khutoreskaya G, Tchourzyna E, Soller M, Friedmann
A. 2001. A whole genome scan for quantitative trait loci affecting milk
protein percentage in Israeli-Holstein cattle, by means of selective milk
DNA pooling in a daughter design, using an adjusted false discovery rate
criterion. Genetics 157:1683–1698.
Pounds S, Cheng C. 2004. Improving false discovery rate estimation.
Bioinformatics 20:1737–1745.
Pounds S, Morris SW. 2003. Estimating the occurrence of false positives and
false negatives in microarray studies by approximating and partitioning
the empirical distribution of P-values. Bioinformatics 19:1236–1242.
Risch N, Merikangas K. 1996. The future of genetic studies of complex
human diseases. Science 273:1516–1517.
Rissanen J. 1978. Modeling by shortest data description. Automatica
Sabatti C, Service S, Freimer N. 2003. False discovery rate in linkage and
association genome screens for complex disorders. Genetics 164:829–
Saito A, Kamatani N. 2002. Strategies for genome-wide association studies:
Optimization of study designs by the stepwise focusing method. J Hum
Genet 47:360–365.
Satagopan JM, Elston RC. 2003. Optimal two-stage genotyping in
population-based association studies. Genet Epidemiol 25:149–157.
Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, Begg CB. 2002. Two-stage designs for gene-disease association studies. Biometrics 58:163–
Schweder T, Spjøtvoll E. 1982. Plots of P-values to evaluate many tests
simultaneously. Biometrika 69:493–502.
Storey J. 2002. A direct approach to false discovery rates. J R Stat Soc B
Storey J. 2003. The positive false discovery rate: A Bayesian interpretation
and the q-value. Ann Stat 31:2013–2035.
Storey J, Tibshirani R. 2003. Statistical significance for genome-wide
studies. Proc Natl Acad Sci 100:9440–9445.
Storey J, Taylor JE, Siegmund D. 2004. Strong control, conservative point
estimation and simultaneous conservative consistency of false discovery
rates: A unified approach. J R Stat Soc B 66:187–205.
Strohman R. 2002. Maneuvering in the complex path from genotype to
phenotype. Science 296:701–703.
Sullivan PF. 2007. Spurious genetic associations. Biol Psychiatry 61:1121–
Sutton A, Abrams K, Jones D, Sheldon T, Song F. 2000. Methods for meta-analysis in medical research. Chichester, UK: Wiley.
Thomas DC, Clayton DG. 2004. Betting odds and genetic associations. J Natl
Cancer Inst 96:421–423.
Tsai CA, Hsueh HM, Chen JJ. 2003. Estimation of false discovery rates in
multiple testing: Application to gene microarray data. Biometrics
Turkheimer FE, Smith CB, Schmidt K. 2001. Estimation of the number of
‘‘true’’ null hypotheses in multivariate analysis of neuroimaging data.
Neuroimage 13:920–930.
Van den Oord EJCG. 2005. Controlling false discoveries in candidate gene
studies. Mol Psychiatry 10:230–231.
Van den Oord EJCG, Snieder H. 2002. Including measured genotypes in
statistical models to study the interplay of multiple factors affecting
complex traits. Behav Genet 32:1–22.
Van den Oord EJCG, Sullivan PF. 2003a. A framework for controlling false
discovery rates and minimizing the amount of genotyping in the search
for disease mutations. Human Heredity 56:188–199.
Van den Oord EJCG, Sullivan PF. 2003b. False discoveries and models for
gene discovery. Trends Genet 19:537–542.
Van den Oord EJCG, MacGregor AJ, Snieder H, Spector TD. 2004. Modeling
with measured genotypes: Effects of the vitamin D receptor gene, age,
and latent genetic and environmental factors on measures of
bone mineral density. Behav Genet 34:197–206.
Vazquez A, Dobrin R, Sergi D, Eckmann JP, Oltvai ZN, Barabasi AL. 2004.
The topological relationship between the large-scale attributes and local
interaction patterns of complex networks. Proc Natl Acad Sci USA
Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. 2004.
Assessing the probability that a positive report is false: An approach for
molecular epidemiology studies. J Natl Cancer Inst 96:434–442.
Westfall P, Young SS. 1993. Resampling-based multiple testing. New York:
Ye J. 1998. On measuring and correcting the effects of data mining and
model selection. J Am Stat Assoc 93:120–131.
Zaykin D, Young S, Westfall P. 2000. Using the false discovery rate in the
genetic dissection of complex traits. Genetics 154:1917–1918.