American Journal of Medical Genetics Part B (Neuropsychiatric Genetics) 147B:637–644 (2008)

Review Article

Controlling False Discoveries in Genetic Studies

Edwin J.C.G. van den Oord1,2*

1 Center for Biomarker Research and Personalized Medicine, Medical College of Virginia, Virginia Commonwealth University, Richmond, Virginia
2 Virginia Institute for Psychiatric and Behavioral Genetics, Richmond, Virginia

A false discovery occurs when a researcher concludes that a marker is involved in the etiology of the disease whereas in reality it is not. In genetic studies the risk of false discoveries is very high because only few among the many markers that can be tested will have an effect on the disease. In this article, we argue that it may be best to use methods for controlling false discoveries that would introduce the same ratio of false discoveries divided by all rejected tests into the literature regardless of systematic differences between studies. After a brief discussion of traditional “multiple testing” methods, we show that methods that control the false discovery rate (FDR) may be more suitable to achieve this goal. These FDR methods are therefore discussed in more detail. Instead of merely testing for main effects, it may be important to search for gene–environment/covariate interactions, gene–gene interactions or genetic variants affecting disease subtypes. In the second section, we point out the challenges involved in controlling false discoveries in such searches. The final section discusses the role of replication studies for eliminating false discoveries and the complexities associated with the definition of what constitutes a replication and the design of these studies.

© 2007 Wiley-Liss, Inc.

KEY WORDS: false discoveries; genome-wide association studies; multiple hypothesis testing; FDR; data mining; multistage designs

Please cite this article as follows: van den Oord EJCG. 2008. Controlling False Discoveries in Genetic Studies.
Am J Med Genet Part B 147B:637–644.

INTRODUCTION

A false discovery occurs when a researcher concludes that a marker is involved in the etiology of the disease whereas in reality it is not. In genetic studies the risk of a false discovery is very high because only a few of the many markers that can be tested will have an effect on the disease. Indeed, it has been speculated that 19 out of every 20 marker-disease associations currently reported in the literature are false [Colhoun et al., 2003]. Phenomena such as population stratification play a role, but failure to exclude chance is the main cause of all these false discoveries.

Proper methods for controlling false discoveries are important because they can prevent time and resources from being spent on leads that will eventually prove irrelevant, and can avoid a loss of confidence in research when many publicized “discoveries” are followed by non-replication. These methods may become even more important now that it has become possible to screen hundreds of thousands to a million single nucleotide polymorphisms (SNPs) across the whole genome for their association with a disease. Without proper control, the number of false discoveries will be proportional to the number of markers tested and the literature would be flooded with false discoveries. The question of how best to control false discoveries is therefore appropriate and timely.

In the first section of this article we focus on significance testing. We argue that it may be best to use a method that would produce the same ratio of false discoveries divided by all rejected tests regardless of systematic differences between studies. This would ensure that in the long run, we obtain a desired ratio of false discoveries to all reported discoveries in the literature. After a brief discussion of traditional “multiple testing” methods, we show that methods that control the false discovery rate (FDR) may be more suitable to achieve this goal.
These FDR methods are therefore discussed in more detail. Instead of merely testing for main effects, it may be important to search for gene–environment/covariate interactions, gene–gene interactions or genetic variants affecting disease subtypes. In the second section, we point out the challenges involved in controlling false discoveries in such searches. The control of false discoveries is not solely a data analysis problem and in the final section we argue that (the theory of) adaptive multistage designs may present advantages in the search for genetic variants affecting complex diseases.

This article contains supplementary material, which may be viewed at the American Journal of Medical Genetics website at http://www.interscience.wiley.com/jpages/1552-4841/suppmat/index.html.
Grant sponsor: US National Institute of Mental Health; Grant number: R01 MH065320.
*Correspondence to: Edwin J.C.G. van den Oord, Medical College of Virginia, Virginia Commonwealth University, P.O. Box 980533, Richmond, VA 23298-0533. E-mail: firstname.lastname@example.org
Received 19 June 2007; Accepted 18 September 2007
DOI 10.1002/ajmg.b.30650

SIGNIFICANCE TESTING

Significance testing typically starts with calculating P-values for each marker. If the calculated P-value is smaller than a threshold P-value, the null hypothesis, assuming that the marker has no effect, is rejected and the test is called significant. A Type I error is the error of rejecting the null hypothesis when it is true. This results in a false discovery or false positive.

Controlling the Family Wise Error Rate

Traditional approaches for controlling false discoveries attempt to maintain a desired probability that a study produces one or more false discoveries (see supplemental material for precise definitions of error rates discussed in this article). This probability depends on the number of markers tested.
For instance, if a single test is performed using a threshold P-value of 0.05, the probability of a false discovery is 5% if the marker has no effect. However, if 100,000 markers without effects are tested using the same 0.05 threshold, the probability of one or more false discoveries is close to one and the study may produce about 100,000 × 0.05 = 5,000 false discoveries. To counteract this effect of performing multiple tests, the threshold P-value needs to be adjusted. In the Bonferroni correction, for example, the corrected P-value threshold equals the desired probability of producing one or more false discoveries divided by the number of tests carried out. Regardless of the number of tests, such a correction (with a desired probability of 0.05) would ensure that fewer than one out of every 20 studies produces one or more false discoveries.

Technically speaking, traditional methods control the so-called family wise error-rate (FWE; these methods control the error rate for the whole set or “family” of tests). In the context of genome-wide scans this has been labeled the genome wise error-rate [Lander and Kruglyak, 1995; Risch and Merikangas, 1996]. Although the single step Bonferroni correction discussed above is probably the best known procedure in this class, it controls the FWE too conservatively, at a level smaller than the nominal level α, thereby sacrificing statistical power. The Šidák correction [Šidák, 1967] gives exact control of the FWE when none of the markers has an effect and the tests are independent. If some of the markers have an effect, step-wise procedures are generally preferable. The idea is that once one of the null hypotheses is rejected it can no longer be considered true. We can therefore continue by correcting by a factor (m − 1) rather than m, the number of tests. Holm’s step-down procedure [Holm, 1979] was one of the first step-wise procedures, but more powerful variants now exist [Hochberg and Benjamini, 1990; Dunnet and Tamhane, 1992].
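The corrections described above can be illustrated with a minimal sketch. It assumes independent tests, and the function names are hypothetical labels chosen for this illustration rather than part of any package. It computes the single step Bonferroni and Šidák thresholds, the probability of one or more false discoveries when no marker has an effect, and the decisions of Holm's step-down procedure.

```python
def bonferroni_threshold(alpha, m):
    """Single step Bonferroni threshold: desired FWE divided by the number of tests."""
    return alpha / m


def sidak_threshold(alpha, m):
    """Sidak threshold: exact FWE control for m independent tests without effects."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def prob_any_false_discovery(threshold, m):
    """P(one or more false discoveries) for m independent tests without effects."""
    return 1.0 - (1.0 - threshold) ** m


def holm_reject(pvalues, alpha=0.05):
    """Holm's step-down procedure: after each rejection, correct by one test fewer."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest P-value is compared with alpha/m, the next with alpha/(m-1), ...
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-significant test
    return reject


if __name__ == "__main__":
    m = 100_000
    print("Uncorrected P(any false discovery):", prob_any_false_discovery(0.05, m))
    print("Bonferroni threshold:", bonferroni_threshold(0.05, m))
    print("Sidak threshold:     ", sidak_threshold(0.05, m))
```

In this sketch the Šidák threshold is only marginally larger than the Bonferroni threshold when the number of tests is large, and Holm's procedure can reject tests that the single step Bonferroni threshold would not.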
In GWAS, where the number of tests is very large and the number of true effects relatively very small, the use of the Šidák correction or a step-wise procedure is unlikely to have a substantial impact on the number of tests that are declared significant. The fact that control of the FWE is sensitive to correlated tests may present the biggest challenge. To illustrate the impact of correlated tests, assume 100 perfectly correlated tests. Whereas no correction would be needed because essentially only one independent test is performed, the Bonferroni correction would still divide the significance level by the number of tests carried out. Step-wise methods that account for such correlated tests are most powerful. In these instances, re-sampling [Westfall and Young, 1993] or alternative methods [Dudbridge and Koeleman, 2004; Lin, 2004; Jung et al., 2005] for drawing repeated samples from the given data or a population suggested by the data may help to avoid making assumptions about the joint distribution of the test statistics under the null hypothesis and produce more accurate control of the FWE.

The False Discovery Rate

Rather than controlling the probability that a study produces one or more false discoveries, it can be argued that it may be better to use a method that would produce the same ratio of false discoveries divided by all rejected tests regardless of systematic differences between studies. This would ensure that in the long run, we obtain a desired ratio of false discoveries to all reported discoveries in the literature. This ratio is called the marginal FDR [Tsai et al., 2003]. The marginal FDR can also be interpreted as the probability that a randomly selected discovery from the literature is false. In this interpretation it is labeled the false positive report probability [Thomas and Clayton, 2004; Wacholder et al., 2004] or, following Morton, the proportion of false positives [Fernando et al., 2004].
Although in this article I confine myself to a more frequentist approach to the false discovery issue, it should also be noted that to a certain extent the FDR allows one to be a frequentist and a Bayesian at the same time [Efron and Tibshirani, 2002].

The marginal FDR is closely related to indices such as Benjamini and Hochberg’s FDR and Storey’s positive false discovery rate (pFDR). Loosely speaking, the marginal FDR is a theoretical construct that represents an ideal goal. The FDR or pFDR can be viewed as tools that researchers can use in practice to make decisions about which tests to call significant to achieve that goal. For the sake of simplicity, however, we will use the term (marginal) FDR for now and explain some of the differences between these measures below.

The FDR is not merely another statistical technique and differs in fundamental ways from the FWE. First, the FWE focuses exclusively on the risk of false discoveries. Because this risk is high in a genome-wide association study (GWAS) with, say, 500k markers, large studies will be heavily penalized via very small threshold P-values. However, large studies will not only produce more false discoveries, they are also likely to discover more true positives. The FDR “rewards” large studies for finding more true discoveries by focusing on the proportion of false discoveries divided by all rejected tests (including false but also true discoveries). Considering true positives may make sense in the context of finding genetic variants for complex diseases. That is, due to small effect sizes, the power to detect genes is already modest. Instead of further sacrificing power, it may be better to allow an occasional false discovery to improve the chances of finding effects. Furthermore, because there will be multiple genes with small effects, the consequences of a false discovery are not that severe.
This would, for example, be different for single gene disorders where a discovery implies the strong claim that one has found the cause, which has important scientific and clinical implications.

A second difference is that, in contrast to the FWE, the number of tests that are performed is not important for the control of the (marginal) FDR. Instead, an important parameter is p0, which can be interpreted either as the proportion of markers without effect on the disease or as the probability that a randomly selected marker has no effect. The higher the proportion of markers without effect, the more likely it is that a significant result is a false discovery. This makes intuitive sense: if p0 is one all discoveries are false, whereas if p0 is zero none of the discoveries are false. A higher p0 therefore implies a lower threshold P-value for obtaining the same FDR.

From a theoretical perspective, it seems more sensible to base the control of false discoveries on p0 rather than the number of tests carried out. For example, assume that 100,000 researchers each test a single marker. Because each researcher performs only one test, from his or her perspective no correction for multiple testing is necessary. However, if all significant results were published, the researchers together would introduce 100,000 × 0.05 = 5,000 false discoveries into the literature. Now assume instead that one of the researchers had the budget to type all 100,000 markers. In this case, s/he would have applied a correction for multiple testing and instead of 5,000 there would probably not be a single false discovery. The basic problem is that the number of tests is arbitrary, depending on factors such as budget, publication strategy, and genotyping capacity. It can therefore not be used to control the accumulation of false discoveries in the literature at a desired level.
In contrast, parameter p0 is not arbitrary and provides a better basis for applying similar standards to different studies, which is needed for controlling this accumulation. Thus, it is the fact that p0 is close to one in genetic studies, rather than the number of tests, that creates the high risk for false discoveries.

A numerical example. In Figure 1 we demonstrate the relation between the FDR and FWE numerically, assuming p0 = 0.99995. This value for p0 would mean that if we genotype 500k SNPs, 25 of them have effects. This would include possible redundancy such as two markers that tag the same high risk haplotype. This value of p0 might reasonably correspond to an educated guess for a whole-genome association study [Wacholder et al., 2004] and should be fairly robust because doubling (p0 = 0.9999) or halving (p0 = 0.999975) the number of disease variants will only have a marginal effect. Figure 1 shows the obtained FDR when we control the FWE with the number of tests carried out indicated on the x-axis. We assumed the goal of one false discovery in every ten claims. More conservative levels may result in a sharp increase in required sample size [Van den Oord and Sullivan, 2003b], so that FDR = 0.1 seems to provide a reasonable balance between controlling false discoveries and the sample size needed to achieve that goal. We assumed the ideal technique for controlling the FDR and FWE, meaning that the proportion of false positives introduced into the literature is on average 10% and that the probability of producing one or more false discoveries in a study is controlled exactly at the chosen significance level of 0.05. The figure shows that controlling the FWE results in a very low FDR when many tests are performed and a very high FDR when few tests are performed.
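The qualitative pattern described for Figure 1 can be sketched numerically. The code below is an illustration under stated assumptions, not a reproduction of the figure: it assumes independent tests, exact FWE control through a Šidák threshold, and a one-sided z-test in which every marker with an effect has the same hypothetical noncentrality `delta` (the effect and sample sizes underlying the original figure are not given in the text). The marginal FDR is computed as the expected proportion of false discoveries among all discoveries.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_one_sided_z(threshold, delta):
    """Power of a one-sided z-test at a given threshold P-value.

    delta is an assumed noncentrality (standardized effect times sqrt(n)).
    """
    z_crit = -_STD_NORMAL.inv_cdf(threshold)  # upper-tail critical value
    return 1.0 - _STD_NORMAL.cdf(z_crit - delta)


def marginal_fdr(threshold, p0, power):
    """Expected false discoveries divided by expected total discoveries."""
    false_disc = p0 * threshold
    true_disc = (1.0 - p0) * power
    return false_disc / (false_disc + true_disc)


def fdr_under_fwe_control(m, p0=0.99995, fwe=0.05, delta=5.0):
    """Marginal FDR obtained when the FWE is controlled exactly over m tests."""
    # Sidak threshold, written with log1p/expm1 for accuracy when m is large.
    t = -math.expm1(math.log1p(-fwe) / m)
    return marginal_fdr(t, p0, power_one_sided_z(t, delta))


if __name__ == "__main__":
    for m in (1, 10, 1_000, 100_000, 500_000):
        print(f"m = {m:>7}: marginal FDR = {fdr_under_fwe_control(m):.4f}")
```

Under these assumptions the marginal FDR is close to one when a single test is performed and falls far below 0.1 when hundreds of thousands of tests are performed; the exact values depend on the assumed `delta`.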
The low FDR in Figure 1 obtained when the FWE is controlled in studies where many tests are performed is the main concern in large scale association studies because many true effects may be missed. It is sometimes suggested that for large scale genetic studies controlling the FDR is very similar to FWE control. In general this is incorrect because, as shown in Figure 1, controlling the FDR at say 0.1 results in an FDR no higher than 0.1, whereas controlling the FWE results in an FDR that is much smaller than 0.1. The only exception is when p0 = 1, which in the context of association studies implies that the heritability of the disease due to common variants is zero.

The high FDR in Figure 1 obtained when the FWE is controlled in studies where few tests are performed is a problem in (candidate gene) studies focusing on few markers. This problem will be somewhat mitigated by the better p0 in these studies. For example, assume that the “prior” information used to select the markers makes it 100 times more likely that the selected SNPs are associated with the disease; we then have p0 = 1 − 100 × (1 − 0.99995) = 0.995. On the other hand, Figure 1 still assumes FWE control that applies a correction for the number of tests carried out. However, in candidate gene studies threshold P-values are rarely adjusted. Consequently, discoveries from such studies are most likely to be false and together these studies will introduce a large number of false discoveries into the literature [Colhoun et al., 2003; Van den Oord and Sullivan, 2003b; Freimer and Sabatti, 2004; Wacholder et al., 2004]. Realizing that the risk of false discoveries depends on p0 rather than the number of tests, this argues for adjusting P-value thresholds in these studies as well.

Practical Issues Related to FDR Control

Calculating threshold P-values that control the FDR at a desired level.
For linkage scans [Lander and Kruglyak, 1995] pre-specified threshold P-values pk are often used to declare significance, and pre-specified thresholds have also been proposed for GWAS [Risch and Merikangas, 1996; Dahlman et al., 2002; Blangero, 2004]. In principle, such pre-specified thresholds can also be calculated in the context of the FDR:

$$p_k = \frac{(1 - p_0)\,AP}{p_0/\mathrm{FDR} - p_0} \qquad (1)$$

where p0 is the proportion of tests without effects, AP the desired Average Power or proportion of markers with effects one would like to detect, which depends on effect sizes and sample sizes, and FDR the desired ratio of false to all discoveries. These calculations require assumptions (e.g., p0 and effect sizes). If these assumptions are incorrect the desired FDR may not be achieved. In general, it may therefore be better to use the empirical FDR methods discussed below. The formula may, however, still be helpful to design a study that is adequately powered to control the FDR at a level while detecting a desired proportion of markers with effect. In addition, if only few tests are performed, empirical FDR methods may not work very well and the use of pre-specified threshold P-values may be the only option [Van den Oord and Sullivan, 2003b].

Fig. 1. The marginal FDR (y-axis) obtained after controlling the family-wise error rate with a different number of tests (x-axis). [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

TABLE I. Threshold P-Values Required for Controlling the FDR at the 0.1 Level

Average power with
threshold P-value      p0 = 0.99995      p0 = 0.995
0.5                    2.78 × 10^-6      2.79 × 10^-4
0.8                    4.44 × 10^-6      4.47 × 10^-4
0.9                    5.00 × 10^-6      5.03 × 10^-4

Table I reports threshold P-values that control the marginal FDR at a level of 0.1 for scenarios that might be of practical interest. Thus, threshold P-values of 5 × 10^-6 would be needed in a GWAS (p0 = 0.99995) that would have good power to detect effects with that threshold.
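As a check on the threshold formula, the entries of Table I can be recomputed with a few lines of code. This is a minimal sketch; the function name `threshold_p_value` is a hypothetical label for Eq. 1, and the inputs are the values of p0, average power, and FDR stated in the text.

```python
def threshold_p_value(p0, average_power, fdr):
    """Threshold P-value from Eq. 1: (1 - p0) * AP / (p0 / FDR - p0)."""
    return (1.0 - p0) * average_power / (p0 / fdr - p0)


if __name__ == "__main__":
    # Scenarios of Table I: GWAS (p0 = 0.99995) and a very good
    # candidate gene study (p0 = 0.995), both with FDR = 0.1.
    for average_power in (0.5, 0.8, 0.9):
        gwas = threshold_p_value(0.99995, average_power, 0.1)
        candidate = threshold_p_value(0.995, average_power, 0.1)
        print(f"AP = {average_power}: {gwas:.2e}  {candidate:.2e}")
```

With p0 = 0.99995, an average power of 0.9, and FDR = 0.1, the sketch gives a threshold of about 5 × 10^-6, in agreement with the GWAS column of Table I.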
This would generally be more liberal compared to controlling the FWE in a GWAS with 500k to 1 million SNPs, which would require P-value thresholds in the range of 10^-7 to 10^-8. Furthermore, the table shows that in a very good candidate gene study, assuming that the prior probability of selecting a marker with effect is increased 100 times so that p0 = 1 − 100 × (1 − 0.99995) = 0.995, threshold P-values of 5 × 10^-4 (i.e., 0.0005) may be needed to control the FDR at an acceptable level. This threshold is, for instance, considerably lower than the 0.05 commonly used in practice to declare significance in candidate gene studies.

Q-values and sequential P-value methods. FDRs can be estimated in multiple ways and many standard computer packages (e.g., R, SAS) have such estimation procedures implemented. The first approach is to estimate the FDR for a chosen threshold P-value t. If the m P-values are denoted pi, i = 1, ..., m, this can be done using the formula:

$$\widehat{\mathrm{FDR}}(t) = \frac{m\,t}{\#\{p_i \le t\}} \qquad (2)$$

Thus, the FDR is estimated by dividing the estimated number of false discoveries (i.e., the number of tests times the probability t of rejecting a marker without effect) by the total number of significant markers (i.e., the total number of P-values not exceeding t), which includes the false and true positives. To avoid arbitrary choices, each of the observed P-values can be used as a threshold P-value t. The resulting FDR statistics are then called q-values [Storey, 2003; Storey and Tibshirani, 2003]. For other methods a researcher needs to choose the level at which to control the FDR statistic. The method then estimates the threshold P-value. For example, the so-called sequential P-value method first sorts the P-values and then applies a simple rule to decide which tests are significant [Benjamini and Hochberg, 1995].
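The two approaches can be sketched side by side. The code below is an illustration, not the implementation of any particular package: the function names are hypothetical, all markers are conservatively treated as having no effect (p0 = 1), and the q-values include the usual monotonicity step (the minimum of the estimate over all thresholds at least as large as the P-value), which is an assumption beyond the description above.

```python
def estimated_fdr(pvalues, t):
    """Eq. 2: estimated false discoveries (m * t) over the number of P-values <= t."""
    m = len(pvalues)
    n_significant = sum(1 for p in pvalues if p <= t)
    if n_significant == 0:
        return 0.0  # no discoveries, so no false discoveries
    return min(1.0, m * t / n_significant)


def q_values(pvalues):
    """Q-values: Eq. 2 with each observed P-value as threshold, made monotone."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest P-value down to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, m * pvalues[idx] / rank)
        q[idx] = running_min
    return q


def benjamini_hochberg(pvalues, level=0.1):
    """Sequential P-value rule: reject the k smallest P-values, where k is the
    largest rank with p_(k) <= k * level / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * level / m:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
    print([round(q, 3) for q in q_values(pvals)])
    print(benjamini_hochberg(pvals, level=0.1))
```

In this sketch, calling the markers with q-values of at most 0.1 significant selects the same markers as the sequential rule applied at level 0.1.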
It may not be immediately transparent why this controls the FDR, and we therefore provide supplementary material showing that these sequential P-value methods perform the same calculation as in (2). In addition to being very similar from a theoretical perspective [Black, 2004; Storey et al., 2004], q-value methods will probably also operate similarly in practice because researchers will only report markers with q-values below a certain FDR cut-off as discoveries.

FDR, pFDR, and local FDR. Controlling Benjamini and Hochberg’s FDR at level q in studies where few tests are performed may result in a proportion of false to all discoveries in the literature that is much higher than q [Zaykin et al., 2000]. The positive FDR (pFDR) attempts to correct for this [Storey, 2002]. The disadvantage of the pFDR is that it requires additional information to be estimated from the data. This may in some situations offset its clearer interpretation and theoretical appeal. As the number of tests carried out increases, the different FDR indices will become equivalent to each other and to the marginal FDR [Storey, 2003; Tsai et al., 2003]. How fast this happens depends on p0, the average power AP, and the level at which the FDR is controlled. In general, for adequately powered GWAS involving hundreds of thousands of markers there should be little difference. For studies testing a marker set with a better p0 (e.g., candidate genes, replication studies) 100–200 markers could suffice.

For a proper interpretation it is important to note that the above FDRs average the probabilities of being a false discovery across all significant markers [Finner and Roters, 2001; Glonek and Soloman, 2003]. For instance, a marker may have a 90% probability of being a false discovery but still be significant at an FDR level of 0.1 because it was tested simultaneously with unrelated markers having very low probabilities.
This also applies to q-values, which at first glance may seem to provide marker specific evidence. One consequence is that it is not possible to combine the FDRs from different markers. For example, to examine whether a certain biological pathway is involved, a researcher may want to combine the evidence from all the genes in that pathway [Aubert et al., 2004]. To quantify the probability that a specific marker is a false discovery, we need to estimate so-called local FDRs. These local FDRs, however, generally require data from a large number of tests to be estimated reliably [Liao et al., 2004].

Estimating p0 and the effect size. Parameter p0 is unknown and it is not uncommon to assume p0 = 1 in empirical research. This will control the FDR conservatively, as higher values of p0 will result in smaller values of the threshold P-value pk (see Eq. 1). To avoid this conservative bias, p0 can also be estimated from the data [Schweder and Spjøtvoll, 1982; Benjamini and Hochberg, 2000; Mosig et al., 2001; Turkheimer et al., 2001; Allison et al., 2002; Storey, 2002; Hsueh et al., 2003; Pounds and Morris, 2003; Pounds and Cheng, 2004; Dalmasso et al., 2005; Meinshausen and Rice, 2006; Efron et al., 2001]. For genetic studies the best estimators seem to be those that take advantage of the knowledge that p0 has to be close to one [Meinshausen and Rice, 2006; Kuo et al., 2007]. However, for most standard situations where the FDR is controlled at a low (say 0.1) level, the use of an estimate should not make too much of a difference because an accurate estimate will be close to one.

In genetic studies there will typically be a large range of effect sizes. In most cases, such as GWAS, we are (necessarily) focusing on markers with effects above a certain threshold rather than the substantial number of markers showing effects that are real but too small to be reliably detected.
It has even been argued that these very small effects should perhaps better be viewed as “null-markers” and that the FDR should be controlled for effects above a certain threshold. For example, the FDR can be low because some of the markers out of the potentially large pool of markers with very small effects are significant. In an independent replication study these markers are, however, unlikely to replicate due to low power. When the FDR is controlled at say the 0.1 level, it may therefore be better that this ratio pertains to the markers with effects above a certain threshold only. In the context of expression arrays, Efron [2004a] proposed a re-sampling technique for this purpose, that is, drawing repeated samples from the data to determine an “empirical” null distribution comprising both true null markers and the markers with very small effects. However, in the context of genetic association studies, where we often have very good approximations to the test statistic distribution, a potentially more precise parametric approach to control the FDR for effects above a certain threshold is also conceivable [Bukszár and Van den Oord, 2007a,b].

Correlated tests due to linkage disequilibrium. In genetics correlated tests can be expected because of linkage disequilibrium between markers. Compared to methods to control the FWE, FDR methods appear relatively robust against the effects of correlated tests. This has been shown theoretically for certain forms of dependence [Benjamini and Hochberg, 1995; Storey, 2003; Tsai et al., 2003; Van den Oord and Sullivan, 2003a; Fernando et al., 2004] and through simulations [Brown and Russell, 1997; Korn et al., 2004]. As the nature and size of the correlations play a role [Efron, 2006], it is important to note that this robustness seems to generalize to the context of genetic studies [Sabatti et al., 2003; Van den Oord and Sullivan, 2003a; Van den Oord, 2005].
An intuitive explanation for this robustness is that FDR methods estimate the ratio of false to total discoveries in a study. Correlated tests mainly increase the variance of these estimates. However, the FDR indices themselves, which are the means of these estimates, tend to be robust.

EXPLORATORY ANALYSES, DATA MINING, AND MODEL DISCOVERY

Instead of merely testing for main effects, it may be important to search for gene–environment/covariate interactions [Collins, 2004], gene–gene interactions [Carlborg and Haley, 2004] or genetic variants affecting disease subtypes [Kennedy et al., 2003]. As long as (1) these searches are performed systematically for a certain model and (2) test results can be summarized by a P-value, the above methods can be used to control false discoveries. However, very often such searches are not done systematically. For example, after failing to find main effects or replicate a previously reported association, researchers may start exploring interactions between genes and environmental factors, test for effects in subgroups of patients, perform multimarker and haplotype analyses, test for effects in subsets of the whole sample, etc. The more extensive these searches, the more likely that a “significant” finding will eventually be obtained. The problem is that these models may fit, or have individual components (e.g., an interaction term) that seem significant, because they capitalize on random fluctuations in the data. The “significant” findings are, however, deceptive because they may be unlikely to replicate in independent data sets. In this context, one could wonder how much of the pattern of all the different “significant” findings for genes such as Dysbindin is the result of the exploratory nature of some of the replication studies.
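How quickly an extensive search produces a “significant” finding by chance alone can be illustrated with a small calculation. The sketch below is a simplification that assumes the exploratory analyses are independent tests, each carried out at an unadjusted threshold, and that none of the tested effects is real; in practice such analyses are correlated, so the numbers are only indicative. The function name is a hypothetical label.

```python
def prob_spurious_finding(n_analyses, alpha=0.05):
    """P(at least one significant result) over n independent analyses without effects."""
    return 1.0 - (1.0 - alpha) ** n_analyses


if __name__ == "__main__":
    for n in (1, 5, 14, 50, 100):
        print(f"{n:>3} analyses: {prob_spurious_finding(n):.3f}")
```

Under these assumptions, 14 unadjusted analyses are already enough to make a spurious “significant” finding more likely than not.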
To properly assess whether results of such exploratory searches will “replicate” in independent data sets, we need to correct for the complexity of the search process [Shen et al., 2004]. In addition, the complexity of the model plays a role and needs to be taken into account as well. For example, (regression) models that have many parameters in comparison to the amount of data available may still explain a substantial proportion of the variance in the outcome (i.e., show a good fit). Finally, the form of the model is relevant because, even if the number of parameters is the same, models differ in their ability to fit random data [Rissanen, 1978]. The control of false discoveries after such (model) searches is an active area of statistical research and in many instances replication may currently be the safest approach to validate results obtained from such searches.

The options for controlling false discoveries increase when such searches are done by computer algorithms rather than manually. However, even with computers, exhaustive searches may not be possible. For example, for two-locus, two-allele, fully penetrant models with disease simply classified as absent versus present, even with the strictest definition of what is essentially the same pattern there are already 58 nonredundant two-locus models [Li and Reich, 2000]. When either the covariate or the outcome variable is continuous, the number of models will increase dramatically because of possible nonlinear relations. One approach is to confine the analysis to a specific model and then test this model systematically for all markers. However, this may miss effects that involve the non-tested models. Alternatively, machine learning approaches (e.g., data mining) can be used that attempt to find other models. These methods typically perform searches in an intelligent way and avoid considering all possible alternative models [Hastie et al., 2001; Hahn et al., 2003].
When searches are performed by computers, covariance penalties can be used to account for the complexity of the model and search process. Covariance penalties are related to the degrees of freedom (of a test). Loosely speaking, covariance penalties reflect the extent to which a model can fit random data or, alternatively, the extent to which model fit depends on (i.e., covaries with) the random features of the data used to derive the model. A fit index that is corrected by a covariance penalty essentially estimates the fit of that model in an independent replication data set. The simplest covariance penalties merely penalize models for the number of parameters they estimate. However, covariance penalties exist that also try to capture the complexity of the form of the model and the search process [Ye, 1998; Shen and Ye, 2002; Efron, 2004b]. Owen, for example, suggested using three degrees of freedom for testing a single predictor term in a specific kind of non-parametric regression model. More popular is the use of cross-validation [Stone, 1974] and related techniques, which can be viewed as non-parametric (i.e., based on few or weak statistical assumptions) estimates of covariance penalties [Efron, 2004b]. However, it should be noted that these non-parametric approaches can result in imprecise corrections of fit indices [Efron, 2004b].

Searching for Models Using Biological Knowledge

Off-the-shelf data-mining and machine-learning techniques may help to search for models but can produce artificial models that are difficult to interpret from a biological perspective. It is therefore important to use available knowledge to constrain searches to those models that are biologically meaningful. Although knowledge is too limited to pre-specify explicit models linking genotypes to phenotypes, we often do have partial information [Strohman, 2002; Alon, 2003; Vazquez et al., 2004].
For example, transcription networks comprise smaller substructures called motifs [Lee et al., 2002; Shen-Orr et al., 2002], metabolic networks are subject to well-established organization principles [Kell, 2004], and genetic effects can be assumed to be mediated by more or less coherent biological or pathogenic processes that can be represented in models by latent variables [Bollen, 2002; Van den Oord and Snieder, 2002; Van den Oord et al., 2004]. Using specific machine learning techniques it is possible to search through complex data sets efficiently while incorporating such biological knowledge [Kell, 2002; Goodacre, 2005], thereby reducing the probability of false positive and artificial findings.

REPLICATION STUDIES

Replication is generally perceived as a key step to rule out false discoveries. The answer to the questions of what constitutes a replication and how it can best be achieved is, however, not straightforward [Chanock et al., 2007]. An important issue is whether it is necessary to require precise replication (the same phenotype, genetic marker, genotype, statistical test, and direction of association) or whether less precise definitions of replication suffice (e.g., any significant marker in the same gene) [Sullivan, 2007]. The problem with less precise definitions is that the ‘‘replication’’ study partly becomes an exploratory analysis and is subject to the above described phenomenon that the more extensive the search process, the more likely it is that results will ‘‘replicate.’’ Researchers sometimes justify less precise definitions based on the complexity of the genetic effects on the psychiatric disorder (e.g., locus heterogeneity, different family history, disease subtypes, or differences by genetic ancestry). While such mechanisms might sometimes be true, these explanations tend to minimize the possibility that a considerably more parsimonious explanation is responsible for the results (i.e., a false positive association) [Sullivan, 2007].
Even when a precise definition of replication is used, declaring significance is not unambiguous. Rules such as ‘‘P-values smaller than 0.05 suggest replication’’ are arbitrary. That is, depending on factors such as effect sizes, sample sizes, and the prior probability, this rule may result in significant findings that have very different probabilities of being a true replication. More meaningful decision rules will be needed, such as the use of threshold P-values ensuring that a desired ratio of false discoveries to all reported discoveries in the literature is achieved. A very simple approach is to calculate the local FDR (fdr) discussed in the previous section. That is, each marker has potentially two states: (1) it is related to the disease and (2) it is unrelated to the disease. Given the test result in the replication study we can now estimate the (posterior) probability that it is unrelated to the disease:

\[
\mathrm{fdr}(i) = \Pr(H_{0i} = \text{true} \mid T = t_i) = \frac{p_{0i}\, f_0(t_i)}{p_{0i}\, f_0(t_i) + (1 - p_{0i})\, f_{c_i}(t_i)} \tag{3}
\]

where $H_{0i}$ states the null hypothesis that marker $i$ is unrelated to the disease, $p_{0i}$ is the (prior) probability that $H_{0i}$ is true, $t_i$ is the value of test statistic $T$ for marker $i$ in the replication study, $f_0$ is the density function under the null distribution, and $f_{c_i}$ is the density function under the alternative distribution, where $c_i$ is the (effect size) parameter for marker $i$ that affects the test statistic distribution under the alternative. To estimate fdr(i) we can use the effect size we observe in the replication study, because the effect size in the initial study is typically biased upward [Goring et al., 2001; Ioannidis et al., 2001]. Although $p_{0i}$ can be estimated, a simple approach would be to assume a range of values and examine for what values fdr(i) would be sufficiently small (say 0.1).
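The calculation in Equation (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the test statistic is assumed to be standard normal under the null hypothesis and normal with mean `effect` and unit variance under the alternative, and the function names, the grid of prior values, and the numbers in the usage example are hypothetical.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(t, p0, effect):
    """Posterior probability that a marker is unrelated to the disease.

    Implements p0*f0(t) / (p0*f0(t) + (1 - p0)*fc(t)), assuming (for
    illustration only) T ~ N(0, 1) under the null and T ~ N(effect, 1)
    under the alternative.
    """
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must lie between 0 and 1")
    null_part = p0 * normal_pdf(t)
    alt_part = (1.0 - p0) * normal_pdf(t, mean=effect)
    return null_part / (null_part + alt_part)


def largest_p0_meeting_target(t, effect, p0_grid, target=0.1):
    """Largest prior null probability in p0_grid with fdr <= target, or None."""
    acceptable = [p0 for p0 in p0_grid if local_fdr(t, p0, effect) <= target]
    return max(acceptable) if acceptable else None


if __name__ == "__main__":
    t_observed = 3.0   # hypothetical replication test statistic
    effect = 3.0       # hypothetical mean of T under the alternative
    for p0 in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"p0 = {p0:.2f}  fdr = {local_fdr(t_observed, p0, effect):.3f}")
```

Scanning a grid of prior values in this way shows how strongly the interpretation of the same replication result depends on the assumed proportion of markers without effect, which is the point of reporting fdr(i) rather than a bare P-value.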
Although more sophisticated methods are conceivable, this simple method would at least provide a better interpretation of what we call a replication and makes an attempt to start standardizing the criterion for declaring significance across studies.

Replication studies are often designed in an opportunistic fashion (e.g., dictated by the available sample). However, using the (theory of) multistage designs, replication studies can be designed in a cost-effective manner. In the multistage designs as intended here, all markers are genotyped and tested only in Stage 1. The most promising markers are then genotyped in Stage 2 using other/new samples in the second/replication stage [Saito and Kamatani, 2002; Satagopan et al., 2002; Aplenc et al., 2003; Satagopan and Elston, 2003; Van den Oord and Sullivan, 2003a,b; Lowe et al., 2004]. The theory of multistage designs allows the calculation of optimal sample sizes for the replication part, taking the sample size of Stage 1 into account if that is not under the control of the investigator. In addition, multistage designs offer the possibility to use information collected at the first stage(s) to design the second replication stage. In contrast, the design of a single-stage large-scale association study is completely based on assumptions about effect sizes, the proportion of markers with effect, etc. The problem is that if these assumptions are incorrect, the goals may not be achieved or could have been achieved at much lower cost. This idea of adaptive [Bauer and Brannath, 2004b] or self-designing studies [Fisher, 1998], where information from earlier stages is used to improve the design of later stages, is for instance used in clinical trials. A simple example in the present context would be to test for population stratification/ascertainment bias in a Stage 1 case-control sample and then perform a family-based follow-up study rather than another case-control study if needed.
Another example involves the use of statistical procedures to ensure adequately powered follow-up studies or to determine the P-value threshold ensuring that a sufficiently large proportion of markers with effect are selected for the next stage. The latter is important because, whereas false discoveries can always be eliminated in future studies, markers with effects that have been eliminated can never be recovered. A final example involves the careful integration of all the findings with those already out in the literature to maximize the probability that the relevant markers are selected for the next stage. Particularly for studies as expensive as GWAS, it may be better to perform interim analyses and adjust the study design if that turns out to be necessary to achieve the goals or save costs.

Rather than analyzing replication data separately, a joint analysis of the initial study and replication stage will give a more powerful test. Standard tests cannot be used for such combined data because only the markers that are significant in the first stage are selected for the second replication stage and the test statistics at both stages are dependent as a result of the partly overlapping data [Satagopan et al., 2002; Van den Oord and Sullivan, 2003a]. Simulations could be used instead [Lowe et al., 2004]. However, testing in both stages at a significance level of, say, α = 0.001 implies that out of every million simulated samples only one sample (1 million × 0.001 × 0.001) will be rejected at Stage 2. Thus, if one would like 1,000 rejected samples at Stage 2 to estimate the critical values needed for significance testing, a billion samples need to be simulated.
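The arithmetic behind this simulation burden can be made explicit. The short Python sketch below simply reproduces the product used in the text, which treats the rejection probabilities at the two stages as multiplying under the null hypothesis; that simplification, the function names, and the use of exact fractions (to avoid floating-point rounding) are choices made for this illustration.

```python
from fractions import Fraction


def expected_stage2_rejections(n_simulations, alpha1, alpha2):
    """Expected number of simulated null samples rejected at both stages.

    alpha1 and alpha2 are passed as strings (e.g. "0.001") so that the
    arithmetic is exact. Assumes the stage-wise rejection probabilities
    multiply, as in the text's back-of-the-envelope calculation.
    """
    return n_simulations * Fraction(alpha1) * Fraction(alpha2)


def simulations_needed(target_rejections, alpha1, alpha2):
    """Number of null samples to simulate to expect target_rejections at Stage 2."""
    return Fraction(target_rejections) / (Fraction(alpha1) * Fraction(alpha2))


if __name__ == "__main__":
    print(expected_stage2_rejections(10**6, "0.001", "0.001"))
    print(simulations_needed(1000, "0.001", "0.001"))
```

The required number of simulations grows with the inverse of the product of the two significance levels, which is why theoretical approximations become attractive for the stringent thresholds used in large-scale studies.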
If available, theoretical approximations may be preferred, such as a general approximation assuming (bivariate) normality of the test statistics at both stages [Satagopan et al., 2002], an approximation for when the test statistic is the difference between the allele frequencies in cases and controls [Skol et al., 2006], and an approximation for when Pearson’s Chi-square statistic is used to analyze a contingency table [Bukszár and Van den Oord, 2006]. However, the correct distribution of the test statistic may not be known and combining the raw data may not always be possible (e.g., if the first stage was done by another group or if the samples in the two stages are different, such as family-based versus case-control studies). In these instances, researchers could resort to combining the P-values across stages. Many techniques are available in the meta-analysis literature for this purpose [Sutton et al., 2000; Bauer and Brannath, 2004a], although most of these techniques may need to be slightly modified to account for the fact that in multistage designs there is a selection of the most significant P-values in Stage 1.

CONCLUSIONS

Proper methods for controlling false discoveries are important to prevent time and resources from being spent on leads that will eventually prove irrelevant. In the context of finding common variants for complex diseases, an ideal goal may be to achieve the same ratio of false discoveries divided by all rejected tests regardless of systematic differences between studies. This ideal can never be achieved using (traditional) methods where corrections for ‘‘multiple testing’’ are based on the number of tests carried out. This is because the number of tests is arbitrary, depending on factors such as budget, publication strategy, and genotyping capacity. Instead, methods that control the FDR may be more suitable. If the aim is to control the FDR at a low level (say 0.1), FDR control in GWAS will be relatively straightforward.
That is, a very simple method can be used that is likely to give results very similar to those of the more complex FDR variants (e.g., pFDR, or variants that include p0 estimates). Controlling the FDR in studies where very few markers are tested (e.g., candidate gene studies) is difficult. Rather than using an empirical method, the use of a predetermined P-value threshold such as 5 × 10⁻⁴ may sometimes be the best option.

Exploratory analyses are potentially an important source of false discoveries. The problem is to properly account for the phenomenon that the more extensive the search process, the more likely it is that a ‘‘significant’’ finding will be obtained. The possibility to control false discoveries increases when such searches are performed by computer algorithms rather than manually. However, as statistical techniques are still developing to account for the complexities involved in controlling false positives in exploratory searches, independent replication/validation may eventually be needed.

The answer to the questions of what constitutes a replication and how it can best be achieved is not straightforward. A strict definition of what constitutes a replication may be preferred to avoid that the replication study becomes an exploratory study and subject to the above described phenomenon that the more extensive the search process, the more likely it is that results will ‘‘replicate.’’ Depending on factors such as effect sizes and sample sizes, rules such as ‘‘P-values smaller than 0.05 suggest replication’’ may result in significant findings that have very different probabilities of being a true replication. Less arbitrary decision rules, such as rules based on the posterior probability that the replication is a false positive, will be needed to help interpret replication findings and start standardizing the criterion across replication studies.
Rather than being designed in an opportunistic fashion (e.g., dictated by the available sample), replication studies can be designed in a cost-effective manner using the (theory of adaptive) multistage designs.

ONLINE LINKS

Supplementary information is available on Edwin van den Oord’s web site: http://www.vipbg.vcu.edu/edwin/.

ACKNOWLEDGMENTS

I would like to thank Joseph McClay for his comments on an earlier draft of this article and Rebecca Ortiz for her help with preparing the article.

REFERENCES

Allison DB, Gadbury G, Heo M, Fernandez J, Lee C-K, Prolla TA, et al. 2002. A mixture model approach for the analysis of microarray gene expression data. Comput Stat Data Anal 39:1–20. Alon U. 2003. Biological networks: The tinkerer as an engineer. Science 301:1866–1867. Aplenc R, Zhao H, Rebbeck TR, Propert KJ. 2003. Group sequential methods and sample size savings in biomarker-disease association studies. Genetics 163:1215–1219. Aubert J, Bar-Hen A, Daudin JJ, Robin S. 2004. Determination of the differentially expressed genes in microarray experiments using local FDR. BMC Bioinformatics 5:125. Bauer P, Brannath W. 2004a. The advantages and disadvantages of adaptive designs for clinical trials. Drug Discov Today 9:351–357. Brown BW, Russell K. 1997. Methods of correcting for multiple testing: Operating characteristics. Stat Med 16:2511–2528. Bukszár J, Van den Oord EJCG. 2006. Optimization of two-stage genetic designs where data are combined using an accurate and efficient approximation for Pearson’s statistic. Biometrics 62:1132–1137. Bukszár J, Van den Oord EJCG. 2007a. Estimating effect sizes in large scale genetic association studies. Submitted for publication. Bukszár J, Van den Oord EJCG. 2007b. Estimating the proportion of markers without effect and average effect size in large scale genetic association studies. Submitted for publication. Carlborg O, Haley CS. 2004. Epistasis: Too often neglected in complex trait studies? Nat Rev Genet 5:618–625.
Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, Thomas G, et al. 2007. Replicating genotype-phenotype associations. Nature 447:655–660. Colhoun HM, McKeigue PM, Davey SG. 2003. Problems of reporting genetic associations with complex outcomes. Lancet 361:865–872. Collins FS. 2004. The case for a US prospective cohort study of genes and environment. Nature 429:475–477. Dahlman I, Eaves IA, Kosoy R, Morrison VA, Heward J, Gough SC, et al. 2002. Parameters for reliable results in genetic association studies in common disease. Nat Genet 30:149–150. Dalmasso C, Broet P, Moreau T. 2005. A simple procedure for estimating the false discovery rate. Bioinformatics 21:660–668. Dudbridge F, Koeleman BP. 2004. Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies. Am J Hum Genet 75:424– 435. Dunnet CW, Tamhane AC. 1992. A step up multiple test procedure. J Am Stat Assoc 87:162–170. Efron B. 2004a. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J Am Stat Assoc 99:96–104. Efron B. 2004b. The estimation of prediction error: Covariance penalties and cross-validation. J Am Stat Assoc 99:619–632. Efron. 2006. Correlation and Large-Scale Simultaneous Significance Testing. Stanford Technical Report. Efron B, Tibshirani R. 2002. Empirical bayes methods and false discovery rates for microarrays. Genet Epidemiol 23:70–86. Efron B, Tibshirani R, Storey JD, Tusher VG. 2001. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160. Fernando RL, Nettleton D, Southey BR, Dekkers JC, Rothschild MF, Soller M. 2004. Controlling the proportion of false positives in multiple dependent tests. Genetics 166:611–619. Finner H, Roters M. 2001. On the false discovery rate and expected Type I errors. Biometrical J 8:985–1005. Fisher LD. 1998. Self-designing clinical trials. Stat Med 17:1551–1562. Freimer N, Sabatti C. 2004. 
The use of pedigree, sib-pair and association studies of common diseases for genetic mapping and epidemiology. Nat Genet 36:1045–1051. Glonek G, Soloman P. 2003. Discussion of resampling-based multiple testing for microarray data analysis by Ge, Dudoit and Speed. Test 12: 1–77. Goodacre R. 2005. Making sense of the metabolome using evolutionary computation: Seeing the wood with the trees. J Exp Bot 56:245–254. Bauer P, Brannath W. 2004b. The advantages and disadvantages of adaptive designs for clinical trials. Drug Discov Today 9:351–357. Goring HH, Terwilliger JD, Blangero J. 2001. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 69:1357–1369. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300. Hahn LW, Ritchie MD, Moore JH. 2003. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 19:376–382. Benjamini Y, Hochberg Y. 2000. On adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat 25:60–83. Hastie T, Tibshirani R, Friedman J. 2001. The elements of statistical learning: Data mining, inference, and prediction. New York: Springer Verlag. Black MA. 2004. A note on the adaptive control of false discovery rates. J R Stat Soc B 66:297–304. Hochberg Y, Benjamini Y. 1990. More powerful procedures for multiple significance testing. Stat Med 9:811–818. Blangero J. 2004. Localization and identification of human quantitative trait loci: King harvest has surely come. Curr Opin Genet Dev 14:233– 240. Holm S. 1979. A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70. Bollen KA. 2002. Latent variables in psychology and the social sciences. Annu Rev Psychol 53:605–634. Hsueh H, Chen J, Kodell R. 2003. 
Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat 13:675–689. Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG. 2001. Replication validity of genetic association studies. Nat Genet 29:306–309. Jung SH, Bang H, Young S. 2005. Sample size calculation for multiple testing in microarray data analysis. Biostatistics 6:157–169. Shen X, Ye J. 2002. Adaptive model selection. J Am Stat Assoc 97:210–221. Shen X, Huang H, Ye J. 2004. The estimation of prediction error: Covariance penalties and cross-validation: Comment. J Am Stat Assoc 99:634–637. Kell DB. 2002. Genotype-phenotype mapping: Genes as computer programs. Trends Genet 18:555–559. Shen-Orr SS, Milo R, Mangan S, Alon U. 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31:64–68. Kell DB. 2004. Metabolomics and systems biology: Making sense of the soup. Curr Opin Microbiol 7:296–307. Šidák Z. 1967. Rectangular confidence regions for the means of multivariate distributions. J Am Stat Assoc 62:626–633. Kennedy JL, Farrer LA, Andreasen NC, Mayeux R, George-Hyslop P. 2003. The genetics of adult-onset neuropsychiatric disease: Complexities and conundra? Science 302:822–826. Skol AD, Scott LJ, Abecasis GR, Boehnke M. 2006. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38:209–213. Korn EL, Troendle J, McShane L, Simon R. 2004. Controlling the number of false discoveries: Application to high-dimensional genomic data. J Stat Plann Inference 124:379–398. Stone M. 1974. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B 36:111–147. Kuo P, Bukszar J, Van den Oord EJCG. 2007. Estimating the number and size of the main effects in genome-wide case-control association studies. BMC Proc. Lander E, Kruglyak L. 1995.
Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat Genet 11:241–247. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799–804. Li W, Reich J. 2000. A complete enumeration and classification of two-locus disease models. Hum Hered 50:334–349. Liao JG, Lin Y, Selvanayagam ZE, Shih WJ. 2004. A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics 20:2694–2701. Lin DY. 2004. An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics 21:781–787. Lowe CE, Cooper JD, Chapman JM, Barratt BJ, Twells RC, Green EA, et al. 2004. Cost-effective analysis of candidate genes using htSNPs: A staged approach. Genes Immun 5:301–305. Meinshausen N, Rice J. 2006. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann Stat 34:373–393. Morton NE. 1955. Sequential tests for the detection of linkage. Am J Hum Genet 7:277–318. Mosig MO, Lipkin E, Khutoreskaya G, Tchourzyna E, Soller M, Friedmann A. 2001. A whole genome scan for quantitative trait loci affecting milk protein percentage in Israeli-Holstein cattle, by means of selective milk DNA pooling in a daughter design, using an adjusted false discovery rate criterion. Genetics 157:1683–1698. Pounds S, Cheng C. 2004. Improving false discovery rate estimation. Bioinformatics 20:1737–1745. Pounds S, Morris SW. 2003. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of P-values. Bioinformatics 19:1236–1242. Risch N, Merikangas K. 1996. The future of genetic studies of complex human diseases. Science 273:1516–1517. Rissanen J. 1978. Modeling by shortest data description. Automatica 14:465–471. Sabatti C, Service S, Freimer N. 2003. 
False discovery rate in linkage and association genome screens for complex disorders. Genetics 164:829–833. Saito A, Kamatani N. 2002. Strategies for genome-wide association studies: Optimization of study designs by the stepwise focusing method. J Hum Genet 47:360–365. Satagopan JM, Elston RC. 2003. Optimal two-stage genotyping in population-based association studies. Genet Epidemiol 25:149–157. Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, Begg CB. 2002. Two-stage designs for gene-disease association studies. Biometrics 58:163–170. Schweder T, Spjøtvoll E. 1982. Plots of P-values to evaluate many tests simultaneously. Biometrika 69:493–502. Storey J. 2002. A direct approach to false discovery rates. J R Stat Soc B 64:479–498. Storey J. 2003. The positive false discovery rate: A Bayesian interpretation and the q-value. Ann Stat 31:2013–2035. Storey J, Tibshirani R. 2003. Statistical significance for genome-wide studies. Proc Natl Acad Sci 100:9440–9445. Storey J, Taylor JE, Siegmund D. 2004. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J R Stat Soc B 66:187–205. Strohman R. 2002. Maneuvering in the complex path from genotype to phenotype. Science 296:701–703. Sullivan PF. 2007. Spurious genetic associations. Biol Psychiatry 61:1121–1126. Sutton A, Abrams K, Jones D, Sheldon T, Song F. 2000. Methods for meta-analysis in medical research. Chichester, UK: Wiley. Thomas DC, Clayton DG. 2004. Betting odds and genetic associations. J Natl Cancer Inst 96:421–423. Tsai CA, Hsueh HM, Chen JJ. 2003. Estimation of false discovery rates in multiple testing: Application to gene microarray data. Biometrics 59:1071–1081. Turkheimer FE, Smith CB, Schmidt K. 2001. Estimation of the number of ‘‘true’’ null hypotheses in multivariate analysis of neuroimaging data. Neuroimage 13:920–930. Van den Oord EJCG. 2005. Controlling false discoveries in candidate gene studies.
Mol Psychiatry 10:230–231. Van den Oord EJCG, Snieder H. 2002. Including measured genotypes in statistical models to study the interplay of multiple factors affecting complex traits. Behav Genet 32:1–22. Van den Oord EJCG, Sullivan PF. 2003a. A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Human Heredity 56:188–199. Van den Oord EJCG, Sullivan PF. 2003b. False discoveries and models for gene discovery. Trends Genet 19:537–542. Van den Oord EJCG, MacGregor AJ, Snieder H, Spector TD. 2004. Modeling with measured genotypes: Effects of the vitamin D receptor gene, age, and latent genetic and environmental factors on measures of bone mineral density. Behav Genet 34:197–206. Vazquez A, Dobrin R, Sergi D, Eckmann JP, Oltvai ZN, Barabasi AL. 2004. The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proc Natl Acad Sci USA 101:17940–17945. Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. 2004. Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. J Natl Cancer Inst 96:434–442. Westfall P, Young SS. 1993. Resampling-based multiple testing. New York: Wiley. Ye J. 1998. On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc 93:120–131. Zaykin D, Young S, Westfall P. 2000. Using the false discovery rate in the genetic dissection of complex traits. Genetics 154:1917–1918.