
Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates


Plasmode is a term coined several years ago to describe data sets that are derived from real data but for which some truth is known. Omic techniques, most especially microarray and genomewide association studies, have catalyzed a new zeitgeist of data sharing that is making data and data sets publicly available on an unprecedented scale. Coupling such data resources with a science of plasmode use would allow statistical methodologists to vet proposed techniques empirically (as opposed to only theoretically) and with data that are by definition realistic and representative. We illustrate the technique of empirical statistics by consideration of a common task when analyzing high dimensional data: the simultaneous testing of hundreds or thousands of hypotheses to determine which, if any, show statistical significance warranting follow-on research. The now-common practice of multiple testing in high dimensional experiment (HDE) settings has generated new methods for detecting statistically significant results. Although such methods have heretofore been subject to comparative performance analysis using simulated data, simulating data that realistically reflect data from an actual HDE remains a challenge. We describe a simulation procedure using actual data from an HDE where some truth regarding parameters of interest is known. We use the procedure to compare estimates for the proportion of true null hypotheses, the false discovery rate (FDR), and a local version of FDR obtained from 15 different statistical methods.
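The workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not specify how the plasmodes are built, so the construction here (randomly splitting single-condition samples into two pseudo-groups and adding a known shift to a chosen subset of genes) is one plausible reading. The names make_plasmode, approx_p_value, benjamini_hochberg and storey_pi0 are hypothetical, the baseline matrix is a randomly generated stand-in for real data, and the p-values use a normal approximation to the Welch t statistic to keep the example dependency-free. The two estimators shown are simple forms of the Benjamini-Hochberg step-up rule and a Storey-type estimate of the proportion of true nulls; they stand in for, and do not reproduce, the 15 methods compared in the article.

```python
import math
import random
import statistics


def make_plasmode(baseline, n_alt, effect_sd, rng):
    """Build one plasmode from `baseline` (genes x samples, one condition).

    Columns are split at random into two pseudo-groups, so every gene starts
    as a true null. A shift of `effect_sd` within-gene standard deviations is
    then added to group B for `n_alt` randomly chosen genes (an assumption).
    """
    n_samples = len(baseline[0])
    cols = list(range(n_samples))
    rng.shuffle(cols)
    half = n_samples // 2
    cols_a, cols_b = cols[:half], cols[half:]
    alt_genes = set(rng.sample(range(len(baseline)), n_alt))
    group_a, group_b, is_null = [], [], []
    for g, row in enumerate(baseline):
        a = [row[c] for c in cols_a]
        b = [row[c] for c in cols_b]
        if g in alt_genes:
            shift = effect_sd * statistics.stdev(row)
            b = [x + shift for x in b]
        group_a.append(a)
        group_b.append(b)
        is_null.append(g not in alt_genes)
    return group_a, group_b, is_null


def approx_p_value(a, b):
    """Two-sided p-value for a Welch t statistic via a normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(t) / math.sqrt(2))


def benjamini_hochberg(p_values, q):
    """Indices rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank  # keep the largest rank that satisfies the bound
    return set(order[:k])


def storey_pi0(p_values, lam=0.5):
    """Storey-type estimate of the proportion of true nulls, capped at 1."""
    tail = sum(p > lam for p in p_values)
    return min(1.0, tail / (len(p_values) * (1.0 - lam)))


rng = random.Random(0)
# Stand-in for real single-condition data (1000 genes x 40 samples); in a real
# plasmode study this matrix would come from an actual experiment.
baseline = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(1000)]
group_a, group_b, is_null = make_plasmode(baseline, n_alt=200, effect_sd=1.5, rng=rng)

p_values = [approx_p_value(a, b) for a, b in zip(group_a, group_b)]
rejected = benjamini_hochberg(p_values, q=0.05)

# Because the truth is known, estimates can be scored against it directly.
false_hits = sum(is_null[i] for i in rejected)
fdp = false_hits / max(len(rejected), 1)
print("true proportion of nulls:", sum(is_null) / len(is_null))
print("estimated proportion of nulls:", storey_pi0(p_values))
print("rejections:", len(rejected), "realized false discovery proportion:", fdp)
```

Repeating the construction over many random splits and spike-in settings would give a distribution of estimation errors for each method, which is the kind of empirical comparison the abstract has in mind.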


Published in: PLoS Genet 4(6): e1000098
Category: Research Article
DOI: https://doi.org/10.1371/journal.pgen.1000098



