Source: Myopia Animal Models to Clinical Trials (Beuerman, Saw, Tan, 2009)

219 Statistical Analysis of Genome-wide Association Studies for Myopia

of refractive errors, and unaffected status is defined when neither eye reaches the threshold. For quantitative phenotypes, however, we face the question of whether the analysis should be conducted for each eye independently or on a summary measure of the two eyes. Furthermore, how should we interpret the results if the right and left eyes lead to discordant findings? We discuss this topic further in the correlated phenotype section.

Study Design

Statistical methods strongly influence the study design of any research project, and association studies in human genetics are no exception. The two primary study designs are case-control and family-based association studies. The former uses unrelated, population-based case-control samples, and the latter uses familial samples such as parents and full siblings.

The case-control design is known to have higher statistical power than the family-based design at the same sample size, but its results are strongly influenced by sample selection.16 In addition, the case-control design is prone to spurious association results caused by undetected latent population structure. However, reservations about the case-control design have eased considerably since the development of several sophisticated statistical methods, including GENOMIC CONTROL,17 STRUCTURE,18 and EIGENSTRAT,19 which can minimize the effect of population structure on association tests of unrelated samples. While each of the two study designs still has distinct advantages over the other, sample availability is often the primary driving force in the selection of the study design. Epidemiological research on myopia has a larger and longer-established community than genetic research on myopia. Several epidemiological cohorts of myopia have collected specimens from participants, which provides a great resource for myopia genetic research. As with most GWA studies published to date, most GWA studies for myopia will therefore be based primarily on population-based samples, owing to the available sample resources and the total genotyping cost.

Although the cost per genotype is decreasing, the total cost of a GWA study is still high. Two-stage or multi-stage designs have been proposed to retain statistical power while reducing the genotyping cost.20–22 The idea of the two-stage design is to conduct the GWA study in a first set of samples of a reasonable and affordable size (first stage), and

220 Y.J. Li and Q. Fan

follow up a subset of SNPs in another independent set of samples (second stage). Skol et al.20 suggested performing a joint analysis of the pooled samples from both stages in order to maximize the power to detect association, which differs from the conventional view of replication that treats the second-stage results as an independent dataset. Regardless of which analytical approach is taken, the best practice for declaring GWA findings is to seek replication evidence in as many independent datasets as possible.
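As a rough illustration of the joint analysis idea, the sketch below combines per-stage z-statistics for a single SNP, weighting each stage by the square root of its share of the total sample. The function names and example numbers are hypothetical, and the weighted-sum form is an assumption about how the two stages are combined rather than a reproduction of the cited procedure.

```python
import math

def joint_z(z_stage1, z_stage2, frac_stage1):
    """Weighted sum of stage-wise z-statistics (assumed weighting, see text)."""
    if not 0.0 < frac_stage1 < 1.0:
        raise ValueError("frac_stage1 must lie strictly between 0 and 1")
    w1 = math.sqrt(frac_stage1)        # weight for the first stage
    w2 = math.sqrt(1.0 - frac_stage1)  # weight for the second stage
    return w1 * z_stage1 + w2 * z_stage2

def two_sided_p(z):
    """Two-sided p-value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical SNP: 40% of all samples were genotyped in the first stage.
z = joint_z(3.0, 2.5, frac_stage1=0.4)
print(round(z, 3), two_sided_p(z))
```

Because the squared weights sum to one, the combined statistic is again standard normal under the null hypothesis when the two stages are independent.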

Genotyping and Quality Controls

GWA studies rely on commercial SNP chips, predominantly from Affymetrix (http://www.affymetrix.com/) and Illumina (http://www.illumina.com/). The currently available SNP chips (>300 K SNPs) all have the ability to detect copy number variants (CNVs), which are chromosomal deletions or duplications. This makes GWA studies more attractive, as one can investigate both SNP and CNV associations with the phenotypes of interest at the same time. The most commonly used criterion for selecting a SNP chip is its global coverage across the genome, that is, the fraction of common SNPs that are tagged by the SNPs on the chip.23–25 The latest products, the Affymetrix Human SNP Array 6.0 and the Illumina HumanOmni1-Quad, aim for this goal with a dramatically increased number of SNPs compared with earlier products. The Affymetrix 6.0 includes more than 906,600 SNPs and more than 946,000 additional probes for the detection of copy number variation. The Illumina HumanOmni1-Quad BeadChip, a complete redesign of the Human1M-Duo array, contains over 1 million markers, including aggressively selected SNPs and probes from all three HapMap phases, the 1000 Genomes Project (http://www.1000genomes.org/page.php), and published studies. Specifically, it contains ~18 K SNPs targeting four 1 Mb regions known to be associated with human diseases, over 62 K non-synonymous SNPs, and SNPs targeting new coding variants. This chip has a median spacing of 1.5 kb to ensure high resolution for CNV detection. Although studies using these newly marketed SNP chips have not yet been reported, this level of global coverage should increase the power of GWA studies.

Regardless of which type of SNP chip is used, a rigorous quality control (QC) procedure is essential to the success of the study.


Although both Affymetrix and Illumina have their own genotype calling algorithms for raw data analysis, one should make sure that a best-practice genotype calling protocol is applied.26 Several QC checkpoints are typically examined in a GWA study, including the sample call rate, Hardy–Weinberg equilibrium (HWE) for each marker in control samples, minor allele frequency (MAF), genotype missingness per marker, and population structure. Although there is no gold standard for these QC checkpoints, examples of thresholds that we would recommend are: excluding samples with call rates below 96% or 98%,26 and excluding SNPs that are out of HWE (p < 10^-7) in control samples, have MAF < 0.01, or have genotype missingness > 10%. Population structure is another important item to investigate during QC and is described in more detail in the next section.
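The thresholds above can be sketched as a small filter. This is a minimal illustration, not a production pipeline: genotypes are assumed to be coded as 0, 1, or 2 copies of one allele with None for a missing call, the HWE check uses a simple 1-df chi-square goodness-of-fit test rather than an exact test, and all function names are hypothetical.

```python
import math

def sample_call_rate(sample_genotypes):
    """Fraction of SNPs with a non-missing call for one sample."""
    called = sum(g is not None for g in sample_genotypes)
    return called / len(sample_genotypes)

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """1-df chi-square goodness-of-fit p-value for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2.0))  # chi-square (1 df) upper tail

def snp_passes_qc(genotypes, is_control,
                  maf_min=0.01, miss_max=0.10, hwe_min=1e-7):
    """Apply missingness, MAF, and control-only HWE thresholds to one SNP."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1.0 - len(called) / len(genotypes)
    if missing_rate > miss_max:
        return False
    freq = sum(called) / (2 * len(called))
    if min(freq, 1.0 - freq) < maf_min:
        return False
    controls = [g for g, c in zip(genotypes, is_control) if c and g is not None]
    counts = [controls.count(k) for k in (0, 1, 2)]
    return hwe_chi2_pvalue(*counts) >= hwe_min

# Hypothetical SNP typed in 100 controls, in exact HWE proportions.
snp = [0] * 25 + [1] * 50 + [2] * 25
print(snp_passes_qc(snp, [True] * 100))
```

Samples would be dropped first, using sample_call_rate with the 96% or 98% cut-off, before the per-SNP filter is applied to the remaining samples.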

Population Structure

Early views of the role of population structure in genetic association studies of unrelated individuals focused on the concern that cryptic population substructure would raise the false positive rate of statistical tests above their nominal level. For instance, in a case-control dataset, assume that there are two underlying subpopulations with different allele frequencies at a SNP and that the number of cases is disproportionately high in one subpopulation. Under this scenario, failure to account for population stratification, which confounds the comparison through the allele frequency differences, could result in a significant false positive association between the SNP and disease status.
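The scenario just described can be checked with a small worked example. The allele counts below are invented for illustration: within each hypothetical subpopulation, cases and controls have identical allele frequencies, yet pooling the two subpopulations produces a large allelic chi-square statistic.

```python
def allelic_chi2(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical allele counts: (case minor, case major, control minor, control major).
# Subpopulation A: minor allele frequency 0.2, 100 cases and 400 controls.
# Subpopulation B: minor allele frequency 0.6, 400 cases and 100 controls.
strata = {
    "A": (40, 160, 160, 640),
    "B": (480, 320, 120, 80),
}

within = {name: allelic_chi2(*counts) for name, counts in strata.items()}
pooled_counts = [sum(col) for col in zip(*strata.values())]
pooled = allelic_chi2(*pooled_counts)

print(within)   # no association inside either subpopulation
print(pooled)   # strong spurious association after pooling
```

The pooled signal arises only because cases are concentrated in the subpopulation with the higher allele frequency, which is exactly the confounding that the methods in this section aim to control.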

A conventional approach is to select as homogeneous a dataset as possible at the design stage, for example by matching cases and controls to minimize the population stratification effect. However, most studies have subtle stratification or have difficulty matching epidemiological or environmental background. Interestingly, the GWA study from the Wellcome Trust Case Control Consortium (WTCCC) demonstrated that, as long as cases and controls are well matched for broad ethnic background and solid exclusion criteria are in place, residual substructure has minimal effect on the type I error.2

Concern over the false positive rates of population-based association studies has led to a number of different approaches to control for the presence of population structure, including “genomic control,”17 clustering methods such as the STRUCTURE and STRAT methods,18,27 principal components analysis (PCA), and alternative family-based study designs.


Genomic control requires data on null SNPs to estimate a variance inflation factor that is used to directly correct the test statistic.28 However, the assumption that the inflation factor is constant across the whole genome may oversimplify the variation between SNPs. Hence, adjustment for the inflation factor is preferred after population substructure has already been addressed to some degree, for example by applying the GC method to a dataset with a similar (less heterogeneous) background.
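A minimal sketch of this correction follows, assuming 1-df chi-square association statistics and the commonly used median-based estimate of the inflation factor. The function names and the example statistics are hypothetical, and clipping the factor at 1 is a convention assumed here.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(null_chi2):
    """Estimate the variance inflation factor from statistics at null SNPs."""
    lam = statistics.median(null_chi2) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)  # assumed convention: never deflate the statistics

def gc_correct(chi2, lam):
    """Rescale a 1-df chi-square statistic by the inflation factor."""
    return chi2 / lam

# Hypothetical statistics observed at three null SNPs.
lam = inflation_factor([0.2, 0.9098728, 3.0])
print(lam, gc_correct(10.0, lam))
```

Because a single factor rescales every SNP, the correction cannot distinguish markers whose allele frequencies differ strongly between subpopulations from those that do not, which is the limitation noted above.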

A different approach, implemented in the STRUCTURE and STRAT software18,27 (http://pritch.bsd.uchicago.edu/software.html), is a multistage procedure that first identifies any subpopulations using unlinked markers, assigns individuals to putative subpopulations, and then uses the subpopulation clusters as a covariate in tests for association with the disease phenotype. The STRUCTURE method is extremely computationally demanding. One assumption that must be made for the STRUCTURE analysis is the number of potential subpopulations, K, in the dataset. Given K subpopulations, the probability of membership in each subpopulation is estimated for every individual. Results may therefore differ for different values of K, and a STRUCTURE analysis may need to be run several times to tune the K parameter.

Reich et al.29 proposed a computationally feasible approach to detect and correct for population stratification. In their approach, PCA is used to model ancestry differences between cases and controls. The EIGENSTRAT approach identifies ancestry differences among samples along the eigenvectors of a covariance matrix. For instance, Fig. 1 depicts the relationship among the first three principal components (PC1, PC2, and PC3) in the SCORM dataset, which suggests five outliers to be excluded from further association analyses. In addition to excluding such samples, the EIGENSTRAT approach adjusts for the amounts attributable to ancestry along the top eigenvectors (http://genepath.med.harvard.edu/~reich/Software.htm). Patterson et al.30 pointed out that top eigenvectors could be caused by a large set of markers in a high (or complete) LD block. Hence, they recommend pruning markers in tight LD before performing PCA.
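The idea can be sketched in a few lines of plain Python. This is a toy illustration under several assumptions, not the EIGENSTRAT software itself: genotypes are coded 0/1/2 with no missing calls, only the first eigenvector is extracted (by power iteration), and the invented data contain two clearly separated groups of three samples. All function names are hypothetical.

```python
import math

def standardize_columns(genotypes):
    """Center each SNP and scale by sqrt(p(1-p)), p being the allele frequency."""
    n, m = len(genotypes), len(genotypes[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        mean = sum(row[j] for row in genotypes) / n
        p = mean / 2.0
        scale = math.sqrt(p * (1.0 - p))
        if scale == 0.0:
            continue  # monomorphic SNP: carries no ancestry information
        for i in range(n):
            out[i][j] = (genotypes[i][j] - mean) / scale
    return out

def sample_covariance(x):
    """Sample-by-sample covariance matrix of standardized genotypes."""
    n, m = len(x), len(x[0])
    return [[sum(x[i][k] * x[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]

def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector by power iteration (unit length)."""
    n = len(matrix)
    v = [1.0 / (i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def remove_ancestry(values, pc):
    """Subtract the part of a per-sample vector explained by one eigenvector."""
    gamma = sum(a * b for a, b in zip(values, pc)) / sum(b * b for b in pc)
    return [a - gamma * b for a, b in zip(values, pc)]

# Invented data: rows are samples, columns are SNPs; two groups of three.
genotypes = [[0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0],
             [2, 2, 1, 2], [2, 1, 2, 2], [1, 2, 2, 2]]
pc1 = top_eigenvector(sample_covariance(standardize_columns(genotypes)))
phenotype = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # fully confounded with group
print(pc1)
print(remove_ancestry(phenotype, pc1))
```

In this toy case the first eigenvector separates the two groups, and a phenotype that merely tracks group membership is reduced to approximately zero after adjustment, so it can no longer produce an association driven by ancestry alone.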

Association Tests

As association studies have dominated human genetics in the past decade, many association methods (family-based or case-control based) have been developed (e.g. Refs. 31–33). The analysis strategies for GWA data are generally the same as those for candidate gene association studies, determined by the study design and the type of phenotypes to be tested. However, new challenges arise from the large amount of data derived from a GWA study. One key consideration is whether efficient association analysis can be performed in a reasonable timeframe. A few free and commercial computer programs, such as PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/), Golden Helix (http://www.helixtree.com/index.html), and Syllego (http://www.rosettabio.com/products/syllego), were therefore developed for processing genome-wide SNP data, including data management and analysis.

Figure 1. The first three principal components (PC1, PC2, and PC3) in the SCORM dataset. Panels: PC1 vs. PC2 (eigenvector 1 against eigenvector 2) and PC1 vs. PC3 (eigenvector 1 against eigenvector 3).

PLINK in particular, a whole-genome association analysis toolset, is a popular and widely used free program that has evolved quickly to handle not only genome-wide SNP data but also CNV data. PLINK implements a series of modules for data management, quality control checks, population stratification, association analysis, and more. Most importantly, PLINK converts the pedigree file (pedigree and genotype information) to a binary format, which significantly decreases the computational time for a genome-wide scan and makes PLINK an efficient tool for GWA studies.


Depending on the phenotype and the hypothesis to be tested, different association methods can be applied. For unrelated population-based samples, Fisher's exact test, the Cochran–Armitage trend test, and logistic regression are commonly used for qualitative traits, and the linear model is used for quantitative traits. All of these methods have been implemented in PLINK. Both logistic regression and the linear model have the flexibility to incorporate covariates that may have confounding effects on the phenotype.
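To make one of these tests concrete, here is a minimal sketch of the Cochran–Armitage trend test for a single SNP, assuming additive weights (0, 1, 2) for the three genotypes and a 1-df chi-square reference distribution. The function name and the genotype counts are hypothetical, and this is not PLINK's implementation.

```python
import math

def trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) and its p-value.

    case_counts and control_counts hold the numbers of individuals
    carrying 0, 1, and 2 copies of the tested allele.
    """
    n_geno = [r + s for r, s in zip(case_counts, control_counts)]
    total_cases = sum(case_counts)
    total_controls = sum(control_counts)
    total = total_cases + total_controls
    sum_wr = sum(w * r for w, r in zip(weights, case_counts))
    sum_wn = sum(w * n for w, n in zip(weights, n_geno))
    sum_w2n = sum(w * w * n for w, n in zip(weights, n_geno))
    numerator = total * (total * sum_wr - total_cases * sum_wn) ** 2
    denominator = total_cases * total_controls * (total * sum_w2n - sum_wn ** 2)
    chi2 = numerator / denominator if denominator else 0.0
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical genotype counts for 100 cases and 100 controls.
chi2, p_value = trend_test([30, 50, 20], [50, 40, 10])
print(round(chi2, 4), p_value)
```

Unlike an allelic test, the trend test works on genotype counts and does not rely on Hardy–Weinberg equilibrium in the combined sample.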

For family-based data, many extensions of the TDT method31 and new family-based association methods and computer programs have been developed since the TDT was introduced, including PDT,32 FBAT,34 APL,35 and QTDT,36–40 to name just a few. Although these programs for family-based association tests have been used extensively in association studies, computational time is a concern in the GWA setting.
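For orientation, the basic TDT statistic can be written in a few lines. The sketch assumes the standard McNemar-type form for a biallelic marker, in which only heterozygous parents are informative; the function name and the counts are hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """TDT chi-square (1 df) and p-value for one allele.

    transmitted: times heterozygous parents passed the allele to an
    affected child; untransmitted: times they passed the other allele.
    """
    informative = transmitted + untransmitted
    if informative == 0:
        return 0.0, 1.0
    chi2 = (transmitted - untransmitted) ** 2 / informative
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts from 100 informative transmissions.
chi2, p_value = tdt(60, 40)
print(chi2, round(p_value, 4))
```

Because the comparison is made within families, the statistic is not inflated by population stratification, which is the main attraction of family-based designs noted earlier.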

While PLINK is an efficient tool, the family-based association methods implemented in PLINK are limited to TDT, parent-TDT, and parent-of-origin analyses using parent-offspring triad datasets for qualitative traits, and Qfam for quantitative traits. Qfam is an ad hoc procedure analogous to the between/within orthogonal model proposed by Fulker et al.39 and Abecasis et al.40 and implemented in the QTDT package (http://www.sph.umich.edu/csg/abecasis/QTDT/), with some modifications that use a permutation procedure to account for familial relationships (see the PLINK website).

For a GWA dataset with family structures other than triads (e.g. discordant sib pairs or nuclear families of multiple siblings with or without parents), one will need to turn to the existing family-based association programs. Regardless of which association methods and programs are chosen, it is important that the tests and statistical methods suit the data. Investigators should evaluate their own dataset to determine the data analysis strategy.

Association tests are generally performed for a single marker at a time or for haplotypes of multiple markers within a feasible window size. So far, most GWA studies have focused on single-marker association tests first. Computational time and the choice of strategy are the main concerns for genome-wide haplotype analysis, even though haplotype association methods and programs have been developed extensively in the past (e.g. Haplo.stat, APL, etc.).33,41 The use of sliding windows with a fixed window size is a popular approach, but it does not capture the joint effect of distantly located SNPs. One common practice is to conduct single-locus association analyses first to identify target regions (or genes)