Speakers: Mykyta Artomov (Nationwide Children’s Hospital, Columbus, OH, USA), Yakov Tsepilov (Wellcome Sanger Institute, Cambridge, UK)
- Genotype imputation should be used for two basic reasons: (i) harmonizing the data obtained with different platforms to improve comparison and meta-analysis; (ii) increasing the probability of observing the true causal variant among variants used for association testing. Genotype imputation should not be considered a tool for “fixing” data with high missingness rates;
- Imputation relies on an external genotype panel that should be matched as closely as possible to the studied population (the exact same population is not necessary, but the genetic distance should be as small as possible). Commonly used panels are 1000 Genomes, the Haplotype Reference Consortium (HRC), and NHLBI TOPMed.
- Imputation quality can be evaluated using a PCA plot (imputation should not shift the data substantially) or by computing polygenic risk scores and comparing them with the observed phenotypes. Another way of testing imputation is to mask a subset of observed sites and compare the imputed genotypes with the original ones for this subset of variants (a concordance sketch follows this list).
- In a case-control setting, quality filtering should be performed separately for the case and control cohorts, and the thresholds should be similar between the two subsets, so that different proportions of samples are not removed from each. The set of thresholds (especially for human genetics) is largely standard and should not be heavily tuned to a particular dataset (a filtering sketch follows this list);
- Batch effects have to be evaluated, to make sure that no batch deviates substantially from the others. Even if no batch effect is noticeable, including batch as a covariate during association testing is helpful (a batch-covariate sketch follows this list). At the same time, one has to bear in mind that each additional covariate reduces the power of association testing.
- Linear mixed models for association testing enable the analysis of related individuals, but are more computationally intensive, which may be problematic for large-scale data (a GRM sketch follows this list);
- Multiple testing adjustment in GWAS is typically performed by applying a fixed genome-wide significance threshold, conventionally 5 × 10⁻⁸ for common variants (stricter cutoffs such as 1 × 10⁻⁸ are sometimes used for denser, sequencing-based data). Other methods, such as false discovery rate (FDR) correction, can also be used in certain cases (a threshold/FDR sketch follows this list). Variants not reaching significance after proper correction for multiple testing should never be considered or reported as associations, irrespective of the patterns observed on Q-Q plots or heritability estimates.
- Downstream processing of GWAS summary statistics includes trait-level methods (genetic correlation, pathway analysis) and locus-level methods. The latter group includes statistical fine-mapping, which is used to identify credible sets of likely causal variants for each locus (a credible-set sketch follows this list). Variants included in the credible set after fine-mapping can be considered causal only if the following assumptions are met: (i) the causal variant is genotyped; (ii) there are no outliers; and (iii) secondary signals are properly taken into account.
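The bullet on imputation quality mentions masking a subset of observed sites and comparing the imputed genotypes with the originals. Below is a minimal sketch of that comparison, assuming two arrays covering the masked sites: `true_genotypes` (0/1/2 hard calls) and `imputed_dosages` (expected allele counts from the imputation output); the array names and the concordance/dosage-r² metrics are illustrative choices, not part of any specific tool.

```python
import numpy as np

def imputation_accuracy(true_genotypes, imputed_dosages):
    """Compare imputed dosages with the original hard calls at masked sites."""
    truth = np.asarray(true_genotypes, dtype=float)
    dosage = np.asarray(imputed_dosages, dtype=float)
    # Concordance: fraction of masked sites where the best-guess genotype equals the truth
    concordance = np.mean(np.rint(dosage) == truth)
    # Dosage r^2: squared correlation between the true allele count and the imputed dosage
    r = np.corrcoef(truth, dosage)[0, 1]
    return concordance, r ** 2

# Toy example (illustrative values only)
truth = [0, 1, 2, 1, 0, 2, 1, 0]
dosage = [0.1, 1.2, 1.9, 0.8, 0.0, 2.0, 1.1, 0.3]
print(imputation_accuracy(truth, dosage))
```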
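For the variant-level quality thresholds mentioned above, the following sketch applies a commonly cited set of cutoffs (call rate, minor allele frequency, Hardy–Weinberg p-value) to a hypothetical per-variant summary table; the column names, toy values, and exact cutoffs are assumptions for illustration, and the same filter would be run separately on the case and control cohorts with comparable thresholds.

```python
import pandas as pd

# Hypothetical per-variant QC summary table (column names are assumptions for illustration)
variants = pd.DataFrame({
    "call_rate": [0.99, 0.93, 0.999],
    "maf":       [0.25, 0.004, 0.10],
    "hwe_p":     [0.5, 1e-12, 0.08],
})

# Commonly used, largely dataset-independent variant-level thresholds
keep = (
    (variants["call_rate"] >= 0.98) &   # variant call rate (1 - missingness)
    (variants["maf"] >= 0.01) &         # minor allele frequency
    (variants["hwe_p"] >= 1e-6)         # Hardy-Weinberg equilibrium test p-value
)
print(variants[keep])
```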
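The batch-effect bullet suggests using batch as a covariate. The sketch below includes genotyping batch as dummy covariates in a simple logistic-regression association test for unrelated samples, using statsmodels on simulated data; the column names and values are purely illustrative, and each extra batch dummy is one more parameter to estimate, which is the power cost noted above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Simulated toy data: one variant, a binary phenotype, and a categorical batch label
df = pd.DataFrame({
    "genotype": rng.integers(0, 3, n),           # 0/1/2 allele counts
    "batch": rng.choice(["b1", "b2", "b3"], n),  # genotyping batch
    "phenotype": rng.integers(0, 2, n),          # case/control status
})

# Encode batch as dummy covariates alongside the genotype
X = pd.get_dummies(df[["genotype", "batch"]], columns=["batch"], drop_first=True, dtype=float)
X = sm.add_constant(X)

fit = sm.Logit(df["phenotype"], X).fit(disp=0)
print(fit.params["genotype"], fit.pvalues["genotype"])  # variant effect and p-value
```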
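Linear mixed models account for relatedness through a genetic relationship matrix (GRM) used as the random-effect covariance. The numpy sketch below builds a GCTA-style GRM from a 0/1/2 genotype matrix, which also hints at the computational cost (the matrix is n × n and costs roughly O(n²m) to form); this is an illustrative formulation, not any particular tool's implementation.

```python
import numpy as np

def genetic_relationship_matrix(G):
    """GCTA-style GRM from an (n_samples x n_variants) matrix of 0/1/2 allele counts."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                  # per-variant allele frequencies
    keep = (p > 0.0) & (p < 1.0)              # drop monomorphic variants (avoid division by zero)
    Gk, pk = G[:, keep], p[keep]
    Z = (Gk - 2.0 * pk) / np.sqrt(2.0 * pk * (1.0 - pk))  # standardize each variant
    return Z @ Z.T / Z.shape[1]               # average over variants -> n x n relatedness matrix

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(10, 1000))       # simulated genotypes (illustrative only)
K = genetic_relationship_matrix(G)
print(K.shape)  # (10, 10)
```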
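A sketch of the two adjustment strategies mentioned in the multiple-testing bullet: a fixed genome-wide threshold and Benjamini–Hochberg FDR, both applied to an array of GWAS p-values; the function name and default cutoffs are illustrative.

```python
import numpy as np

def gwas_significance(pvalues, gw_threshold=5e-8, fdr_level=0.05):
    """Flag variants passing a fixed genome-wide threshold and Benjamini-Hochberg FDR."""
    p = np.asarray(pvalues, dtype=float)
    genome_wide = p < gw_threshold

    # Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) * alpha,
    # then reject the k smallest p-values.
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * fdr_level
    k = (np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    fdr_pass = np.zeros(m, dtype=bool)
    fdr_pass[order[:k]] = True
    return genome_wide, fdr_pass

gw, fdr = gwas_significance([1e-9, 3e-8, 2e-4, 0.4])
print(gw, fdr)
```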
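For the fine-mapping bullet, below is a minimal credible-set sketch based on Wakefield's approximate Bayes factor, taking per-variant effect sizes and standard errors from the summary statistics of one locus. The function name, the `prior_sd` value, and the single-causal-variant simplification are assumptions; real analyses relax that simplification to handle the secondary signals mentioned above.

```python
import numpy as np

def credible_set(beta, se, prior_sd=0.2, coverage=0.95):
    """Single-causal-variant credible set from one locus's effect sizes and standard errors."""
    beta, se = np.asarray(beta, dtype=float), np.asarray(se, dtype=float)
    w = prior_sd ** 2
    v = se ** 2
    z2 = (beta / se) ** 2
    # Wakefield's log approximate Bayes factor in favour of association, per variant
    log_abf = 0.5 * np.log(v / (v + w)) + 0.5 * z2 * w / (v + w)
    # Posterior inclusion probabilities, assuming exactly one causal variant in the locus
    pip = np.exp(log_abf - log_abf.max())
    pip /= pip.sum()
    # Smallest set of variants whose inclusion probabilities sum to the requested coverage
    order = np.argsort(pip)[::-1]
    n_keep = int(np.searchsorted(np.cumsum(pip[order]), coverage)) + 1
    return order[:n_keep], pip

# Toy locus: two strong, similar signals and two null variants (illustrative values only)
members, pip = credible_set(beta=[0.30, 0.28, 0.05, -0.02], se=[0.05, 0.05, 0.05, 0.05])
print(members, pip.round(3))
```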
Links to resources and tools mentioned during the discussion, as well as relevant literature: