IBRE Bioinformatics Club

IBRE Bioinformatics Club is a series of online meetings of the bioinformatics community. It is dedicated to discussion of important topics in bioinformatics, computational biology, data science, and beyond.

Please register for the seminars using the form below. The link to the Zoom conference will be sent to you automatically before each meeting.

Topic | Date
What is the role of AI in the analysis of genetic variation? | August 5, 18:00 CET
How can AI transform the analysis of gene expression data? | August 19, 18:00 CET
How is AI enhancing proteomics and protein structure analysis? | September 2, 18:00 CET
What are the challenges and prospects of AI in bioinformatics? | September 16, 18:00 CET

The meetings are intended for open discussions and sharing of opinions on the topic. Discussion will be moderated by experts from IBRE, Wellcome Sanger Institute, and other research organizations.

Register now

Bioinformatic Analysis of Genome Variation

Below you can find bullet-point summaries of each meeting, supplemented with links to papers and tools mentioned during the discussion.

Speakers: Mykyta Artomov (Nationwide Children’s Hospital, Columbus, OH, USA), German Demidov (University of Tubingen, Germany)

While NGS is now a standard method for genetic testing, technical factors and limitations of the technology still limit our ability to detect short genetic variation. If the data quality is low, it is better to go back and correct the issues with the wet lab procedures than to interpret variant calls resulting from low-quality data (i.e., do not try to “fix” data quality issues with read preprocessing);
Most of the commonly used variant calling methods (e.g., GATK, DeepVariant) perform reasonably well in detecting short genetic variants. When working with a single sample, selecting the caller that shows the best sensitivity for single samples in benchmarks is the optimal strategy. Some centers may develop in-house solutions for variant calling; however, any custom tool has to undergo extensive validation. For comparative studies involving many samples, methods used to construct reference datasets may be preferred over others.
While many pipelines include recommended sets of filters, this step may also remove true positive variants and hence should be approached with caution. It is important to keep filtered variants so that the person interpreting the data can examine them in case no clear-cut candidate emerges from the high-quality variant set;
Updating a pipeline should be restricted to major tool changes that are expected to provide large performance gains. Any change to the pipeline should be followed by validation (and certification, if national law requires it);
One-stop annotation pipelines (e.g., Ensembl VEP) are sufficient for routine variant annotation. AlphaMissense, a recent method for missense variant pathogenicity evaluation, has greatly improved the power to assess the clinical relevance of missense variants;
Keeping track of variants identified in other subjects during the analysis is very important for further filtration of platform-specific artifacts and population-specific genetic variation. Joint variant calling for a big set of samples may also provide frequency information, but such an approach is too computationally intensive for this task - hence, keeping a simple database of observations does the trick.
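
The "simple database of observations" mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variant key layout `(chrom, pos, ref, alt)`, the function names, and the `max_fraction` cutoff are hypothetical and were not specified at the meeting.

```python
from collections import Counter


def build_observation_db(samples):
    """Count in how many samples each variant was observed.

    `samples` maps a sample ID to an iterable of variant keys,
    here (chrom, pos, ref, alt) tuples (a hypothetical layout).
    """
    counts = Counter()
    for variants in samples.values():
        counts.update(set(variants))  # count each variant once per sample
    return counts


def flag_recurrent(variants, counts, n_samples, max_fraction=0.05):
    """Split variants into (kept, flagged) by their local frequency.

    Variants seen in more than `max_fraction` of local samples are
    flagged as likely platform artifacts or common local variation.
    """
    kept, flagged = [], []
    for variant in variants:
        fraction = counts.get(variant, 0) / n_samples
        (flagged if fraction > max_fraction else kept).append(variant)
    return kept, flagged
```

Flagged variants are returned rather than discarded, so that the person interpreting the data can still review them if no candidate emerges from the kept set.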

Links to resources and tools mentioned during the meeting:

Our publication describing coverage biases in WES and WGS https://www.nature.com/articles/s41598-020-59026-y
Dr. Artomov’s paper touching upon platform bias and its impact https://www.nature.com/articles/s41588-023-01637-y
Review of germline variant calling methods: https://academic.oup.com/bib/article/25/2/bbad508/7585875
Our recent benchmark of variant callers https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08365-3
Our analysis of adapter removal effect on short variant calling https://f1000research.com/articles/13-506
Genome Analysis ToolKit (GATK) best practices workflow https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels
DeepVariant caller https://github.com/google/deepvariant
AlphaMissense pathogenicity predictor https://www.science.org/doi/10.1126/science.adg7492

Speakers: German Demidov (University of Tubingen), Gennadiy Zakharov (Wellcome Sanger Institute)

Use of outdated software tools for CNV analysis is not recommended. Commonly used tools include Manta (for structural variation) and ClinCNV (a good option for CNV analysis). It may also be helpful to validate the results using tools that rely on different types of information: read depth, paired-end read data, and split reads.
As CNV calling, especially for targeted resequencing methods, relies on control samples, the quality and availability of these control samples are extremely important. Because uniformity of coverage matters, control samples should be similar to the tested sample (e.g., same kit used, similar time period of sequencing). As causal CNVs are typically rare, selection of individuals by phenotype for a control cohort is not usually relevant;
Long-read (LR) sequencing provides a significant advantage for structural variant analysis. However, coverage is still important for sensitive detection of SVs from LR data. Importantly, long-read data still lack precision for short variant discovery; hence, it is common to use LR sequencing as a second step of the diagnostic pipeline after a negative short-read (SR) test result.
Similarly to short variant calling, allowing the clinician to review lower-confidence candidates may be beneficial to improve diagnosis rates. However, filtering and prioritization of the candidates based on technical parameters (e.g., likelihood of coverage profile under a duplication/deletion hypothesis) is more crucial for CNVs compared to short variants.
Allele frequency filtering is no less important for CNV analysis, and the annotation of variants with AF data from both reference databases and a local database (processed with your caller of choice) is important.
Annotation of structural variants requires specific annotation pipelines (e.g., AnnotSV). Visual inspection of calls may be even more helpful than plain annotation. However, visualization of structural variation is more complicated than for short variants, especially for balanced rearrangements (NGB provides one example of an advanced visualization method). One way to simplify visual inspection using conventional tools is to visualize all available data types (breakpoint locations, coverage profile, etc.) in a single genome browser window.
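
To make the idea of ranking CNV candidates by the "likelihood of coverage profile under a duplication/deletion hypothesis" more concrete, here is a minimal sketch. It assumes a simple Gaussian noise model on sample-to-control coverage ratios with a fixed standard deviation; this is an illustrative assumption, not the model used by ClinCNV or any other caller named above, and the function names are hypothetical.

```python
import math


def log_likelihood(ratios, copy_number, sd=0.15):
    """Gaussian log-likelihood of coverage ratios for one copy-number state.

    `ratios` are sample/control coverage ratios per target region, so a
    diploid region is expected near 1.0, a heterozygous deletion near 0.5,
    and a duplication near 1.5. `sd` is an assumed noise level.
    """
    expected = copy_number / 2.0
    total = 0.0
    for ratio in ratios:
        total += -0.5 * math.log(2 * math.pi * sd ** 2)
        total -= (ratio - expected) ** 2 / (2 * sd ** 2)
    return total


def score_segment(ratios, sd=0.15):
    """Return (best copy number, log-likelihood gain over the diploid state)."""
    lls = {cn: log_likelihood(ratios, cn, sd) for cn in (1, 2, 3)}
    best = max(lls, key=lls.get)
    return best, lls[best] - lls[2]
```

Under these assumptions, a larger log-likelihood gain over the diploid state marks a better-supported call, and candidates can be sorted by it before review.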

Links to resources and tools mentioned during the meeting:

Manta structural variant caller https://github.com/Illumina/manta; https://academic.oup.com/bioinformatics/article/32/8/1220/1743909
ClinCNV caller https://github.com/imgag/ClinCNV; https://www.biorxiv.org/content/10.1101/2022.06.10.495642v1
AnnotSV pipeline for SV impact annotation https://github.com/lgmgeo/AnnotSV; https://academic.oup.com/bioinformatics/article/34/20/3572/4970516
NGB structural variant visualization https://github.com/epam/NGB/blob/develop/docs/md/user-guide/variants.md

Speakers: Fotis Psomopoulos (Center for Research & Technology Hellas, Greece), German Demidov (University of Tubingen, Germany), Ilia Bizin (EpiValue, Croatia)

There is a high level of discordance between somatic variant callers, greater than that seen for germline variant detection. It is important to bear in mind that different callers perform differently depending on the properties of the input data (e.g., coverage, read length) and on the variant allele frequency (VAF) of the variants;
Post-alignment preprocessing of data is an important factor affecting the accuracy of somatic variant callers. Adding decoy sequences to the reference can also be helpful;
Tumor purity greatly influences the results of somatic variant calling. For homogeneous tumors with high purity, there is no need for ultra-deep sequencing. Usually, purity is estimated using wet lab procedures, but there are computational alternatives that produce comparable results. Computational tools for predicting purity commonly rely on copy-number variation. The estimation, however, will not work properly for polyploid tumors;
Variant calling in tumor-normal pairs is much more reliable than in the tumor-only setting. However, if the variant has low VAF, tumor-normal pairs are still not sufficient to confidently identify a variant. Conversely, for hematological tumors with high purity, a tumor-only pipeline performs reasonably well for VAF above 10%. Cancer driver mutations tend to have high VAF in the tumor sample, hence their identification is easier even in the tumor-only case;
Using a germline variant caller for tumor-only variant calling is not recommended. Similarly, calling mosaic variants in rare disease contexts using germline callers can be unreliable and should be avoided.
Selection of candidate driver mutations involves somatic variation databases (e.g., actionable mutations from COSMIC), functional consequence information, computational prediction (e.g., CADD score), etc. Filtering out variant calls with VAF below roughly a quarter of the purity estimate is useful for candidate driver mutation selection. Similarly to germline variants, visual inspection of candidates is important to validate the results of somatic variant calling.
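
The purity-based VAF filter from the last point can be written down directly. The sketch below assumes each call is a dictionary with a `vaf` field (a hypothetical layout) and uses the "roughly a quarter of the purity" rule of thumb from the summary as the default; the function names are illustrative.

```python
def min_vaf_threshold(purity, fraction=0.25):
    """VAF cutoff as a fraction of the tumor purity estimate."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return purity * fraction


def select_driver_candidates(calls, purity, fraction=0.25):
    """Keep calls whose VAF is at or above the purity-based cutoff.

    `calls` is a list of dicts with a "vaf" key (hypothetical layout).
    """
    threshold = min_vaf_threshold(purity, fraction)
    return [call for call in calls if call["vaf"] >= threshold]
```

For a purity estimate of 0.6 the cutoff is about 0.15, so a call at VAF 0.05 would be dropped from the driver candidate list while a call at VAF 0.30 would be kept.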

Links to resources and tools mentioned during the meeting:

A preprint describing somatic variant calling evaluation methods https://www.biorxiv.org/content/10.1101/2024.03.07.582313v1.full
Mutect2 somatic variant caller https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2
CADD tool for deleterious mutation prediction https://academic.oup.com/nar/article/47/D1/D886/5146191
The COSMIC database https://cancer.sanger.ac.uk/cosmic
Paper with discussion of mosaic variant calling approaches [the paper contains controversial recommendations discussed at the meeting] https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02285-3

Speakers: Mykyta Artomov (Nationwide Children’s Hospital, Columbus, OH, USA), Yakov Tsepilov (Wellcome Sanger Institute, Cambridge, UK)

Genotype imputation should be used for two basic reasons: (i) harmonizing the data obtained with different platforms to improve comparison and meta-analysis; (ii) increasing the probability of observing the true causal variant among variants used for association testing. Genotype imputation should not be considered a tool for “fixing” data with high missingness rates;
Imputation relies on an external genotype panel which should be relatively closely matched to the studied population (i.e., it is not necessary to have the exact same population, but the distance should be as small as possible). Commonly used panels are 1000 Genomes, Haplotype Reference Consortium (HRC), and NHLBI TopMed.
Imputation quality can be evaluated using a PCA plot (imputation should not shift the data significantly) or by computing polygenic risk scores and comparing them with the observed phenotypes. Another way of testing imputation is to mask a subset of observed sites and compare the imputed genotypes with the original ones for this subset of variants.
Quality filtering should be performed separately for case and control cohorts in a case-control setting, and the sets of thresholds should be similar between the two subsets (so as not to remove different proportions of samples). The set of thresholds (especially for human genetics) is standard and should not be significantly tuned to the dataset;
Batch effects have to be evaluated (to make sure some batches do not significantly deviate from others). Still, even if no batch effect is noticeable, using batches as covariates during association testing is helpful. At the same time, one has to bear in mind that inclusion of each new covariate reduces the power of association testing.
Linear mixed models for association testing enable the analysis of related individuals, but are more computationally intensive, which may be problematic for large-scale data;
Multiple testing adjustment in GWAS is typically performed by applying a genome-wide significance threshold, placed at 1 × 10^-8. Other methods, such as false discovery rate (FDR) correction, could also be used in certain cases. Variants not reaching significance after proper correction for multiple testing should never be considered or reported, irrespective of the patterns observed on Q-Q plots or heritability estimates.
Downstream processing of GWAS summary statistics includes trait-level (genetic correlation, pathway analysis) and locus-level methods. The latter group includes statistical fine mapping, which is used to identify credible sets of likely causal variants for each locus. Variants included in the credible set after fine mapping can be considered causal if the following assumptions are met: (i) the causal variation is genotyped; (ii) there are no outliers; and (iii) secondary signals are properly taken into account.
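
The masking check for imputation quality described above can be sketched as follows. Genotypes are assumed to be coded as alternate allele counts (0, 1, 2) in a plain list; the imputation step itself is external (for example, an imputation server) and is not shown. Function names and the default masking fraction are hypothetical.

```python
import random


def mask_genotypes(genotypes, fraction=0.1, seed=0):
    """Hide a random subset of observed sites.

    Returns (masked, hidden): `masked` is a copy with None at hidden
    sites, `hidden` maps site index -> true genotype.
    """
    rng = random.Random(seed)
    n_hidden = max(1, int(len(genotypes) * fraction))
    hidden_idx = rng.sample(range(len(genotypes)), n_hidden)
    hidden = {i: genotypes[i] for i in hidden_idx}
    masked = [None if i in hidden else g for i, g in enumerate(genotypes)]
    return masked, hidden


def concordance(imputed, hidden):
    """Fraction of hidden sites where the imputed genotype equals the truth."""
    matches = sum(1 for i, truth in hidden.items() if imputed[i] == truth)
    return matches / len(hidden)
```

In practice, `masked` would be passed to the imputation tool of choice, and `concordance` would then be computed on its output; a low value suggests a poorly matched reference panel or data quality problems.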

Links to resources and tools mentioned during the discussion, as well as relevant literature:

Hail library for genome-wide association analysis in Python https://hail.is/
A general review on genome-wide association studies: https://www.nature.com/articles/s43586-021-00056-9
Dr. Artomov’s group paper on Biobank Russia https://www.nature.com/articles/s41467-024-50304-1
A paper describing improved fine mapping via replication failure rate https://www.nature.com/articles/s41588-023-01597-3
Paper describing HRC reference panel https://www.nature.com/articles/ng.3643
TopMed imputation server https://imputation.biodatacatalyst.nhlbi.nih.gov/

Speaker: Pavel Flegontov (University of Ostrava, Czech Republic/David Reich Lab associate, Harvard University, MA, USA)

There are several groups of methods which are commonly used for population history reconstruction. These include estimation of f-statistics and fitting admixture graphs based on these estimates, visualization of population structure using PCA (or related methods), and other tools;
SNP capture methods are more commonly used for ancient DNA analysis than shotgun sequencing of the whole genome. Common capture panels have noticeable differences in the coverage profile they provide. Besides, usage of these platforms introduces notable bias due to non-random selection of polymorphic sites (“ascertainment bias”), which negatively impacts analyses based on common measures (such as f-statistics);
The ascertainment bias (for human populations) is almost exclusively relevant when using African populations. This observation stems from the fact that the commonly used panels do not capture variation that was closer to the root of the human population tree;
Fitting of admixture graphs based on f-statistics can result in the selection of models that have virtually nothing in common with the actual demographic history (as evidenced by simulation studies). In general, fitting of admixture graph models combined with formal hypothesis testing should be avoided, as the identifiability of the actual model is very low;
Visualization of population structure using PCA is very common, but it is also sensitive to various factors, such as the spatiotemporal sampling parameters. Several alternatives to PCA have emerged for solving the same basic task. These include UMAP, t-SNE, and variational auto-encoders (VAE). While UMAP and t-SNE perform relatively well in visualizing local structure, they perform poorly for global structure. VAE, on the other hand, performs well in capturing the actual distance landscape.
Unsupervised genetic clustering with ADMIXTURE tends to cut the actual landscape of genetic variation into parts (sections), with the number of clusters controlling the number of slices. The results obtained using these methods should be interpreted accordingly - not as the actual admixture history, but as sections of the global landscape.
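
For readers unfamiliar with f-statistics, the f4 statistic is the average over SNPs of the product of allele frequency differences between two pairs of populations. The sketch below computes this point estimate from per-population allele frequency lists; it is a minimal illustration and omits the block jackknife standard errors and sample-size handling that dedicated software provides. The function name and input layout are assumptions.

```python
def f4(freq_a, freq_b, freq_c, freq_d):
    """Point estimate of f4(A, B; C, D): mean over SNPs of (a - b) * (c - d).

    Each argument is a list of allele frequencies for the same SNPs,
    in the same order, in one population.
    """
    if not (len(freq_a) == len(freq_b) == len(freq_c) == len(freq_d)):
        raise ValueError("all populations must cover the same SNPs")
    if not freq_a:
        raise ValueError("at least one SNP is required")
    products = [
        (a - b) * (c - d)
        for a, b, c, d in zip(freq_a, freq_b, freq_c, freq_d)
    ]
    return sum(products) / len(products)
```

Because the statistic is an average over the SNPs that were selected for the panel, non-random selection of sites (the ascertainment bias discussed above) can shift its value.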

Links to studies mentioned by Dr. Flegontov:

Original 2006 study on the application of PCA to genetic data https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190
A 2012 paper describing f-statistics https://academic.oup.com/genetics/article/192/3/1065/5935193
ADMIXTURE genetic clustering algorithm https://genome.cshlp.org/content/19/9/1655
The paper describing SNP capture methods https://genome.cshlp.org/content/32/11-12/2068.long
Testing the ascertainment bias when applying f-statistics https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010931
Variational autoencoders in visualizing population structure https://academic.oup.com/g3journal/article/11/1/jkaa036/6105578
Limits of complex demographic history fitting using f-statistics https://elifesciences.org/articles/85492

Speaker: German Demidov (University of Tubingen)

Despite many advances, the average diagnostic yield of NGS remains at around 40%. However, it depends substantially on the disease group: for some eye disorders the rate may be close to 100%, while for autism spectrum disorders (ASDs) it may be below 30%;
The quality of the bioinformatic pipeline is an important determinant. This is especially important if certain analysis types (e.g., CNV analysis) were initially omitted. Besides, the experience of clinicians (both working with the patient’s phenotype and those interpreting the results) is no less important.
Typically, reanalysis of undiagnosed cases can be performed when a new (and more efficient) tool is included in the pipeline, when a new sequencing method is applied, or by simply looking at the same variant data with an updated set of disease genes (every year, up to 100 new gene-disease associations are recorded). However, the time of clinical interpreters is an important limitation for the latter type, and the yield of such reanalysis is expected to be lower than that of reanalyses involving new data generation;
New analytical approaches (e.g., performing WGS in addition to existing WES data, or using orthogonal methods such as RNA-seq) are important for finding the missing disease cause. For RNA-seq, the diagnosis may be achieved by identifying the changes in expression of a gene that is known to cause an inherited disease. Long reads are extremely beneficial for repeat expansions and structural variation.
Artificial intelligence (or other advanced machine learning methods) is a very promising approach for solving cases in a limited time frame, though these methods need further development.
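
The simplest form of reanalysis mentioned above, re-screening existing variant data against an updated disease gene list, can be sketched as follows. The variant layout (dictionaries with a `gene` field), the function name, and the gene names in the usage note are hypothetical placeholders.

```python
def newly_relevant_variants(variants, old_genes, new_genes):
    """Return variants located in genes added since the previous analysis.

    `variants` is a list of dicts with a "gene" key (hypothetical layout);
    `old_genes` and `new_genes` are the previous and updated gene lists.
    """
    added = set(new_genes) - set(old_genes)
    return [variant for variant in variants if variant["gene"] in added]
```

For example, if the gene list grew from `["GENE_A"]` to `["GENE_A", "GENE_B"]`, only variants in `GENE_B` would be returned for review, which keeps the workload of clinical interpreters limited to what has actually changed.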

Links to resources and tools mentioned during the talk:

A review of NGS in pediatric genomics https://www.nature.com/articles/nrg.2017.116
ACMG criteria for variant interpretation https://doi.org/10.1038/gim.2015.30
The Undiagnosed Disease Network (UDN): https://www.annualreviews.org/content/journals/10.1146/annurev-med-042120-014904
The Human Phenotype Ontology (HPO) https://hpo.jax.org/
GestaltMatcher tool for face-to-syndrome matching https://www.nature.com/articles/s41588-021-01010-x
snoRNA variation explains many NDD cases https://www.nature.com/articles/s41586-024-07773-7
German’s paper describing SV calling in the SolveRD cohort https://www.nature.com/articles/s41431-024-01637-4

Computational Methods for Reproducible Research

Best practices for data sharing: GitHub, Zenodo, NCBI; 1:19:10; https://youtu.be/_qS5bCT48xU?si=DlD43sU_cTZTgBo-
Sharing of raw data and the code used for its analysis has become a standard requirement for publication in high-end scientific journals. Still, not all data can be publicly shared, and there is an ongoing debate on what proportion of data should be published and how. Recently, platforms such as Zenodo and FigShare have become popular instruments for publication of both data and code, and now some journals even require posting to these platforms.

Pipeline development: Nextflow, Snakemake, or bash?; 1:15:06; https://youtu.be/BDboaeYiPtw?si=Dp9vbKswtUgXBJaa
Analysis of omics datasets usually involves several consecutive steps. Such a series of steps is called a bioinformatic pipeline. Over the past several years, multiple languages have been developed for construction and execution of such pipelines, with Nextflow, Common Workflow Language (CWL), and Snakemake among the most popular ones. There is, however, no universal standard for development of pipelines, and many researchers stick with bash scripting for their purposes.

Avoiding software dependency hell: virtual environments and conda; 1:22:26; https://youtu.be/EVbHsKbkqTc?si=06vwNjsiO4PvPt0c
Virtually everyone with experience in bioinformatic data analysis has faced the issue of software dependency conflicts. The problem is more than a simple inconvenience during installation and execution of a particular tool, as different versions of dependencies may affect the behavior of the software and the reproducibility of results. Several years ago, conda gained significant popularity in the bioinformatics community for package management and creation of virtual environments.

Downstream analysis with R Markdown and Jupyter; 1:33:50; https://youtu.be/6Dpf2ezcb38?si=VAI0OVjBB34MQXzH
While accurate processing of raw high-throughput datasets is important, it is the downstream analysis of the data that is crucial for drawing the correct conclusions. R Markdown and Jupyter Notebooks are the two main tools for making the downstream analysis of data in R or Python clean and reproducible. However, it may be challenging to organize the Markdown files or Jupyter notebooks with clean code and a proper file structure.

Cloud computing for computational biology; 1:20:54; https://youtu.be/l4F2kevAFNg?si=azzvZbcJJJp4AYLL
One of the main recent trends in computational biology and bioinformatics is the rising popularity of various cloud-based solutions. First, more and more people are using cloud computing resources, such as Google Cloud, Yandex Cloud, or Amazon Web Services, both for data analysis and for development/deployment of bioinformatic tools. Furthermore, there are now multiple platforms that provide interactive user interfaces and software collections for different analysis types.

Virtualization and containerization; 1:14:05; https://youtu.be/D3WhayaxrPI?si=D6mAQFDOuxY74241
Management of computational resources is an important topic for any bioinformatician, whether on local or remote infrastructure. Indeed, pipelining languages and other technologies discussed in our meetings greatly aid in this task. However, it is important to discuss the benefits of using other tools, including containerization, for more efficient data analysis in computational biology.

Detailed announcements of future sections will be published closer to the meeting date.

If you have any questions, please contact us at info@bioinf.institute