Bioinformatics Hackathon'2023
Yerevan, Armenia | 11–13 August

ABOUT

Bioinformatics Hackathon'2023 is an exciting hackathon that brings together teams of biologists and programmers to collaboratively solve real challenges in biological data analysis. Over a tight 48-hour timeframe, participants engage in rapid innovation, problem-solving, and fun!

WINNERS

1st place
TripToFun Team

Project:

Analysis of the T-cell immunological landscape and its association with graft-versus-host disease


Team members:

Elena Ocheredko

Nadezhda Lukashevich

Ekaterina Nesterenko

Anna Shchetsova



Team Leader:

Mikhail Drokov

2nd place
WOOF Team

Project:

Automating the Detection and Annotation of O-antigen Operons in Prokaryotic Genomes


Team members:

Oksana Kotovskaia

Ekaterina Marenina

Igor Ostanin

Nadezhda Pavlova

Nikita Vaulin


Team Leader:

Polina Kuchur

3rd place
Lobster Lovers

Project:

Clustering of small-cell lung cancer (SCLC) based on expression pattern and microenvironment


Team members:

Anna Kalygina

Lada Polyakova

Elizaveta Terekhova

Alina Shtork



Team Leader:

Ivan Valiev

Special prizes

The Three People Team

Project: Unraveling the Hidden Dance: An Epistatic Exploration of 15 Million+ SARS-CoV-2 Genomes

Team members: Daria Likholetova, Tatiana Pashkovskaia

Team Leader: Aleksey Komissarov


Barbie team

Project: Contrastive Learning for scRNA-seq Sample Representation

Team members: Liliia Bogdanova, Amina Ibragimova, Svetlana Tarbeeva

Team Leader: Vladimir Shitov


WINx Team

Project: Protein structure prediction using Deep Learning

Team members: Elizaveta Vlasova, Evgeniia Chukhrova, Victoria Latynina, Giomar Vasileva, Dana Zhetesova

Team Leader: Aleksey Komissarov


HaemoHazaRd Team

Project: Retrospective analysis of complex disease trajectories in patients after stem cell transplantation

Team members: Oleg Arnaut, Ivan Negara, Alisa Selezneva, Nikita Volkov

Team Leader: Ivan Moiseev

PROJECTS


The root of all evil: finding common properties of disease-associated gut bacteria

Yury Barbitoff | IBRE

Result
Team
Grigory Gladkov
Olga Muraeva
Lev Shagam
Marianna Zhiganova

Hundreds of studies focusing on gut microbiome changes in human disease have been performed in the last few years. In the majority of such studies, multiple groups of bacteria are reported as associated with a certain disease condition. Recently, multiple metagenomic studies have been aggregated to construct the Human Gut Microbiome Health Index (HGMHI) that includes multiple bacterial families, species, and strains that are prevalent in the gut of healthy or diseased individuals. However, it is unclear which factors contribute most to the increased or decreased proportions of HGMHI bacteria in the human gut depending on the disease status. In this project, we will try to identify the common properties of these bacteria.

Tasks
1. Get as much information as possible about the bacterial taxa in the HGMHI (genome sequences, proteome composition data, metabolic profiles, etc.);
2. Construct a control group of bacterial species that are present in the human gut microbiome irrespective of the disease status, get the same set of data for species in this group.
3. Investigate the differences between the HGMHI bacteria on the genomic, proteomic, or metabolomic level, provide biological and/or clinical interpretation of the results.
4*. Construct an updated version of HGMHI, compare its predictive power to the original study and other studies published recently.
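
As a rough illustration of tasks 3 and 4* (all file and column names here are hypothetical), a first-pass model in Python could compare HGMHI and control species from a per-species feature table:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical table: one row per species, engineered features + HGMHI label
df = pd.read_csv("bacteria_features.csv")
X = df.drop(columns=["species", "hgmhi_label"])   # genome size, GC%, pathway counts, ...
y = df["hgmhi_label"]                             # 1 = health-associated, 0 = control

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print("CV ROC AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())

clf.fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))  # candidate "common properties"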

Requirements for the team
At least one team member should have experience with accessing genomic and proteomic data from major resources. One or more team members should have at least some experience with building machine learning models (most likely, we will not need advanced classifiers for this project). All team members should have some basic skills with R/Python and shell.

Multiphase aging clocks on two types of omics data

Aleksey Alekseev | Independent researcher

Result
Team
Dmitrii Kriukov
Aleksandr Petrov
Aleksei Zarubin

The approach of constructing an aging clock based on the regression of panels of different omics data to chronological age is known in the literature. However, this approach has a significant drawback because it treats aging as a unidirectional monotonic process. According to some data, human aging consists of at least three stages – up to 37 years old, from 37 to 60 years old, and over 60 years old. There are open metabolomic and epigenetic datasets on which a multiphase aging clock can be constructed.

Tasks
Case solution participants are asked to construct such a clock for different age ranges, identify the major phases of aging, and propose models for the aging clock for each phase, based on methylation data (several datasets) and metabolomics (one comprehensive dataset). Several methylation datasets need to be normalized and harmonized.

Datasets:
1) https://data.hpc.imperial.ac.uk/resolve/?doi=6945&access=
2) GEO database:
GSE87571
GSE55763
GSE40279
and possibly some more.

It is required to obtain one or several algorithms (models) for the prediction of the biological age as well as markers of transition from one stage of aging to the next. Show the accuracy of the obtained algorithms – separately for methylation, separately for metabolomics.
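
A minimal sketch of such a multiphase clock (X and age are assumed preprocessed arrays; the phase boundaries follow the description above) would fit one regularized model per phase:

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_absolute_error

# X: numpy array, samples x CpGs (or metabolites); age: chronological age in years
phases = {"young": (0, 37), "middle": (37, 60), "old": (60, 120)}
models = {}
for name, (lo, hi) in phases.items():
    mask = (age >= lo) & (age < hi)
    model = ElasticNetCV(cv=5, random_state=0).fit(X[mask], age[mask])
    models[name] = model
    # in-sample MAE shown for brevity; use a held-out set in practice
    print(name, "MAE:", mean_absolute_error(age[mask], model.predict(X[mask])))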

For metabolomic dataset one can also identify corresponding metabolites (by m/z value and other values) – there are free services for that.

Requirements for the team
R/Python, biostatistics skills.

Contrastive Learning for scRNA-seq Sample Representation

Vladimir Shitov | Helmholtz Munich

Result
Team
Liliia Bogdanova
Amina Ibragimova
Svetlana Tarbeeva

Single-Cell RNA-sequencing data allows researchers to describe cell variability with unprecedented resolution. The number of single-cell datasets grows each year, leading to the emergence of atlassing projects, which combine data from hundreds or even thousands of individual donors. This opens the possibility of studying variability at the sample or patient level, discovering, for example, disease trajectories and linking them to molecular features. Several methods for sample representation exist; however, a contrastive learning approach has not yet been applied to this task. In this project, participants will work with high-dimensional single-cell data, develop a neural network-based method, and suggest metrics for the interpretation of sample representation.

Tasks
  1. Apply the contrastive learning framework to represent samples from single-cell RNA sequencing data for COVID-19 datasets
  2. Fine-tune the hyperparameters of the model, manually or using automated ML methods
  3. Benchmark sample representation comparing it with baseline methods
  4. Develop metrics for interpretation of representation
  5. Check if the conclusions from the papers or other established facts can be supported by representation
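
A minimal PyTorch sketch of the core idea in task 1 (shapes are assumed; the two "views" of each sample could be pseudobulk profiles built from two random subsets of its cells):

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (n_samples, dim) embeddings of two views of the same samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature        # similarity between every pair of views
    labels = torch.arange(z1.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

encoder = torch.nn.Sequential(              # 2000 = assumed number of highly variable genes
    torch.nn.Linear(2000, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64)
)
# x1, x2: (n_samples, 2000) tensors of pseudobulk expression from two cell subsets
# loss = info_nce(encoder(x1), encoder(x2)); loss.backward()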

Requirements for the team
For people with technical background:
- Good knowledge of Python (numpy, pandas, plotting libraries)
- Basics of deep learning (preferably, knowledge of pytorch or wish to learn it)
- Wish to analyse high-dimensional biological data

For people with biological background:
- Experience with transcriptomics data (ideally, single-cell)
- Basics of Python (ideally, knowledge of scanpy library)
- Ability to read papers about single-cell datasets or analysis

Github: https://github.com/VladimirShitov

"Sex of the brain" – predicting sex assigned at birth from brain activity

Ilya Zakharov, Alexey Shovkun | Brainify.AI

Result
Team
Aleksandra Beliaeva
Anna Ivanova
Anna Kapitonova
Savelii Komlev
Kristina Ushakova

By analyzing 5,500 magnetic resonance imaging (MRI) structural scans of more than 1400 human brains, Joel et al. (2015) identified specific "female-brain features" and "male-brain features" in the brain. Their findings revealed a substantial overlap in the characteristic features of most brain regions, creating a so-called "mosaic" of feminine-masculine traits in each individual. Despite the generally small magnitudes of sex-related brain differences (when structural and lateralization differences are present independent of size, sex/gender explains only about 1% of total variance) and the "continuum view," it has been demonstrated that one's sex can be predicted from brain activity with notable accuracy. For example, Zoubi et al. (2020) demonstrated that the area under the receiver operating curve for sex classification could reach 89% for a logistic regression model trained on the intrinsic BOLD signal fluctuations from resting-state functional MRI (rs-fMRI). In another study, van Putten et al. (2018) showed that a deep neural network trained on electroencephalographic (EEG) rhythmic activity could predict sex from scalp electroencephalograms with an accuracy of over 80%.

For the current project, we propose to develop an ML model that will improve on the results of van Putten et al. on EEG data. Several publicly available datasets exist for that purpose (e.g., the TUH dataset, Harati et al., 2015, or the TD-Brain dataset, Dijk et al., 2022). Both handcrafted EEG features and automatically extracted features (e.g., with DCNN models) can be used for prediction, although the interpretability of the model has to be kept in mind.

Tasks
  1. EEG data preparation and cleaning
  2. ML model development and validation
  3. Testing the association between outcomes of the developed ML models and metadata available in the dataset (testing the potential of the model as the biomarker)
  4. Investigating interpretability of the model (with focus both on technical and biological aspects)

Requirements for the team
The only hard prerequisite is knowing Python. Any type of ML model can be used, although a special focus on cross-validation of the model is important (due to the nature of brain activity). Additional tools that can be helpful include MNE-Python (https://mne.tools/) and EEGLAB. Some libraries that can be used for handcrafted feature generation:
- https://pypi.org/project/NEURAL-py-EEG/
- https://eeglib.readthedocs.io/en/latest/
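
A minimal sketch of the feature + cross-validation setup (scipy/scikit-learn used instead of MNE to keep it short; epochs, sex and subject are assumed pre-cleaned arrays):

import numpy as np
from scipy.signal import welch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(epoch, fs=250):
    """epoch: (n_channels, n_times) -> flat vector of band powers per channel."""
    f, psd = welch(epoch, fs=fs, nperseg=fs * 2)
    return np.concatenate([psd[:, (f >= lo) & (f < hi)].mean(axis=1)
                           for lo, hi in BANDS.values()])

# epochs: (n_epochs, n_channels, n_times); sex: 0/1 labels; subject: subject id per epoch
X = np.array([band_powers(e) for e in epochs])
scores = cross_val_score(LogisticRegression(max_iter=1000), X, sex,
                         groups=subject, cv=GroupKFold(n_splits=5), scoring="roc_auc")
print("ROC AUC:", scores.mean())   # grouping by subject avoids leakage across epochs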

Brainify.AI team paper on age prediction from EEG data.

Automating the Detection and Annotation of O-antigen Operons in Prokaryotic Genomes

Polina Kuchur | Independent researcher

Result

Team
Oksana Kotovskaia
Ekaterina Marenina
Igor Ostanin
Nadezhda Pavlova
Nikita Vaulin

The immune response is activated via structures found on the surface of bacterial cells. One such structure is the somatic antigen, or O-antigen, the outermost component of lipopolysaccharides. These O-antigens are essential for bacteria to establish symbiotic relationships with plants. Nonetheless, during the course of evolution, certain pathogenic microorganisms have managed to exploit O-antigens for infecting their hosts. The diversity of this structure has significantly increased over time. Tasks involving comparative genomics demand quick identification of O-antigen operons. We have previously devised a manual method to locate them. However, in this hackathon, our aim is to automate this procedure. We will use complete genome assemblies as input, and the desired output is the identification of somatic antigen operons, complete with annotations and images.

We will provide a brief introduction to the organization and annotation of prokaryotic genomes. Additionally, we'll discuss the crucial O-antigen genes and guide you on how to locate them.

Tasks
  1. We possess a variety of tools suitable for each stage of analysis. Your initial step will be to evaluate these tools and seek out alternatives if needed.
  2. Your next step involves scripting for parsing the output generated by each program and facilitating its transition to the subsequent phase.
  3. Subsequently, you will establish a comprehensive workflow.
  4. Following the workflow creation, you will rigorously test it to ensure its functionality.
  5. You’ll then develop a user manual detailing the usage and functionalities of our workflow.
  6. Finally, you’ll publish the established workflow, or pipeline, on GitHub for public access and further collaboration.
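
A rough Python sketch of step 2 (the keyword list and file name are purely illustrative; in the real pipeline the gene set comes from the manual method mentioned above):

KEYWORDS = ("wzx", "wzy", "wzz", "o-antigen", "flippase", "polymerase")  # illustrative only

hits = []
with open("assembly.gff") as gff:                   # e.g. a Prokka/Bakta-style GFF3 annotation
    for line in gff:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        seqid, _, ftype, start, end, _, strand, _, attrs = fields[:9]
        if ftype == "CDS" and any(k in attrs.lower() for k in KEYWORDS):
            hits.append((seqid, int(start), int(end), strand, attrs))

# group hits closer than 5 kb on the same contig into candidate operons
hits.sort(key=lambda h: (h[0], h[1]))
operons, current = [], []
for h in hits:
    if current and (h[0] != current[-1][0] or h[1] - current[-1][2] > 5000):
        operons.append(current)
        current = []
    current.append(h)
if current:
    operons.append(current)
print(f"{len(operons)} candidate operon regions found")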

Requirements for the team
Programming languages, knowledge of particular software, etc.

At least one team member proficient in Python and Linux
Skills that will definitely speed up the project: Knowledge of prokaryotic genetics and genomics, prior experience with O-antigens, familiarity with Snakemake, GitHub, and data visualization using Python. However, it's worth noting that the hackathon provides an excellent opportunity to learn something new!

GitHub: https://github.com/aglabx/OantigenMiner
Papers:
https://doi.org/10.18699/VJGB-22-98
https://doi.org/10.1101/2022.04.05.486866
https://doi.org/10.1016/j.ijbiomac.2020.06.093

Hi-C Copilot

Danil Zilov | Independent researcher

Result
Team
Stanislav Mitusov
Elena Pazhenkova
Alena Taskina
Peter Zhurbenko

With the help of modern sequencing methods, we have learned to assemble genomes to a sufficiently high resolution. Currently, the majority of the time in genome assembly is spent on curating Hi-C maps. During the hackathon, your task is to develop an MVP assistant tool to facilitate the curation process.

Throughout the project, we will demonstrate the source and structure of Hi-C data, how genome curators work with them, and the main challenges encountered in the process. Additionally, we will show you common artifacts found in Hi-C maps.

Tasks
1. Gain a comprehensive understanding of Hi-C data, including the concept of a Hi-C map, the biological principles underlying them, and their significance in bioinformatics beyond just assemblies.
2. Acquire the skills to read and interpret Hi-C files, and establish the ability to effectively associate them with the corresponding genome assembly.
3. Develop proficiency in identifying and recognizing major artifacts present in Hi-C files, such as incorrect order of genomic regions, inversions, and merged chromosomes. Explore techniques to detect these artifacts at the Hi-C file level.
4. Devise an approach or methodology for automatic correction of the identified artifacts. This may involve the use of computational algorithms, statistical models, or other relevant techniques to rectify the detected issues in Hi-C files.
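
As a toy illustration of task 3 (the .cool file is hypothetical; a real tool would work at a sensible resolution and per chromosome), unusually strong long-range contacts can be flagged as candidate misjoins:

import numpy as np
import cooler

c = cooler.Cooler("assembly.cool")            # hypothetical contact file at coarse resolution
m = np.nan_to_num(c.matrix(balance=True)[:])  # dense matrix; fine only for small maps

i, j = np.indices(m.shape)
far = np.abs(i - j) > 100                     # look far away from the main diagonal
threshold = np.percentile(m[far], 99.9)       # crude hotspot cutoff
cand = np.argwhere((m > threshold) & far & (i < j))
print(f"{len(cand)} candidate off-diagonal hotspots (bin pairs) to review")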

Requirements for the team
Programming languages, knowledge of particular software, etc.

Minimal: Python.
Would speed up the process: experience with HiC, binary files, OpenCV, Java.

GitHub: https://github.com/aglabx/HiCPilot

Unraveling the Hidden Dance: An Epistatic Exploration of 15 Million+ SARS-CoV-2 Genomes

Aleksey Komissarov | Independent researcher

Result
Team
Daria Likholetova
Tatiana Pashkovskaia

This comprehensive project entails an in-depth examination of the SARS-CoV-2 virus with a distinctive focus on its lesser-known genetic composition. Our investigative endeavor will concentrate on the exploration of epistasis, examining the interplay between distinct mutations in the non-Spike genes of SARS-CoV-2. Through detailed data analysis, we aim to discern patterns of co-occurring mutations and comprehend their collective influence on the virus's evolution.

Utilizing advanced data analysis techniques, our goal is to unravel hidden interactions that may elucidate the virus's behavior, its adaptability, its rapid transmission, and its resistance to certain treatments. With the application of machine learning algorithms, we intend to construct predictive models to foresee the possible impacts of these mutations on the virus, thereby informing potential future trends in its evolution.

The culmination of this project will be the generation of a comprehensive report and presentation, encapsulating key insights gleaned from our research. These findings may significantly enhance our strategic response to future viral pandemics.

Tasks
The following steps will facilitate the achievement of our project goals:
  1. Data Collection: Initiating with the collation of over 15 million SARS-CoV-2 genomes from GISAID and NCBI databases, we aim to assemble as extensive a data set as possible.
  2. Data Preprocessing: This critical phase involves data cleaning and preprocessing, focusing on the detection of mutations in the non-Spike genes of the virus.
  3. Mutation Analysis: With the data duly prepared, we will proceed to identify patterns of co-occurring mutations, furnishing preliminary indications of potential epistasis.
  4. Interpretation of Epistasis: This phase will entail the use of state-of-the-art analytical tools to delineate the interactions between co-occurring mutations, determining whether they are collaborative or competitive.
  5. Predictive Modeling using Machine Learning: Here, we will employ machine learning techniques to construct predictive models to estimate the potential effects of these mutation interactions on the future evolution of the virus.
  6. Model Evaluation and Optimization: Subsequently, we will assess the models for accuracy, performing necessary modifications to ensure optimal performance.
  7. Data Representation: Having obtained a wealth of insights, we will craft a comprehensive report, presenting the data in an informative and compelling narrative, supported by visual aids to better communicate our findings.
  8. Knowledge Dissemination: The final stage involves converting our report into an engaging presentation, elucidating our research journey, its discoveries, and its implications for strategic planning in future pandemic scenarios.
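
As a small-scale illustration of steps 3–4 (the DataFrame is assumed; at 15 million+ genomes the same idea would need sparse matrices and vectorized counting):

import itertools
import pandas as pd
from scipy.stats import fisher_exact

# muts: DataFrame of 0/1, rows = genomes, columns = mutations (e.g. "ORF1ab:P314L")
results = []
for a, b in itertools.combinations(muts.columns, 2):
    both    = int(((muts[a] == 1) & (muts[b] == 1)).sum())
    only_a  = int(((muts[a] == 1) & (muts[b] == 0)).sum())
    only_b  = int(((muts[a] == 0) & (muts[b] == 1)).sum())
    neither = int(((muts[a] == 0) & (muts[b] == 0)).sum())
    odds, p = fisher_exact([[both, only_a], [only_b, neither]])
    results.append((a, b, odds, p))

pairs = pd.DataFrame(results, columns=["mut_a", "mut_b", "odds_ratio", "p_value"])
print(pairs.sort_values("p_value").head())    # strongest co-occurrence candidates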

Requirements for the team
Required Skills:
1. Python Programming: essential for handling, cleaning, analyzing, and visualizing data, as well as building machine learning models.
2. Data Analysis: ability to analyze and interpret complex data sets. Familiarity with Python libraries like pandas, NumPy, and matplotlib is a must.
3. Bioinformatics: basic understanding of genomic data, mutations, and viral genetics. Familiarity with tools for genomic data analysis will be beneficial.
4. Machine Learning: knowledge of machine learning algorithms, how to implement them using libraries such as scikit-learn, and how to evaluate and refine their performance.

Skills That Can Speed Up the Project:
1. Advanced Bioinformatics: proficiency in handling large genomic datasets and familiarity with specialized tools for genomic data analysis can expedite the data processing and analysis phases.
2. Experience with Linux and Parallel Programming: knowledge of Linux can aid in handling the large data sets and running complex machine learning models more efficiently. Ability to design and implement parallel algorithms and pipelines can significantly speed up data processing and model training phases.
3. Domain Knowledge in Virology: understanding of viral genetics, specifically regarding SARS-CoV-2, can provide valuable insights during data analysis and model building.
4. Data Visualization Expertise: proficiency in creating engaging and informative visualizations using libraries like seaborn, Plotly, or tools like Tableau can expedite the reporting process.
5. Communication Skills: strong written and verbal communication skills can speed up the final reporting and presentation preparation stage.

GitHub: https://github.com/aglabx/DecodingViralDance

Beyond T2T Human Assembly: Resolving Pathogenic Variants in Multicopy Gene Families using WGS

Aleksey Komissarov | Independent researcher

Result
Team
Danil Ivanov
Gleb Khegai
Vlad Maximov
Mikhail Solovyanov
Danil Stupichev
Sergei Volkov

The advent of T2T (Telomere-to-Telomere) human assembly has revolutionized the field of genomics by providing fully assembled genes, including segmental duplications and multicopy genes. These regions, which were previously challenging to assemble, often contain crucial genes with significant implications for human health. In this two-day hackathon project, your task is to estimate the number of pathogenic variants present in these multicopy genes that can be resolved using short reads generated from Whole Genome Sequencing (WGS). Additionally, you need to determine the minimum read length required for the accurate resolution of variants in these genes. Furthermore, you will investigate the number of known variants that are located within regions that are hard to resolve uniquely.

Tasks
Identification of challenging loci in hg38 and T2T assemblies (e.g. the longest common substring over Hamming distance problem and interval trees):
  1. Identify genomic regions in the hg38 assembly with sequence repeats exceeding the length of raw reads and/or popular insertion sizes.
  2. Identify loci that are single copy in hg38 but multicopy in the T2T assembly.
  3. Perform an intersection analysis with ClinVar variants to identify variants with known effects.
  4. Identify regions within multicopy genes that present challenges for unique resolution.
  5. Assess the impact of unresolved variants in these regions on clinical interpretations and potential disease associations.
If these tasks are completed quickly, additional tasks can be solved.
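
A minimal sketch of the ClinVar intersection step (task 3), using the intervaltree package and hypothetical pre-parsed inputs:

from intervaltree import IntervalTree

# repeats: list of (chrom, start, end) regions with repeats longer than the read length
# variants: list of (chrom, pos, clinvar_id) parsed from a ClinVar VCF
trees = {}
for chrom, start, end in repeats:
    trees.setdefault(chrom, IntervalTree()).addi(start, end)

hard_to_resolve = [v for v in variants
                   if v[0] in trees and trees[v[0]][v[1]]]  # variant falls inside a repeat
print(f"{len(hard_to_resolve)} ClinVar variants lie inside hard-to-resolve regions")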

Requirements for the team
Required Skills:
1. Python Programming: essential for handling, cleaning, analyzing, and visualizing data.
2. Data Analysis: ability to analyze and interpret complex data sets. Familiarity with Python libraries like pandas, NumPy, and matplotlib is a must.
3. Bioinformatics: basic understanding of genomic data, mutations, and viral genetics. Familiarity with tools for genomic data analysis will be beneficial.

Skills That Can Speed Up the Project:
1. Advanced Bioinformatics: proficiency in handling large genomic datasets and familiarity with specialized tools for genomic data analysis can expedite the data processing and analysis phases.
2. Familiarity with genomic assemblies: prior knowledge and experience working with genomic assemblies like hg38 and T2T will facilitate the identification of challenging loci and the interpretation of results.
3. Familiarity with ClinVar data: prior knowledge and experience working with human variants will facilitate the identification of challenging loci and the interpretation of results.
4. Data Visualization Expertise: proficiency in creating engaging and informative visualizations using libraries like seaborn, Plotly, or tools like Tableau can expedite the reporting process.
5. Communication Skills: strong written and verbal communication skills can speed up the final reporting and presentation preparation stage.

GitHub: https://github.com/aglabx/beyondT2T

Predicting multifactorial phenotypes: better breeding

Lavrentii Danilov, Mikhail Rayko | Independent researchers

Result
Team
Alexey Kosolapov
Kseniia Maksimova
Vera Panova
Kseniia Struikhina

A large number of organismal traits are characterized by their polygenic nature, i.e., the manifestation of phenotypes depends on several mutations. There are a number of methods that make it possible to determine the so-called epistatic pairs - associated SNPs that affect the same trait. In this project, we want to analyze the results of such methods and check whether they can be used to determine the optimal sets of SNPs to obtain a given phenotype.

Tasks
  1. Parse the results of pairwise association analysis of SNPs (e.g., MIDESP result) associated with a particular quantitative phenotype
  2. Represent the results as a graph of associated SNPs and loci
  3. Devise an algorithm that looks for potential SNP combinations to maximize/minimize the manifestation of the original phenotype
  4. Wrap it all up in a convenient pipeline and get an application for SNP-guided CRISPR-associated selection
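
A minimal sketch of steps 1–2 (the input columns are hypothetical; MIDESP's actual output format should be checked):

import pandas as pd
import networkx as nx

pairs = pd.read_csv("midesp_pairs.tsv", sep="\t")   # assumed columns: snp_a, snp_b, mi
G = nx.Graph()
for _, row in pairs.iterrows():
    G.add_edge(row["snp_a"], row["snp_b"], weight=row["mi"])

# connected components = candidate SNP modules acting on the same trait
components = sorted(nx.connected_components(G), key=len, reverse=True)
print("largest associated SNP module:", components[0])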

Requirements for the team
Knowledge of at least one programming language (Python, Java), fundamentals of genetics. Ideally, experience with GWAS data and graph analysis.

Auxological predictors of intrauterine fetal growth restriction

Dmitrii Zhakota | Independent researcher

Result
Team
Ruslan Alagov
Yuri Malovichko
Anton Shikov

A significant medical and demographic problem is fetal growth restriction (FGR). Specialists in ultrasound, obstetrics and gynecology, neonatology, and pathological anatomy are searching for metrics to diagnose FGR.

Despite the multidisciplinary approach, it has not been possible to develop conclusive criteria that can be used by all specialists at any stage of fetal development. We will test hypotheses about the accuracy of FGR assessment from the perspectives of pathologists and neonatologists using a dataset of adverse pregnancy outcomes.

Tasks
Task 1: Exploratory data analysis
Objective: Generate basic summary tables and visualization of data. Get preliminary insights into associations between the FGR event and various features.

Task 2: Inferring FGR probability
Objective: Using conventional biostatistical approaches, estimate the significance and effect size of various factors potentially impacting FGR.

Task 3: Assessment of existing scores
Objective: Based on the dataset, estimate the applicability of various existing scores to FGR diagnostics.

Task 4: Development of new diagnostic score for FGR
Objective: Apply modern biostatistics and machine learning approaches to develop a new score that outperforms existing approaches.
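
A minimal sketch of Task 2 (column names are hypothetical), estimating effect sizes with a logistic regression:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pregnancy_outcomes.csv")           # hypothetical dataset
model = smf.logit("fgr ~ gestational_age + placental_weight + maternal_age", data=df).fit()
print(model.summary())
print(np.exp(model.params))                           # coefficients as odds ratios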

Requirements for the team
Knowledge of the R/Python language and skills in data visualization, statistics and machine learning are desirable. No medical knowledge required – we will understand everything on the spot through simple examples and analogies.

Retrospective analysis of complex disease trajectories in patients after stem cell transplantation

Ivan Moiseev | Independent researcher

Result
Team
Oleg Arnaut
Ivan Negara
Alisa Selezneva
Nikita Volkov

Graft-versus-host disease (GvHD) is a complication that might occur after bone marrow transplantation. GvHD is characterized by a high prevalence and a noticeable probability of death. That is why the analysis of various factors impacting GvHD probability remains a topical problem in modern oncohematology. In the current project we propose to analyse a unique dataset containing data about GvHD diagnosis and treatment collected across 4 transplant centers in Russia. No statistical analysis of the data has been done yet, so the team will take the first and important step in mining new insights in this field.

Tasks
Task 1: Gathering summary statistics of the data (distribution of age/sex/nosologies etc. across centers and outcomes).

Task 2: Perform basic survival analysis to understand factors affecting mortality, non-relapse mortality (NRM), and response to therapy.

Task 3: Perform advanced analysis to uncover more complex associations, taking into account several lines of therapy and confounding factors in a multi-parameter space.
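
The project itself calls for R, but as an illustration of the shape of Task 2, here is a Python/lifelines sketch with hypothetical columns (categorical covariates would need dummy encoding):

import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

df = pd.read_csv("gvhd_cohort.csv")                   # time, event, age, ... (hypothetical)
kmf = KaplanMeierFitter().fit(df["time"], event_observed=df["event"])
kmf.plot_survival_function()

cph = CoxPHFitter().fit(df[["time", "event", "age"]],
                        duration_col="time", event_col="event")
cph.print_summary()                                    # hazard ratios and p-values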

Requirements for the team
Programming in R. Knowledge of non-parametric tests, survival analysis, machine learning–based modeling.

Application of natural language processing methods for parsing histological and endoscopic studies results in oncohematological patients

Ivan Moiseev | Independent researcher

Result
Team
Anna Avagyan
Oleg Kosarev
Anna Manaseryan
Vladislav Markelov
Anastasiia Murzina

Graft-versus-host disease (GvHD) is a complication that might occur after bone marrow transplantation. GvHD is characterized by a high prevalence and a noticeable probability of death. That is why the analysis of various factors impacting GvHD probability remains a topical problem in modern oncohematology. However, a rich set of data containing the text results of some clinical procedures (mainly histology and endoscopy) is still not used for GvHD analysis. Thus, machine text analysis is required. The goal of the project is to determine key characteristics in text descriptions for final diagnosis, predictors of response to therapy and survival. The complexity of the dataset is determined by simultaneous endoscopy of the lower and upper gastrointestinal tract in some patients at a single time point, re-biopsies in others, and the presence of GvHD manifestations outside of the gastrointestinal tract.

Tasks
Primary endpoint:
Determine key features in text description of histological and endoscopic studies that predict clinical diagnosis and response to therapy

Secondary endpoints:
- Non-relapse mortality predictors
- Overall survival predictors
- Response predictors
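
A minimal baseline sketch for the primary endpoint (column names are hypothetical; Russian-language reports may need extra tokenization/lemmatization):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

reports = pd.read_csv("reports.csv")                  # hypothetical: text, diagnosis
pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=3),
                     LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, reports["text"], reports["diagnosis"], cv=5).mean())
# inspect the largest coefficients per class to surface key text features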

Requirements for the team
Experience in R/Python programming with:
- Machine text analysis
- Multivariate modeling
- Cluster analysis

Analysis of the T-cell immunological landscape and its association with graft-versus-host disease

Mikhail Drokov | Independent researcher

Result
Team
Nadezhda Lukashevich
Ekaterina Nesterenko
Elena Ocheredko
Anna Shchetsova

Transplantation of allogeneic hematopoietic stem cells makes it possible to achieve a biological cure in patients with leukemia. But the main reason for decreased disease-free survival and quality of life is chronic graft-versus-host disease (chronic GVHD), which develops in more than half of patients. Chronic GVHD remains a clinical diagnosis, but attempts are being made worldwide to find diagnostic markers of this complication. The pathogenesis of this condition is not fully understood, but the role of T-lymphocytes has been proven.

A large dataset of immunological data is provided (analysis of the subpopulation composition of T-lymphocytes – 162 subpopulations) for 70 patients after stem cell transplantation. Chronic GVHD developed in 28 of these 70 patients.

Purpose: application of modern data processing and visualization technologies for the prediction and characterization of chronic GVHD.

Tasks
Task 1: basic preprocessing of initial dataset

Task 2: choice and implementation of dimensionality reduction methods for visualization of the data

Task 3: clustering of T-cell profiles in order to estimate basic “immunological portraits” of the patients.

Task 4: Find associations between immunological portrait and GvHD development.
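
A minimal sketch of Tasks 2–3 (the input file is hypothetical; 70 patients × 162 subpopulations as described above):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = pd.read_csv("tcell_subsets.csv", index_col=0)     # patients x subpopulation frequencies
Xs = StandardScaler().fit_transform(X)
emb = PCA(n_components=2).fit_transform(Xs)           # 2-D embedding for visualization

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
print("silhouette:", silhouette_score(Xs, labels))
# Task 4: cross-tabulate `labels` with chronic GVHD status (e.g. Fisher's exact test)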

Requirements for the team
- experience in statistical analysis in R/Python
- machine learning
- visualization of multi-dimensional data
- cluster analysis

Evaluation of KIR-receptor impact on graft function disruption after allogeneic hematopoietic stem cell transplantation

Ekaterina Mikhaltsova | Independent researcher

Result
Team
Alexey Glazkov
Henrik Grigoryan
Armine Kazarian
Uliana Maslikova

Allogeneic hematopoietic stem cell transplantation makes it possible to achieve a cure in patients with blood system diseases. Disruption of normal graft function is a group of complications after transplantation which significantly worsens the prognosis and quality of life of patients.

All over the world there are attempts to find diagnostic markers of this complication. The pathogenesis of this condition is not fully understood; there are sporadic works on the influence of NK cells, in particular KIR receptors, on the development of graft failure.

Tasks
Task 1: Identify the KIR landscape in donor-recipient pairs and their combinations.

Task 2: Find associations between types of transplantation and KIR landscape.

Task 3: Using biostatistics and machine learning methods identify the relationship between KIR in donor-recipient pairs and transplant outcomes.

Task 4: Find associations of KIR clusters and the function of the transplant.

Requirements for the team
- experience in statistical analysis in R/Python
- machine learning
- visualization of multi-dimensional data
- cluster analysis

Protein structure prediction using deep learning

Aleksei Artemiev | SimpleFold

Result
Team
Evgeniia Chukhrova
Victoria Latynina
Giomar Vasileva
Elizaveta Vlasova
Dana Zhetesova

The goal of the project is to develop a system to predict secondary protein structure using known open-source instruments (ESM, AlphaFold, ProtTrans, and Ankh pretrained LLM models). It should have a good API interface and visualization functions (additional task). Unfortunately, there are no common and convenient libraries that provide such functionality.

https://www.linkedin.com/in/aalksii/

Tasks
1. Collect SOTA approaches into a pipeline;
2. Create a convenient inference pipeline (similar to Hugging Face);
3. Develop a user-friendly API;
4. Embed visualization functions (additional task);
5. Prepare an example of the library usage for a protein in FASTA format.
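
As an illustration of task 3 only (predict_structure here is a placeholder, not a real model), a minimal Flask endpoint could look like this:

from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_structure(sequence: str) -> dict:
    # placeholder for the actual ESM/AlphaFold/ProtTrans/Ankh-based predictor
    return {"length": len(sequence), "prediction": "..."}

@app.route("/predict", methods=["POST"])
def predict():
    seq = request.get_json()["sequence"]
    return jsonify(predict_structure(seq))

# run with: flask --app app run
# then POST {"sequence": "MKT..."} to http://localhost:5000/predict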

Requirements for the team
Software Engineers (Python, Flask/Django, storage), Data Scientists (data manipulation and visualization), Bioinformaticians (protein engineering/design, sequence manipulation algorithms), ML Researchers and Engineers (NLP: LLM models, RNNs), and domain experts are welcome to join!

Identification of New Aging Biomarkers using RNA-seq data analysis and Machine Learning

Iaroslav Mironenko | Independent researcher

Result
Team
Victoria Fedulova
Nikita Katermin
Daria Klimova
Fedor Logvin
Anton Sivkov
Xenia Sukhanova

This bioinformatics project integrates RNA-seq data analysis and machine learning to identify new potential biomarkers of aging. By applying rigorous computational methods and validation experiments, the project aims to contribute to our understanding of the aging process and potentially provide insights into the development of therapies targeting age-related diseases.

Tasks
Data Collection and Preprocessing:

Acquire RNA-seq data from publicly available aging-related studies, ensuring diverse tissue types, age-related diseases and age groups are included.
Conduct quality control measures to remove outliers and low-quality samples, which may involve examining sequencing metrics, removing samples with low read counts, or checking for batch effects.
Perform read alignment against a reference genome and quantify gene expression levels using tools such as STAR, HISAT2, or Salmon.

Differential Expression Analysis:

Employ statistical methods like DESeq2, edgeR, or limma to identify genes that exhibit significant differential expression with age.
Apply appropriate statistical tests to determine significant age-related gene expression changes, considering factors such as fold-change and adjusted p-values.
Generate a list of potential aging-associated genes based on the statistical significance of their expression changes.

Feature Selection and Dimensionality Reduction:

Employ feature selection techniques to reduce the number of genes and focus on the most informative features for aging prediction.
Utilize methods like variance thresholding, mutual information, or recursive feature elimination to identify genes that contribute most to age-related variations.
Perform dimensionality reduction techniques such as PCA or t-SNE to visualize the data in lower-dimensional space and detect underlying patterns or clusters.

Machine Learning Model Development:

Find new, previously unknown biomarkers using machine learning, based on their similarity to the biomarkers found in the RNA-seq data analysis.

Split the data into training and testing sets to evaluate model performance.
Explore a range of machine learning algorithms, such as random forest, support vector machines, logistic regression, or neural networks.
Train the models using the training set, optimizing hyperparameters through techniques like grid search or Bayesian optimization.
Assess the performance of the models using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, or area under the receiver operating characteristic curve (AUC-ROC).

Biomarker Validation and Interpretation:

Validate the selected machine learning model using independent datasets or through cross-validation techniques to assess its generalization performance.
Interpret the model's feature importance scores or coefficients to identify the most informative genes (biomarkers) for aging prediction.
Conduct gene ontology enrichment analysis or pathway analysis to explore the biological functions and pathways associated with the identified biomarkers, gaining insights into the underlying mechanisms of aging.
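
A compact sketch tying the feature selection and model steps together (expr and age are assumed preprocessed; an independent test set is still needed for validation):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# expr: samples x genes DataFrame of normalized expression; age: age per sample
pipe = Pipeline([("filter", VarianceThreshold(threshold=0.1)),
                 ("model", RandomForestRegressor(n_estimators=500, random_state=0))])
print("CV MAE:", -cross_val_score(pipe, expr, age, cv=5,
                                  scoring="neg_mean_absolute_error").mean())

pipe.fit(expr, age)
kept = expr.columns[pipe.named_steps["filter"].get_support()]
top = pd.Series(pipe.named_steps["model"].feature_importances_, index=kept)
print(top.sort_values(ascending=False).head(20))      # candidate aging biomarkers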

Requirements for the team
R or Python programming, experience with NCBI (GEO in particular), and RNA-seq data analysis skills.

LLM on the lam: are pre-trained embeddings good representations to argue about protein stability?

Tatiana Malygina, Mariia Lomovskaia | Helmholtz-Institute for Pharmaceutical Research Saarland, independent researcher

Result
Team
Alexander Gavrilenko
Daria Gorbach
Evgenii Potapenko
Nadezhda Sorokina

Currently there is a zoo of large language models available; some of them can even speak the language of proteins. There are also numerous models based on the transformer architecture with some geometric deep learning flavor which utilize protein structural information. In this project we propose that the participants play with some of them, apply them to the mini-proteins dataset by Rocklin et al., and see which approaches are better for this task – the ones based on human knowledge (which can be found in the literature) vs. simple ones built on features learnt by an LLM that was given 200 TPUs and has seen too much (the participants will try to implement them during the hackathon).

Tasks
The dataset by Rocklin et al. can be utilized in different settings. It has pairs of protein sequences (wildtype and mutated) with the corresponding stability score. The PDB files with the protein structures are also available. For the simpler setting, which is doable during the hackathon, I propose using protein sequence pairs.
1. Pick one or several protein language models. I suggest using ESM2 (there is a relatively small model available) and/or Ankh model by Rostlab (the participants might want to use something different here).
2. Exploratory analysis: compute the embeddings produced by pretrained protein language model; try to do dimensionality reduction and see how the data with different secondary structure types or different stability score values are packed in the embedding space.
3. Solve protein stability prediction as a metric learning task
4. Compare your team’s stability prediction results with the results of other researchers, which can be found in the literature
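
A minimal non-metric-learning baseline alongside tasks 2–3 (a small public ESM2 checkpoint; sequences and stability are assumed lists from the Rocklin et al. dataset):

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

name = "facebook/esm2_t6_8M_UR50D"                     # small ESM2 model, runs on CPU
tok, model = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

def embed(seqs):
    with torch.no_grad():
        out = model(**tok(seqs, return_tensors="pt", padding=True))
    return out.last_hidden_state.mean(dim=1).numpy()   # naive mean-pool over tokens

# sequences: list of mini-protein sequences; stability: matching stability scores
X = embed(sequences)
print("CV R^2:", cross_val_score(Ridge(), X, stability, cv=5).mean())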

https://github.com/songlab-cal/tape#list-of-models-and-tasks

Subtyping of Small-Cell lung cancer

Ivan Valiev | BostonGene

Result
Team
Anna Kalygina
Lada Polyakova
Alina Shtork
Elizaveta Terekhova

Multiple attempts have been made to obtain a reasonable transcriptional subtyping of small-cell lung cancer (SCLC). Yet all of them are questionable.

Are there truly different (mutually exclusive) subtypes? Are they distinct, or are there intermediate forms? Does scRNA-Seq recapitulate the findings of bulk RNA-Seq? Does the subtyping remain adequate if we switch from tissue RNA-Seq to cell lines? Can we divide SCLC by different paradigms (i.e. transcription factors, microenvironment, etc.)?

All of those hypotheses can actually be tested, though for the sake of time we will try to narrow the list. We want to run a set of clustering experiments on a set of SCLC datasets and expect at least two different clustering families to emerge: clustering based on leading transcription factors and clustering based on the microenvironment.

Literature for reading:
https://www.mdpi.com/2072-6694/14/22/5600
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6538259/
https://www.cell.com/cancer-cell/pdf/S1535-6108(20)30662-0.pdf

Tasks

Goal of the project:
Obtain one reasonable (from the point of view of clustering metrics) classification of SCLC. Ideally – with clinical relevance.

Tasks:
  1. Perform quality control of the data (where applicable). We have several datasets of SCLC (and extrapulmonary small-cell cancers), some with raw data and some only with processed data.
  2. Harmonize the datasets. Almost all datasets will be from RNA-Seq, but we still expect heavy batch effects. Some batch correction techniques or separate analyses of datasets would be expected.
  3. Collect/engineer principles for clustering. Literature search, known databases like STRING (to acquire neighbors of candidate genes), external tools of microenvironment deconvolution (like EPIC, CIBERSORT, or Kassandra) – anything.
  4. Cluster the data and evaluate the clusters. Typical hierarchical clustering, its ensembling in ConsensusClusterPlus, NMF – any way you want, but in the end we will need some metric to examine adequacy.
  5. *If possible – connect with other features of the datasets (aka CNAs, mutations, clinical traits)
  6. *If possible – train a simple classifier that would be able to predict cluster membership for a new sample.
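
As a sketch of task 4 (X is an assumed non-negative, harmonized samples × genes matrix, e.g. log1p(TPM)):

from sklearn.decomposition import NMF
from sklearn.metrics import silhouette_score

for k in range(2, 6):
    W = NMF(n_components=k, init="nndsvda", max_iter=500,
            random_state=0).fit_transform(X)
    labels = W.argmax(axis=1)                          # dominant factor per sample
    print(k, "silhouette:", silhouette_score(X, labels))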

Requirements for the team
Basic familiarity with NGS QC (most importantly – RNA QC).
Python – capable of operating with pandas and SciPy, familiar with clustering techniques like hierarchical clustering or NMF. R would be welcome also, but most of the work is expected to be done in Python.
Gene databases (like MSigDB or STRING) – facilitate the process of acquiring potential features for clustering.

Determining the number of copies of specific genes in a tumor

Danil Stupichev | BostonGene

Result
Team
Dmitry Rassolov
Mikhail Slizen
Aleksandr Voskoboinikov
Anastasia Zolotar

Modern tools for copy number analysis (sequenza, facets) are well able to determine the number of DNA copies from WES data at the genomic level: whole chromosomes, chromosome arms and other large genome regions. They are able to assess the ploidy and purity of the tumor. But they are not well-tuned to determine the copy number of specific genes, and it is the copy number of specific genes, or even transcripts, that is important for making treatment decisions. Often the CNA caller output is simply intersected with gene boundaries, which can lead to erroneous results. You will develop a tool that determines the copy number of clinically relevant genes from the tumor's WES data. Such a tool will help to more accurately determine the copy number of genes in patients with cancer and choose the correct treatment.

Tasks
  1. Use coverage depth data to make a tool in Python
  2. Write understandable documentation
  3. Put the code in the repository

Optionally you can do:
  • Consider normalization for GC composition
  • Consider VAF bias of heterozygous variants and quantify the minor allele
  • Consider additional events that may influence the copy number decision: the presence of a mutation in the gene and gene fusion
  • Wrap tool in docker
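
A deliberately naive sketch of the core calculation (the coverage table is hypothetical; purity and ploidy would in practice come from an existing caller):

import numpy as np
import pandas as pd

cov = pd.read_csv("exon_coverage.tsv", sep="\t")       # columns: gene, tumor_depth, normal_depth
cov["ratio"] = (cov["tumor_depth"] / cov["tumor_depth"].median()) / \
               (cov["normal_depth"] / cov["normal_depth"].median())
gene = cov.groupby("gene")["ratio"].median()           # per-gene tumor/normal ratio

purity, ploidy = 0.7, 2.0                              # e.g. taken from sequenza/facets output
D = purity * ploidy + (1 - purity) * 2                 # average copy number of the mixture
copies = (gene * D - (1 - purity) * 2) / purity        # back out tumor copy number
print(copies.sort_values().round(1).head())            # candidate deletions first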

Requirements for the team
Python, bash, WES.
Good to know the biology of CNA and how CNA tools work.

Analysis of multiple time points of cfDNA from plasma of patients with oncological diagnoses

Anastasia Yudina | BostonGene

Result
Team
Marianna Baranovskaia
Anton Eliseev
Artyom Ershov
Dmitrii Poliakov

Genetic diagnosis is known to be an important part of cancer treatment. However, whole genome/exome sequencing is an expensive and time-consuming analysis that is not applicable to the dynamic analysis of a patient's condition. Moreover, deep sequencing is required for minimal residual disease (MRD) monitoring at low limits of detection. Deep sequencing is better performed on a target panel with a limited number of genes rather than on the whole genome/exome.

A good assay for MRD testing is liquid biopsy and analysis of cell-free DNA (cfDNA). In this project, we invite hackathon participants to apply their skills to analyze dynamic cfDNA data of cancer patients.

Tasks
  1. Annotate somatic point mutations obtained from cfDNA target sequencing data.
  2. Create a pipeline to analyze dynamic data when each patient has multiple time points before and after therapy.
  3. Find and interpret clinically significant events for each patient.
  4. Visualize the results of the analysis, prepare a report on the events found.
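
A minimal sketch of the visualization step (long-format table with hypothetical columns):

import pandas as pd
import matplotlib.pyplot as plt

calls = pd.read_csv("cfdna_calls.csv")                 # patient, timepoint, variant, vaf
patient = calls[calls["patient"] == "P01"]
for variant, grp in patient.groupby("variant"):
    plt.plot(grp["timepoint"], grp["vaf"], marker="o", label=variant)
plt.xlabel("time point")
plt.ylabel("VAF")
plt.legend(fontsize=6)
plt.savefig("P01_vaf_trajectories.png", dpi=150)       # one panel per patient in a report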

Requirements for the team
  • the team must have at least one bioinformatician capable of processing raw variant calling data
  • the team should have a person who knows Python, who can assemble the data analysis pipeline and visualize the results
  • the team should have a person who understands oncogenetics, who can interpret the results of the analysis

ORGANIZER

  • IBRE is an early-stage international startup that aims to connect researchers, students, and professionals in the field of bioinformatics. The IBRE team has a successful track record: it previously founded and developed the renowned Bioinformatics Institute over 10 years and launched many educational programs and events in the field.
PARTNERS

  • BostonGene Corporation is pioneering the use of biomedical software for advanced patient analysis and personalized therapy decision making in the fight against cancer. BostonGene’s unique solution performs sophisticated analytics to aid clinicians in their evaluation of viable treatment options for each patient's individual genetics, tumor and tumor microenvironment, clinical characteristics and disease profile. BostonGene’s mission is to enable physicians to provide every patient with the highest probability of survival through optimal cancer treatments using advanced, personalized therapies.
  • EPAM leads the industry in digital platform engineering and product development services. With 57,450+ professionals in over 50 countries, EPAM landed in Armenia about nine years ago, and today it is a team of 900+ innovators creating complex solutions that impact businesses and communities worldwide. 
Questions?
hack@bioinf.institute