Bioinformatics Hackathon'2023

Yerevan, Armenia | 11–13 August

The root of all evil: finding common properties of disease-associated gut bacteria

Yury Barbitoff | IBRE
Hundreds of studies of gut microbiome changes in human disease have been performed in recent years. In the majority of such studies, multiple groups of bacteria are reported as associated with a certain disease condition. Recently, multiple metagenomic studies were aggregated to construct the Human Gut Microbiome Health Index (HGMHI), which includes multiple bacterial families, species, and strains that are prevalent in the gut of healthy or diseased individuals. However, it remains unclear which factors drive the increased or decreased proportions of HGMHI bacteria in the human gut depending on disease status. In this project, we will try to identify the common properties of these bacteria.

1. Get as much information as possible about the bacterial taxa in the HGMHI (genome sequences, proteome composition data, metabolic profiles, etc.);
2. Construct a control group of bacterial species that are present in the human gut microbiome irrespective of disease status, and collect the same set of data for this group;
3. Investigate the differences between the HGMHI bacteria at the genomic, proteomic, or metabolomic level, and provide biological and/or clinical interpretation of the results;
4*. Construct an updated version of the HGMHI, and compare its predictive power to that of the original study and other recently published studies.
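As an illustration of step 3, group-wise comparisons of the collected features can be run with standard non-parametric tests. The table below is entirely invented; the column and group names are placeholders, not values from the HGMHI study:

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical per-taxon feature table: one row per species, a "group" column
# and numeric feature columns. All names and values are invented.
taxa = pd.DataFrame({
    "species": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "group":   ["health", "health", "health", "control", "control", "control"],
    "genome_size_mb": [2.1, 2.4, 2.2, 4.8, 5.1, 4.9],
})

def compare_feature(df, feature, group_a, group_b):
    """Two-sided Mann-Whitney U test for one feature between two taxon groups."""
    a = df.loc[df["group"] == group_a, feature]
    b = df.loc[df["group"] == group_b, feature]
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    return stat, p

stat, p = compare_feature(taxa, "genome_size_mb", "health", "control")
```

Looping this over every collected feature, with multiple-testing correction, gives a first screen of properties that separate the groups.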

Requirements for the team
At least one team member should have experience with accessing genomic and proteomic data from major resources. One or more team members should have at least some experience with building machine learning models (most likely, we will not need advanced classifiers for this project). All team members should have some basic skills with R/Python and shell.

Subtyping of Small-Cell lung cancer

Ivan Valiev | BostonGene
Multiple attempts have been made to derive a reasonable transcriptional subtyping of small-cell lung cancer (SCLC). Yet all of them remain questionable.

Are there truly different, mutually exclusive subtypes? Are they distinct, or are there intermediate forms? Does scRNA-Seq recapitulate the findings of bulk RNA-Seq? Does the subtyping remain adequate if we switch from tissue RNA-Seq to cell lines? Can we divide SCLC by different paradigms (e.g., transcription factors, microenvironment, etc.)?

All of these hypotheses can actually be tested, though for the sake of time we will try to narrow the list. We want to run a set of clustering experiments on a set of SCLC datasets, and we expect at least two different clustering families to emerge: clustering driven by leading transcription factors and clustering driven by the microenvironment.

Literature for reading:


Goal of the project:
Obtain one classification of SCLC that is reasonable from the point of view of clustering metrics, ideally with clinical relevance.

  1. Perform quality control of the data (where applicable). We have several datasets of SCLC (and extrapulmonary small-cell cancers), some with raw data, some only with processed data.
  2. Harmonize the datasets. Almost all will be RNA-Seq, but we still expect strong batch effects; batch correction techniques or separate per-dataset analyses will likely be needed.
  3. Collect/engineer principles for clustering. Literature search, known databases like STRING (to acquire neighbors of candidate genes), external tools of microenvironment deconvolution (like EPIC, CIBERSORT, or Kassandra) – anything.
  4. Cluster the data and evaluate the clusters. Typical hierarchical clustering, its ensembling in ConsensusClusterPlus, NMF – any way you like, but in the end we will need some metric to assess adequacy.
  5. *If possible – connect with other features of the datasets (aka CNAs, mutations, clinical traits)
  6. *If possible – train a simple classifier able to predict the cluster assignment.
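The clustering-plus-metric loop of step 4 can be sketched on synthetic data; a real run would substitute the harmonized SCLC expression matrix for the toy matrix below:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy "expression matrix": 40 samples x 50 genes with two planted subtypes.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 50)),
    rng.normal(3.0, 1.0, size=(20, 50)),
])

# Ward hierarchical clustering, cut into k = 2 clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Silhouette score as one "adequacy" metric: values near 1 mean tight,
# well-separated clusters; values near 0 mean overlapping clusters.
score = silhouette_score(X, labels)
```

Sweeping k and the linkage method, and comparing silhouette (or consensus) scores, is one way to decide whether a candidate subtyping is defensible.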

Requirements for the team
Basic familiarity with NGS QC (most importantly – RNA QC).
Python – capable of working with pandas and SciPy, familiar with clustering techniques such as hierarchical clustering or NMF. R is also welcome, but most of the work is expected to be done in Python.
Gene databases (such as MSigDB or STRING) – these facilitate the process of acquiring potential features for clustering.

Determining the number of copies of specific genes in a tumor

Danil Stupichev | BostonGene
Modern tools for copy number analysis (sequenza, facets) can reliably determine DNA copy number from WES data at the level of large genomic regions: whole chromosomes, chromosome arms, and other large segments. They can also assess the ploidy and purity of the tumor. However, they are not well tuned to determine the copy number of specific genes – and it is the copy number of specific genes, or even transcripts, that is important for making treatment decisions. Often the CNA caller output is simply intersected with gene boundaries, which can lead to erroneous results. Your task is to develop a tool that determines the copy number of clinically relevant genes from a tumor's WES data. Such a tool will help to determine gene copy numbers in cancer patients more accurately and to choose the correct treatment.

  1. Build a tool in Python that uses coverage depth data
  2. Write understandable documentation
  3. Put the code in the repository
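One possible core for such a tool, shown on invented depth values (a production caller would aggregate per-exon depths from BAM files and take purity and ploidy from a caller such as sequenza or facets):

```python
import numpy as np

def gene_copy_number(tumor_depth, normal_depth, purity):
    """
    Estimate the absolute copy number of a gene from mean coverage depth.

    tumor_depth / normal_depth: mean per-base depth over the gene's exons
    (already normalized for total library size). purity: tumor cell fraction.
    Simplified illustration only, not a production caller.
    """
    r = tumor_depth / normal_depth
    # Invert the mixture model r = (purity * cn + (1 - purity) * 2) / 2,
    # assuming the normal sample is diploid at this locus.
    cn = (2.0 * r - 2.0 * (1.0 - purity)) / purity
    return max(cn, 0.0)

# Tumor covered twice as deep as normal at 100% purity -> 4 copies.
cn = gene_copy_number(tumor_depth=120.0, normal_depth=60.0, purity=1.0)
```

Note how the same 2x depth ratio implies 6 copies at 50% purity: correcting for the normal-cell admixture is exactly what naive ratio thresholds miss.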

Optionally you can do:
  • Consider normalization for GC composition
  • Consider VAF bias of heterozygous variants and quantify the minor allele
  • Consider additional events that may influence the copy number decision: the presence of a mutation in the gene and gene fusion
  • Wrap the tool in Docker

Requirements for the team
Python, bash, WES.
Good to know the biology of CNA and how CNA tools work.

Analysis of multiple time points of cfDNA from plasma of patients with oncological diagnoses

Anastasia Yudina | BostonGene
Genetic diagnosis is known to be an important part of cancer treatment. However, whole genome/exome sequencing is an expensive and time-consuming analysis that is not suited to dynamic monitoring of a patient's condition. Moreover, deep sequencing is required for minimal residual disease (MRD) monitoring at low limits of detection, and deep sequencing is better performed on a target panel with a limited number of genes rather than on the whole genome/exome.

A good assay for MRD testing is liquid biopsy with analysis of cell-free DNA (cfDNA). In this project, we invite hackathon participants to apply their skills to the analysis of dynamic cfDNA data from cancer patients.

  1. Annotate somatic point mutations obtained from cfDNA target sequencing data.
  2. Create a pipeline to analyze dynamic data when each patient has multiple time points before and after therapy.
  3. Find and interpret clinically significant events for each patient.
  4. Visualize the results of the analysis, prepare a report on the events found.
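For step 2, a long-format variant table pivoted by time point already gives a usable per-patient trajectory view. Variant names, time points, and VAFs below are invented:

```python
import pandas as pd

# Hypothetical long-format table of variant calls across serial cfDNA samples.
calls = pd.DataFrame({
    "patient":   ["P1"] * 6,
    "timepoint": ["baseline", "baseline", "cycle2", "cycle2", "cycle4", "cycle4"],
    "variant":   ["TP53 p.R175H", "KRAS p.G12D"] * 3,
    "vaf":       [0.12, 0.08, 0.05, 0.03, 0.01, 0.00],
})

# Wide table: one row per variant, one column per time point - easy to plot
# or to scan for molecular response and relapse.
traj = calls.pivot_table(index="variant", columns="timepoint", values="vaf")
traj = traj[["baseline", "cycle2", "cycle4"]]  # enforce chronological order
```

Each row of `traj` is one mutation's VAF trajectory; a steady fall toward zero after therapy is the molecular-response pattern the report in step 4 would highlight.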

Requirements for the team
  • the team must have at least one bioinformatician capable of processing raw variant calling data
  • the team should have a person who knows Python and can assemble the data analysis pipeline and visualize the results
  • the team should have a person who understands oncogenetics and can interpret the results of the analysis

Multiphase aging clocks on two types of omics data

Aleksey Alekseev | Independent researcher
The approach of constructing an aging clock based on the regression of panels of different omics data to chronological age is known in the literature. However, this approach has a significant drawback because it treats aging as a unidirectional monotonic process. According to some data, human aging consists of at least three stages – up to 37 years old, from 37 to 60 years old, and over 60 years old. There are open metabolomic and epigenetic datasets on which a multiphase aging clock can be constructed.

Participants are asked to construct such a clock for different age ranges, identify the major phases of aging, and propose aging clock models for each phase, based on methylation data (several datasets) and metabolomics (one comprehensive dataset). The methylation datasets need to be normalized and harmonized.

2) GEO database:
and possibly some more.

The goal is to obtain one or several algorithms (models) for predicting biological age, as well as markers of the transition from one stage of aging to the next. Show the accuracy of the obtained algorithms, separately for methylation and for metabolomics.

For the metabolomic dataset, one can also identify the corresponding metabolites (by m/z value and other properties); there are free services for that.
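A toy sketch of the multiphase idea, with one synthetic marker and the breakpoints mentioned above. A real clock would use many CpG sites or metabolite intensities, and would also need a rule for assigning a sample to a phase (e.g., from a first-pass prediction):

```python
import numpy as np

BREAKPOINTS = (37.0, 60.0)  # phase boundaries from the project description

def phase_of(age):
    if age < BREAKPOINTS[0]:
        return 0
    return 1 if age < BREAKPOINTS[1] else 2

# One synthetic marker whose age slope differs between phases.
rng = np.random.default_rng(1)
age = rng.uniform(20, 80, size=300)
slopes = np.array([0.5, 1.5, 0.2])
marker = np.array([slopes[phase_of(a)] * a for a in age])
marker += rng.normal(0.0, 0.5, size=300)

# Fit a separate linear model (marker -> age) within each phase.
models = {}
for ph in (0, 1, 2):
    mask = np.array([phase_of(a) == ph for a in age])
    models[ph] = np.polyfit(marker[mask], age[mask], deg=1)

def predict_age(m, ph):
    slope, intercept = models[ph]
    return slope * m + intercept
```

A single global regression would blur the three slopes together; the per-phase fit is exactly what makes the clock "multiphase".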

Requirements for the team
R/Python, biostatistics skills.

Contrastive Learning for scRNA-seq Sample Representation

Vladimir Shitov | Helmholtz Munich
Single-cell RNA sequencing data allow researchers to describe cell variability with unprecedented resolution. The number of single-cell datasets grows each year, leading to the emergence of atlas-building projects that combine data from hundreds or even thousands of individual donors. This opens the possibility of studying variability at the sample or patient level, discovering, for example, disease trajectories and linking them to molecular features. Several methods for sample representation exist; however, a contrastive learning approach has not yet been applied to this task. In this project, participants will work with high-dimensional single-cell data, develop a neural network-based method, and suggest metrics for the interpretation of sample representations.

  1. Apply the contrastive learning framework to represent samples from single-cell RNA sequencing data for COVID-19 datasets
  2. Fine-tune the hyperparameters of the model, manually or using automated ML methods
  3. Benchmark the sample representation by comparing it with baseline methods
  4. Develop metrics for interpreting the representation
  5. Check whether conclusions from the papers, or other established facts, can be supported by the representation
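The core of step 1 is a contrastive loss over paired sample "views". Below is a plain-NumPy sketch of a SimCLR-style NT-Xent loss; in the actual project this would be a PyTorch module inside a training loop, and the "views" would come from augmenting cell sets of the same donor:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """
    Simplified NT-Xent (normalized temperature-scaled cross-entropy) loss,
    the core of SimCLR-style contrastive learning, in plain NumPy.
    z1[i] and z2[i] are embeddings of two "views" of the same sample.
    """
    z = np.vstack([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / temperature
    n = z1.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    np.fill_diagonal(sim, -np.inf)                   # exclude self-similarity
    logits = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
# Loss is low when paired views agree, high when they are unrelated.
aligned = nt_xent(base, base + 0.01 * rng.normal(size=(8, 16)))
random_ = nt_xent(base, rng.normal(size=(8, 16)))
```

Minimizing this loss pulls the two views of each sample together and pushes the other samples' embeddings apart, which is what yields a sample-level representation.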

Requirements for the team
For people with technical background:
- Good knowledge of Python (numpy, pandas, plotting libraries)
- Basics of deep learning (preferably, knowledge of pytorch or wish to learn it)
- Wish to analyse high-dimensional biological data

For people with biological background:
- Experience with transcriptomics data (ideally, single-cell)
- Basics of Python (ideally, knowledge of scanpy library)
- Ability to read papers about single-cell datasets or analysis


Metaphor comprehension and its connection to cognitive and affective empathy, focusing specifically on individuals with borderline personality disorder (BPD)

Sofia Garbaly, Aleksandra Gorbunova | Independent researchers
Our project aims to analyse metaphor comprehension and its connection to cognitive and affective empathy, focusing specifically on individuals with borderline personality disorder (BPD).
Metaphorical thinking is closely related to an individual's learning capabilities, as it determines the ability to associate a given concept with an experience outside its immediate environment. Each association initiated by metaphorical thinking involves neuronal activity in the prefrontal cortex, a centre of learning. The greater an individual's ability to associate experience with cognition, the greater their information processing ability; a lack of this ability results in attention deficits among learners. It has been argued that practising metaphorical thinking during the teaching-learning process increases activity in the prefrontal cortex, which in turn promotes information processing ability.
To study metaphor comprehension and empathy in BPD, we analyzed data from a group of 20 BPD patients and 20 healthy controls. The dataset includes demographic information such as gender, age, and handedness, along with education levels and self-assessment surveys. We also examined performance on a metaphor comprehension task that encompassed general metaphors (CM), novel metaphors (NM), and diffuse stimuli (MS).
Existing literature suggests that laterality may affect the perception of metaphors. To keep the experiment clean, we excluded left-handed individuals from the general analysis due to their very limited representation (only one participant in group 0 and three in group 1). As a result, the analysis covers 18 participants in each group.
Beyond BPD itself, this research contributes to our understanding of metaphor comprehension and to the fields of psychology, linguistics, and empathy research, exploring the interplay between metaphorical thinking, cognitive and affective empathy, and learning.

Task 1: Exploratory Data Analysis
Objective: Conduct an exploratory data analysis to identify differences and similarities between the two study groups (BPD patients and healthy controls) in terms of demographic characteristics, age, and level of education.

1. Gather the demographic data of the participants, including gender, age, and level of education.
2. Clean and preprocess the data, ensuring accuracy and consistency.
3. Generate descriptive statistics, such as mean, median, mode, and standard deviation, for each demographic characteristic.
4. Visualize the data using appropriate graphs and charts to compare the distributions of age and education level between the two groups.
5. Conduct hypothesis testing (e.g., t-test, chi-square test) to identify statistically significant differences between the groups in terms of age and level of education.
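Steps 3–5 can be sketched with scipy.stats on synthetic demographics; all numbers below are invented, not the study's data:

```python
import numpy as np
from scipy import stats

# Synthetic demographics for 18 participants per group (all values invented).
rng = np.random.default_rng(42)
age_bpd  = rng.normal(27.0, 6.0, size=18)
age_ctrl = rng.normal(28.0, 6.0, size=18)
sex_counts = np.array([[14, 4],   # BPD: female, male
                       [12, 6]])  # controls: female, male

# Age: Welch's two-sample t-test (does not assume equal variances).
t_stat, t_p = stats.ttest_ind(age_bpd, age_ctrl, equal_var=False)

# Sex distribution: chi-square test on the 2x2 contingency table.
chi2, chi_p, dof, expected = stats.chi2_contingency(sex_counts)
```

With n = 18 per group, these tests are underpowered for small effects, so non-significant demographic differences should be reported as "no detected difference" rather than proof of equivalence.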

Task 2: Analysis of Results
Objective: Analyze the obtained data and draw meaningful conclusions regarding metaphor comprehension in BPD patients compared to healthy controls.

1. Review the results of the metaphor comprehension task for both groups (BPD patients and healthy controls).
2. Summarize the performance of each group in terms of their comprehension of general metaphors (CM), novel metaphors (NM), and diffuse stimuli (MS).
3. Identify any notable patterns, trends, or differences in metaphor comprehension between the two groups.
4. Conduct statistical analysis (e.g., analysis of variance, chi-square test) to determine if there are statistically significant differences in metaphor comprehension between the groups.
5. Interpret the findings, highlighting any significant differences or similarities between the groups in relation to metaphor comprehension.

Task 3: Deep Analysis of Data
Objective: Conduct a deep analysis of the data, including statistical processing of the obtained data and studying the relationship between test results and personal indicators such as fantasy, empathy, and degree of disorder.

1. Gather additional data on personal indicators such as fantasy, empathy, and degree of disorder for each participant.
2. Clean and preprocess the data, ensuring accuracy and consistency.
3. Perform statistical analyses (e.g., correlation analysis, regression analysis) to explore the relationship between test results (metaphor comprehension) and personal indicators.
4. Interpret the results, identifying any significant associations or correlations between test performance and personal indicators.
5. Discuss the implications of these findings, highlighting the potential impact of personal indicators on metaphor comprehension and cognitive abilities in BPD patients compared to healthy controls.

Requirements for the team
- Data Visualization – Python, Power BI and Excel;
- Python – Transitions and Operations. Functions, modules and libraries. Lists and tuples. NumPy and pandas libraries;
- Excel – VLOOKUP. Pivot tables. Functions and formulas. Charts and graphs. Power Query;
- Analytics – AzureML: fitting a scaler on training data, training a logistic regression model, predicting for new objects; SQL and Excel.

"Sex of the brain" – predicting sex assigned at birth from brain activity

Ilya Zakharov, Alexey Shovkun | Brainify.AI
By analyzing 5,500 magnetic resonance imaging (MRI) structural scans of more than 1,400 human brains, Joel et al. (2015) identified specific "female-brain features" and "male-brain features". Their findings revealed a substantial overlap in the characteristic features of most brain regions, creating so-called "mosaics" of feminine-masculine traits in each individual. Despite the generally small magnitudes of sex-related brain differences (when structural and lateralization differences are present independent of size, sex/gender explains only about 1% of total variance) and the "continuum view", it has been demonstrated that one's sex can be predicted from brain activity with notable accuracy. For example, Zoubi et al. (2020) demonstrated that the area under the receiver operating curve for sex classification could reach 89% for a logistic regression model trained on intrinsic BOLD signal fluctuations from resting-state functional MRI (rs-fMRI). In another study, van Putten et al. (2018) showed that a deep neural network trained on electroencephalographic (EEG) rhythmic activity could predict sex from scalp electroencephalograms with an accuracy of over 80%.
For the current project, we propose to develop an ML model that improves on van Putten's results on EEG data. Several publicly available datasets exist for that purpose (e.g., the TUH dataset, Harati et al., 2015, or the TD-BRAIN dataset, van Dijk et al., 2022). Both handcrafted EEG features and automatically extracted features (e.g., from DCNN models) can be used for prediction, although the interpretability of the model has to be kept in mind.

  1. EEG data preparation and cleaning
  2. ML model development and validation
  3. Testing the association between outcomes of the developed ML models and metadata available in the dataset (testing the potential of the model as the biomarker)
  4. Investigating interpretability of the model (with focus both on technical and biological aspects)
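For the handcrafted-features route, a typical starting point is band power computed from Welch's power spectral density. A small sketch on a synthetic channel (the sampling rate here is an assumption; real datasets document their own):

```python
import numpy as np
from scipy.signal import welch

FS = 250  # sampling rate in Hz; an assumption - real datasets specify their own

def bandpower(signal, fs, lo, hi):
    """Average spectral power of one EEG channel in the [lo, hi) band."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 2)
    band = (freqs >= lo) & (freqs < hi)
    return psd[band].mean()

# Synthetic 10 s channel: a 10 Hz alpha rhythm plus noise.
rng = np.random.default_rng(0)
t = np.arange(0, 10, 1 / FS)
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.normal(size=t.size)

alpha = bandpower(eeg, FS, 8, 13)   # classic handcrafted features like these
beta  = bandpower(eeg, FS, 13, 30)  # would feed the sex classifier
```

A per-channel, per-band power vector like this is exactly the kind of interpretable feature set the project description asks to weigh against DCNN-extracted features.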

Requirements for the team
The only hard prerequisite is knowing Python. Any type of ML model can be used, although a special focus on cross-validation of the model is important (due to the nature of brain activity). Additional tools that can be helpful: MNE-Python or EEGLAB. Some libraries that can be used for handcrafted feature generation:

Brainify.AI team paper on age prediction from EEG data.

Automating the Detection and Annotation of O-antigen Operons in Prokaryotic Genomes

Polina Kuchur | Independent researcher
The immune response is activated by structures found on the surface of bacterial cells. One such structure is the somatic antigen, or O-antigen, the distal component of lipopolysaccharides. These O-antigens are essential for bacteria to establish symbiotic relationships with plants. Nonetheless, over the course of evolution, certain pathogenic microorganisms have managed to exploit O-antigens for infecting their hosts. The diversity of this structure has increased significantly over time. Comparative genomics tasks demand quick identification of O-antigen operons. We have previously devised a manual method to locate them; in this hackathon, our aim is to automate this procedure. We will use complete genome assemblies as input, and the desired output is the identification of somatic antigen operons, complete with annotations and images.

We will provide a brief introduction to the organization and annotation of prokaryotic genomes. Additionally, we'll discuss the crucial O-antigen genes and guide you on how to locate them.

  1. We possess a variety of tools suitable for each stage of analysis. Your initial step will be to evaluate these tools and seek out alternatives if needed.
  2. Your next step involves scripting for parsing the output generated by each program and facilitating its transition to the subsequent phase.
  3. Subsequently, you will establish a comprehensive workflow.
  4. Following the workflow creation, you will rigorously test it to ensure its functionality.
  5. You’ll then develop a user manual detailing the usage and functionalities of our workflow.
  6. Finally, you’ll publish the established workflow, or pipeline, on GitHub for public access and further collaboration.

Requirements for the team

At least one team member proficient in Python and Linux
Skills that will definitely speed up the project: Knowledge of prokaryotic genetics and genomics, prior experience with O-antigens, familiarity with Snakemake, GitHub, and data visualization using Python. However, it's worth noting that the hackathon provides an excellent opportunity to learn something new!


Hi-C Copilot

Danil Zilov | Independent researcher
With the help of modern sequencing methods, we have learned to assemble genomes to a sufficiently high resolution. Currently, the majority of the time in genome assembly is spent on curating Hi-C maps. During the hackathon, your task is to develop an MVP assistant tool to facilitate the curation process.

Throughout the project, we will demonstrate the source and structure of Hi-C data, how genome curators work with them, and the main challenges encountered in the process. Additionally, we will show you common artifacts found in Hi-C maps.

1. Gain a comprehensive understanding of Hi-C data, including the concept of a Hi-C map, the biological principles underlying it, and its significance in bioinformatics beyond genome assembly.
2. Acquire the skills to read and interpret Hi-C files, and learn to associate them with the corresponding genome assembly.
3. Develop proficiency in identifying major artifacts present in Hi-C maps, such as incorrect ordering of genomic regions, inversions, and merged chromosomes. Explore techniques to detect these artifacts at the Hi-C file level.
4. Devise an approach for automatic correction of the identified artifacts. This may involve computational algorithms, statistical models, or other relevant techniques to rectify the detected issues in Hi-C files.
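A toy illustration of step 3: on a synthetic contact map with an idealized distance-decay, a planted off-diagonal block (a misjoin-like artifact) can be flagged by an observed/expected ratio. The thresholds here are arbitrary, and real maps need balancing and noise handling first:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Toy contact map: contact frequency decays with genomic distance.
i, j = np.indices((n, n))
expected = 1.0 / (1.0 + np.abs(i - j))
obs = expected * rng.lognormal(0.0, 0.1, size=(n, n))

# Plant a misjoin-like artifact: an off-diagonal block of excess contacts,
# as seen when two distant regions are adjacent in the true genome.
obs[70:80, 20:30] *= 20.0
obs[20:30, 70:80] *= 20.0

# Detector: flag bins whose observed/expected ratio is far above 1
# well away from the diagonal. Thresholds (5x ratio, 20 bins) are arbitrary.
ratio = obs / expected
far = np.abs(i - j) > 20
flags = (ratio > 5.0) & far
hit_rows = np.unique(np.where(flags)[0])
```

An MVP assistant could surface such flagged bin pairs to the curator as candidate breakpoints rather than editing the assembly automatically.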

Requirements for the team

Minimal: Python.
Would speed up the process: experience with Hi-C, binary files, OpenCV, Java.


Unraveling the Hidden Dance: An Epistatic Exploration of 15 Million+ SARS-CoV-2 Genomes

Aleksey Komissarov | Independent researcher
This comprehensive project entails an in-depth examination of the SARS-CoV-2 virus with a distinctive focus on its lesser-known genetic composition. Our investigative endeavor will concentrate on the exploration of epistasis, examining the interplay between distinct mutations in the non-Spike genes of SARS-CoV-2. Through detailed data analysis, we aim to discern patterns of co-occurring mutations and comprehend their collective influence on the virus's evolution.

Utilizing advanced data analysis techniques, our goal is to unravel hidden interactions that may elucidate the virus's behavior, its adaptability, its rapid transmission, and its resistance to certain treatments. With the application of machine learning algorithms, we intend to construct predictive models to foresee the possible impacts of these mutations on the virus, thereby informing potential future trends in its evolution.

The culmination of this project will be the generation of a comprehensive report and presentation, encapsulating key insights gleaned from our research. These findings may significantly enhance our strategic response to future viral pandemics.

The following steps will facilitate the achievement of our project goals:
  1. Data Collection: Initiating with the collation of over 15 million SARS-CoV-2 genomes from GISAID and NCBI databases, we aim to assemble as extensive a data set as possible.
  2. Data Preprocessing: This critical phase involves data cleaning and preprocessing, focusing on the detection of mutations in the non-Spike genes of the virus.
  3. Mutation Analysis: With the data duly prepared, we will proceed to identify patterns of co-occurring mutations, furnishing preliminary indications of potential epistasis.
  4. Interpretation of Epistasis: This phase will entail the use of state-of-the-art analytical tools to delineate the interactions between co-occurring mutations, determining whether they are collaborative or competitive.
  5. Predictive Modeling using Machine Learning: Here, we will employ machine learning techniques to construct predictive models to estimate the potential effects of these mutation interactions on the future evolution of the virus.
  6. Model Evaluation and Optimization: Subsequently, we will assess the models for accuracy, performing necessary modifications to ensure optimal performance.
  7. Data Representation: Having obtained a wealth of insights, we will craft a comprehensive report, presenting the data in an informative and compelling narrative, supported by visual aids to better communicate our findings.
  8. Knowledge Dissemination: The final stage involves converting our report into an engaging presentation, elucidating our research journey, its discoveries, and its implications for strategic planning in future pandemic scenarios.
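The co-occurrence screen of step 3 can be as simple as a Fisher's exact test per mutation pair; the counts below are invented for illustration:

```python
from scipy.stats import fisher_exact

# Invented presence/absence counts for one pair of non-Spike mutations
# across a genome collection.
#                  B present  B absent
table = [[900,        100],   # mutation A present
         [ 50,       8950]]   # mutation A absent

odds_ratio, p = fisher_exact(table, alternative="two-sided")
# A large odds ratio with a small p-value means the mutations co-occur far
# more often than expected: a candidate epistatic pair for step 4.
```

At 15M+ genomes, phylogenetic relatedness inflates co-occurrence, so candidate pairs from this screen still need correction for shared ancestry in the interpretation phase.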

Requirements for the team
Required Skills:
1. Python Programming: essential for handling, cleaning, analyzing, and visualizing data, as well as building machine learning models.
2. Data Analysis: ability to analyze and interpret complex data sets. Familiarity with Python libraries like pandas, NumPy, and matplotlib is a must.
3. Bioinformatics: basic understanding of genomic data, mutations, and viral genetics. Familiarity with tools for genomic data analysis will be beneficial.
4. Machine Learning: knowledge of machine learning algorithms, how to implement them using libraries such as scikit-learn, and how to evaluate and refine their performance.

Skills That Can Speed Up the Project:
1. Advanced Bioinformatics: proficiency in handling large genomic datasets and familiarity with specialized tools for genomic data analysis can expedite the data processing and analysis phases.
2. Experience with Linux and Parallel Programming: knowledge of Linux can aid in handling the large data sets and running complex machine learning models more efficiently. Ability to design and implement parallel algorithms and pipelines can significantly speed up data processing and model training phases.
3. Domain Knowledge in Virology: understanding of viral genetics, specifically regarding SARS-CoV-2, can provide valuable insights during data analysis and model building.
4. Data Visualization Expertise: proficiency in creating engaging and informative visualizations using libraries like seaborn, Plotly, or tools like Tableau can expedite the reporting process.
5. Communication Skills: strong written and verbal communication skills can speed up the final reporting and presentation preparation stage.


Beyond T2T Human Assembly: Resolving Pathogenic Variants in Multicopy Gene Families using WGS

Aleksey Komissarov | Independent researcher
The advent of the T2T (Telomere-to-Telomere) human assembly has revolutionized the field of genomics by providing fully assembled genes, including segmental duplications and multicopy genes. These regions, which were previously challenging to assemble, often contain crucial genes with significant implications for human health. In this two-day hackathon project, your task is to estimate the number of pathogenic variants present in these multicopy genes that can be resolved using short reads generated by Whole Genome Sequencing (WGS). Additionally, you need to determine the minimum read length required for accurate resolution of variants in these genes. Furthermore, you will investigate how many known variants are located within regions that are hard to resolve uniquely.

Identification of challenging loci in the hg38 and T2T assemblies (e.g., via the longest common substring under Hamming distance problem, and interval trees):
  1. Identify genomic regions in the hg38 assembly with sequence repeats exceeding the length of raw reads and/or popular insertion sizes.
  2. Identify loci that are single copy in hg38 but multicopy in the T2T assembly.
  3. Perform an intersection analysis with ClinVar variants to identify variants with known effects.
  4. Identify regions within multicopy genes that present challenges for unique resolution.
  5. Assess the impact of unresolved variants in these regions on clinical interpretations and potential disease associations.
If these tasks are completed quickly, additional tasks can be solved.
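Step 3 reduces to interval intersection. A dependency-free sketch with hypothetical coordinates; at real scale, an interval tree (e.g., the intervaltree package) or a bedtools-style sorted sweep is preferable:

```python
# Half-open [start, end) genomic coordinates; all positions are hypothetical.

def variants_in_regions(variants, regions):
    """Return (pos, name) variants that fall inside any listed region."""
    hits = []
    for pos, name in variants:
        # linear scan is fine for a sketch; use bisect or an interval tree
        # when intersecting millions of ClinVar records
        if any(start <= pos < end for start, end in regions):
            hits.append((pos, name))
    return hits

repeat_regions = [(1000, 5000), (20000, 26000)]  # hard-to-resolve loci
clinvar = [(1500, "varA"), (7000, "varB"), (25000, "varC")]
hits = variants_in_regions(clinvar, repeat_regions)
```

The resulting hit list is exactly the set of known variants whose clinical interpretation may be unreliable from short reads (steps 4–5).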

Requirements for the team
Required Skills:
1. Python Programming: essential for handling, cleaning, analyzing, and visualizing data.
2. Data Analysis: ability to analyze and interpret complex data sets. Familiarity with Python libraries like pandas, NumPy, and matplotlib is a must.
3. Bioinformatics: basic understanding of genomic data, mutations, and human genetics. Familiarity with tools for genomic data analysis will be beneficial.

Skills That Can Speed Up the Project:
1. Advanced Bioinformatics: proficiency in handling large genomic datasets and familiarity with specialized tools for genomic data analysis can expedite the data processing and analysis phases.
2. Familiarity with genomic assemblies: prior knowledge and experience working with genomic assemblies like hg38 and T2T will facilitate the identification of challenging loci and the interpretation of results.
3. Familiarity with ClinVar data: prior knowledge and experience working with human variants will facilitate the identification of challenging loci and the interpretation of results.
4. Data Visualization Expertise: proficiency in creating engaging and informative visualizations using libraries like seaborn, Plotly, or tools like Tableau can expedite the reporting process.
5. Communication Skills: strong written and verbal communication skills can speed up the final reporting and presentation preparation stage.


Predicting multifactorial phenotypes: better breeding

Lavrentii Danilov, Mikhail Rayko | Independent researchers
A large number of organismal traits are characterized by their polygenic nature, i.e., the manifestation of phenotypes depends on several mutations. There are a number of methods that make it possible to determine so-called epistatic pairs: associated SNPs that affect the same trait. In this project, we want to analyze the results of such methods and check whether they can be used to determine the optimal sets of SNPs for obtaining a given phenotype.

  1. Parse the results of pairwise association analysis of SNPs (e.g., MIDESP result) associated with a particular quantitative phenotype
  2. Represent the results as a graph of associated SNPs and loci
  3. Devise an algorithm that looks for potential SNP combinations to maximize/minimize the manifestation of the original phenotype
  4. Wrap it all up in a convenient pipeline, yielding an application for SNP-guided, CRISPR-associated selection
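Steps 2–3 can be sketched with invented pair effects: build the SNP graph, then greedily grow the set with the best summed effect. A real tool would also account for linkage and the direction of each allele's individual effect:

```python
from collections import defaultdict

# Invented pairwise effects of SNP pairs on the phenotype (e.g. parsed from
# MIDESP output); positive means the pair pushes the trait up.
pair_effects = {
    ("snp1", "snp2"): +0.8,
    ("snp1", "snp3"): -0.4,
    ("snp2", "snp4"): +0.5,
    ("snp3", "snp4"): +0.1,
}

# Adjacency view of the SNP graph.
graph = defaultdict(dict)
for (a, b), w in pair_effects.items():
    graph[a][b] = w
    graph[b][a] = w

def greedy_select(graph, k):
    """Greedily grow a SNP set that maximizes the summed pairwise effects."""
    (a, b), _ = max(pair_effects.items(), key=lambda kv: kv[1])
    chosen = {a, b}
    while len(chosen) < k:
        candidates = {n for s in chosen for n in graph[s]} - chosen
        if not candidates:
            break
        gain = {c: sum(graph[c].get(s, 0.0) for s in chosen) for c in candidates}
        best = max(gain, key=gain.get)
        if gain[best] <= 0:  # stop when no candidate improves the total
            break
        chosen.add(best)
    return chosen

selection = greedy_select(graph, k=3)
```

Greedy selection is a heuristic; an exact search (or ILP) over small graphs would be a natural upgrade inside the final pipeline.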

Requirements for the team
Knowledge of at least one programming language (Python, Java), fundamentals of genetics. Ideally, experience with GWAS data and graph analysis.

Auxological predictors of intrauterine fetal growth restriction

Dmitrii Zhakota | Independent researcher
Fetal growth restriction (FGR) is a significant medical and demographic problem. Specialists in ultrasound, obstetrics and gynecology, neonatology, and pathological anatomy are searching for metrics to diagnose FGR.

Despite the multidisciplinary approach, it has not been possible to develop conclusive criteria that can be used by all specialists at any stage of fetal development. We will test hypotheses about the accuracy of FGR assessment from the perspectives of pathologists and neonatologists using a dataset of adverse pregnancy outcomes.

Task 1: Exploratory data analysis
Objective: Generate basic summary tables and visualizations of the data. Get preliminary insights into associations between FGR events and various features.

Task 2: Inferring FGR probability
Objective: Using conventional biostatistical approaches, estimate the significance and effect size of various factors potentially impacting FGR.

Task 3: Assessment of existing scores
Objective: Based on the dataset, estimate the applicability of various existing scores to FGR diagnostics.

Task 4: Development of new diagnostic score for FGR
Objective: apply modern biostatistics and machine learning approaches to develop a new score that outperforms existing ones.
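As a toy illustration of the conventional biostatistics mentioned in Task 2, an odds ratio with a Wald confidence interval can be computed directly from a 2x2 table (all counts below are made up):

```python
import math

# Hypothetical 2x2 contingency table: risk factor exposure vs. FGR outcome.
#            FGR   no FGR
# exposed     20      80
# unexposed   10     190
a, b, c, d = 20, 80, 10, 190

odds_ratio = (a * d) / (b * c)
log_or_se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * log_or_se)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * log_or_se)
```

In practice this would be done per factor with adjustment for confounders, e.g. via logistic regression.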

Requirements for the team
Knowledge of the R/Python language and skills in data visualization, statistics and machine learning are desirable. No medical knowledge required – we will understand everything on the spot through simple examples and analogies.

Retrospective analysis of complex disease trajectories in patients after stem cells transplantation

Ivan Moiseev | Independent researcher
Graft-versus-host disease (GvHD) is a complication that may occur after bone marrow transplantation. GvHD is characterized by a high prevalence and a substantial probability of death, which is why the analysis of various factors affecting GvHD probability remains a topical problem in modern oncohematology. In this project, we propose to analyze a unique dataset on GvHD diagnosis and treatment collected across 4 transplant centers in Russia. No statistical analysis of the data has been done yet, so the team will take the first and important step in mining new insights in this field.

Task 1: Gather summary statistics of the data (distribution of age/sex/nosologies etc. across centers and outcomes).

Task 2: Perform basic survival analysis to understand factors affecting mortality, non-relapse mortality (NRM), and response to therapy.

Task 3: Perform advanced analysis to uncover more complex associations, taking into account several lines of therapy and confounding factors in a multi-parameter space.
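A minimal Kaplan-Meier estimator illustrates the kind of survival analysis Task 2 refers to; a real analysis would use R's survival package or Python's lifelines, and the toy follow-up data below are invented:

```python
# From-scratch Kaplan-Meier (product-limit) survival estimate, for illustration only.
def kaplan_meier(times, events):
    """times: follow-up time per patient; events: 1 = event observed, 0 = censored.
    Returns (time, survival probability) pairs at each distinct time point."""
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)  # events at t
        n = sum(1 for ti in times if ti >= t)                               # at risk at t
        if d:
            surv *= 1 - d / n
        curve.append((t, surv))
    return curve

# Toy follow-up times (months) for five patients; values are made up.
curve = kaplan_meier([3, 5, 5, 8, 12], [1, 1, 0, 1, 0])
```

Censored patients (event = 0) reduce the at-risk count without dropping the curve, which is exactly what distinguishes survival analysis from naive mortality rates.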

Requirements for the team
Programming in R. Knowledge of non-parametric tests, survival analysis, and machine learning–based modeling.

Application of natural language processing methods for parsing histological and endoscopic studies results in oncohematological patients

Ivan Moiseev | Independent researcher
Graft-versus-host disease (GvHD) is a complication that may occur after bone marrow transplantation. GvHD is characterized by a high prevalence and a substantial probability of death, which is why the analysis of various factors affecting GvHD probability remains a topical problem in modern oncohematology. However, a rich body of data containing free-text results of clinical procedures (mainly histology and endoscopy) remains unused for GvHD analysis, so machine text analysis is required. The goal of the project is to determine key characteristics in text descriptions for the final diagnosis, predictors of response to therapy, and survival. The complexity of the dataset stems from simultaneous endoscopy of the lower and upper gastrointestinal tract in some patients at a single time point, re-biopsies in others, and the presence of GvHD manifestations outside the gastrointestinal tract.

Primary endpoint:
Determine key features in text description of histological and endoscopic studies that predict clinical diagnosis and response to therapy

Secondary endpoints:
- Non-relapse mortality predictors
- Overall survival predictors
- Response predictors
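As a sketch of the machine text analysis involved, free-text reports can be turned into TF-IDF feature vectors; the report snippets below are invented, and production code would rather use scikit-learn's TfidfVectorizer:

```python
import math
from collections import Counter

# Hypothetical report snippets; real inputs would be free-text histology/endoscopy reports.
reports = [
    "crypt apoptosis consistent with gvhd",
    "normal mucosa no apoptosis seen",
    "severe crypt loss suspicious for gvhd",
]

def tfidf(docs):
    """Bag-of-words TF-IDF: term frequency weighted by inverse document frequency."""
    tokenized = [d.split() for d in docs]
    df = Counter(w for doc in tokenized for w in set(doc))  # document frequency per word
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return vectors

vectors = tfidf(reports)
```

These sparse vectors can then feed the multivariate models and clustering mentioned in the requirements.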

Requirements for the team
Experience in R/Python programming with:
- Machine text analysis
- Multivariate modeling
- Cluster analysis

Analysis of T-cells immunological landscape and its association with graft-versus-host disease

Mikhail Drokov | Independent researcher
Transplantation of allogeneic hematopoietic stem cells allows achieving a biological cure in patients with leukemia. But the main reason for the decrease of disease-free survival and quality of life is chronic graft-versus-host disease (chronic GVHD), which develops in more than half of patients. Chronic GVHD remains a clinical diagnosis but attempts are being made worldwide to find diagnostic markers of this complication. The pathogenesis of this condition is not fully understood, but the role of T-lymphocytes has been proven.

A large dataset of immunological data (the subpopulation composition of T-lymphocytes, 162 subpopulations) is presented for 70 patients after stem cell transplantation. Chronic GVHD developed in 28 of these 70 patients.

Purpose: application of modern data processing and visualization technologies for the prediction and characterization of chronic GVHD.

Task 1: basic preprocessing of initial dataset

Task 2: choice and implementation of dimensionality reduction methods for visualization of the data

Task 3: clustering of T-cell profiles in order to estimate basic “immunological portraits” of the patients.

Task 4: Find associations between immunological portrait and GvHD development.
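Task 2 could be sketched with plain PCA via SVD; the random matrix below is only a stand-in for the real 70 x 162 table of subpopulation frequencies:

```python
import numpy as np

# Toy stand-in for the real table: 70 patients x 162 T-cell subpopulation frequencies.
rng = np.random.default_rng(0)
X = rng.random((70, 162))

# PCA via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:2].T                  # each patient projected onto the first two components
explained = S**2 / (S**2).sum()      # fraction of variance per component
```

The two-dimensional projection is what would be plotted and then clustered in Task 3; nonlinear methods such as UMAP or t-SNE are common alternatives.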

Requirements for the team
- experience in statistical analysis in R/Python
- machine learning
- visualization of multi-dimensional data
- cluster analysis

Evaluation of KIR-receptor impact on graft function disruption after allogeneic hematopoietic stem cell transplantation

Ekaterina Mikhaltsova | Independent researcher
Allogeneic hematopoietic stem cell transplantation makes it possible to achieve a cure in patients with diseases of the blood system. Disruption of normal graft function is a group of post-transplant complications that significantly worsens the prognosis and quality of life of patients.

All over the world there are attempts to find diagnostic markers of this complication. The pathogenesis of this condition is not fully understood; there are only sporadic works on the influence of NK cells, and in particular of KIR receptors, on the development of graft failure.

Task 1: Identify the landscape of KIR receptors in donor-recipient pairs and their combinations.

Task 2: Find associations between types of transplantation and KIR landscape.

Task 3: Using biostatistics and machine learning methods identify the relationship between KIR in donor-recipient pairs and transplant outcomes.

Task 4: Find associations of KIR clusters and the function of the transplant.

Requirements for the team
- experience in statistical analysis in R/Python
- machine learning
- visualization of multi-dimensional data
- cluster analysis

Protein structure prediction using deep learning

Aleksei Artemiev | SimpleFold
The goal of the project is to develop a system to predict protein secondary structure using known open-source instruments (ESM, AlphaFold, ProtTrans, and Ankh pretrained LLM models). It should have a good API and visualization functions (additional task). Unfortunately, there is no common and convenient library that provides such functionality.


1. Collect SOTA approaches into a pipeline;
2. Create a convenient inference pipeline (similar to Hugging Face);
3. Develop user friendly API;
4. Embed visualization functions (additional task);
5. Prepare an example of the library usage for a protein in FASTA format.
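The inference pipeline and API (items 2–3) could look roughly like the sketch below; every name here (SecondaryStructurePipeline, "esm2_small") is a hypothetical placeholder, and the prediction logic is a stub:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    sequence: str
    structure: str  # one secondary-structure label (H/E/C) per residue

class SecondaryStructurePipeline:
    """Hypothetical Hugging Face-style callable wrapping a pretrained backbone."""

    def __init__(self, backbone: str = "esm2_small"):
        self.backbone = backbone  # which pretrained model to wrap

    def __call__(self, sequence: str) -> Prediction:
        # Placeholder: a real implementation would embed the sequence with the
        # chosen pretrained model and decode a per-residue label sequence.
        return Prediction(sequence, "C" * len(sequence))

pred = SecondaryStructurePipeline()(
    "MKVL"  # sequence would normally be read from a FASTA file (item 5)
)
```

The design point is a single callable object per model family, so that swapping ESM for Ankh changes only the backbone argument, not user code.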

Requirements for the team
Software Engineers (Python, Flask/Django, storage), Data Scientists (data manipulation and visualization), Bioinformaticians (protein engineering/design, sequence manipulation algorithms), ML Researchers and Engineers (NLP: LLM models, RNNs), and domain experts are welcome to join!

Identification of New Aging Biomarkers using RNA-seq data analysis and Machine Learning

Iaroslav Mironenko | Independent researcher
This bioinformatics project integrates RNA-seq data analysis and machine learning to identify new potential biomarkers of aging. By applying rigorous computational methods and validation experiments, the project aims to contribute to our understanding of the aging process and potentially provide insights into the development of therapies targeting age-related diseases.

Data Collection and Preprocessing:

Acquire RNA-seq data from publicly available aging-related studies, ensuring diverse tissue types, age-related diseases, and age groups are included.
Conduct quality control measures to remove outliers and low-quality samples, which may involve examining sequencing metrics, removing samples with low read counts, or checking for batch effects.
Perform read alignment against a reference genome and quantify gene expression levels using tools such as STAR, HISAT2, or Salmon.

Differential Expression Analysis:

Employ statistical methods like DESeq2, edgeR, or limma to identify genes that exhibit significant differential expression with age.
Apply appropriate statistical tests to determine significant age-related gene expression changes, considering factors such as fold-change and adjusted p-values.
Generate a list of potential aging-associated genes based on the statistical significance of their expression changes.
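A toy illustration of the fold-change part of this step (gene names and expression values are made up; a real analysis would rely on DESeq2, edgeR, or limma, which also model dispersion and produce adjusted p-values):

```python
import math

# Hypothetical mean expression values (young, old) for three genes; numbers are illustrative.
expression = {"CDKN2A": (10.0, 42.0), "GAPDH": (100.0, 98.0), "LMNB1": (50.0, 20.0)}

def log2_fold_change(young, old, pseudocount=1.0):
    """log2 fold change with a pseudocount to avoid division by zero."""
    return math.log2((old + pseudocount) / (young + pseudocount))

changes = {gene: log2_fold_change(*means) for gene, means in expression.items()}
```

Genes would then be ranked by fold change together with their adjusted p-values, not by fold change alone.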

Feature Selection and Dimensionality Reduction:

Employ feature selection techniques to reduce the number of genes and focus on the most informative features for aging prediction.
Utilize methods like variance thresholding, mutual information, or recursive feature elimination to identify genes that contribute most to age-related variations.
Perform dimensionality reduction techniques such as PCA or t-SNE to visualize the data in lower-dimensional space and detect underlying patterns or clusters.
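Variance thresholding, the simplest of the listed feature selection techniques, can be sketched directly (toy data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))   # toy expression matrix: 40 samples x 200 genes
X[:, :50] *= 0.01                # make the first 50 genes near-constant

def variance_threshold(X, t):
    """Keep only genes whose variance across samples exceeds t."""
    keep = X.var(axis=0) > t
    return X[:, keep], keep

X_filtered, keep = variance_threshold(X, 0.1)
```

Near-constant genes carry almost no information about age, so dropping them shrinks the feature space before the more expensive selection methods run.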

Machine Learning Model Development:

Find new, previously unknown biomarkers using machine learning, based on their similarity to the biomarkers identified through RNA-seq data analysis.

Split the data into training and testing sets to evaluate model performance.
Explore a range of machine learning algorithms, such as random forest, support vector machines, logistic regression, or neural networks.
Train the models using the training set, optimizing hyperparameters through techniques like grid search or Bayesian optimization.
Assess the performance of the models using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, or area under the receiver operating characteristic curve (AUC-ROC).
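The evaluation metrics can be computed by hand as a sanity check; the labels below are illustrative, and in practice scikit-learn's metrics module would be used:

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for binary labels, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy held-out labels vs. predictions (illustrative values).
metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```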

Biomarker Validation and Interpretation:

Validate the selected machine learning model using independent datasets or through cross-validation techniques to assess its generalization performance.
Interpret the model's feature importance scores or coefficients to identify the most informative genes (biomarkers) for aging prediction.
Conduct gene ontology enrichment analysis or pathway analysis to explore the biological functions and pathways associated with the identified biomarkers, gaining insights into the underlying mechanisms of aging.
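The enrichment test underlying gene ontology analysis is typically a hypergeometric test, which can be sketched directly (the gene counts below are toy numbers; tools like g:Profiler or clusterProfiler apply the same test at scale with multiple-testing correction):

```python
from math import comb

def enrichment_p(k, n, K, N):
    """Hypergeometric P(X >= k): chance of seeing at least k genes with a GO term
    among n selected biomarkers, when K of the N genes in the universe carry it."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# Toy numbers: all 5 selected genes carry a term that 5 of 10 universe genes have.
p = enrichment_p(5, 5, 5, 10)
```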

Requirements for the team
R or Python programming, experience using NCBI resources (particularly GEO), and RNA-seq data analysis skills.

LLM on the lam: are pre-trained embeddings good representations to argue about protein stability?

Tatiana Malygina, Mariia Lomovskaia | Helmholtz-Institute for Pharmaceutical Research Saarland, independent researcher
Currently there is a zoo of large language models available, and some of them can even speak the language of proteins. There are also numerous transformer-based models with a geometric deep learning flavor that utilize protein structural information. In this project we invite the participants to play with some of them, apply them to the mini-proteins dataset by Rocklin et al., and see which approaches are better for this task: the ones based on human knowledge (these can be found in the literature) vs. simple ones built on features learnt by an LLM that was given 200 TPUs and has seen too much (the participants will try to implement the latter during the hackathon).

The dataset by Rocklin et al. can be utilized in different settings. It contains pairs of protein sequences (wildtype and mutated) with the corresponding stability scores. PDB files with the protein structures are also available. For the simpler setting, which is doable during the hackathon, I propose using the protein sequence pairs.
1. Pick one or several protein language models. I suggest using ESM2 (there is a relatively small model available) and/or Ankh model by Rostlab (the participants might want to use something different here).
2. Exploratory analysis: compute the embeddings produced by pretrained protein language model; try to do dimensionality reduction and see how the data with different secondary structure types or different stability score values are packed in the embedding space.
3. Solve protein stability prediction as a metric learning task.
4. Compare your team’s stability prediction results with those of other researchers, which can be found in the literature.
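As a warm-up for steps 2–3, the distance between wildtype and mutant embeddings can be probed with cosine similarity; the random vectors below merely stand in for real model embeddings:

```python
import numpy as np

# Stand-ins for pretrained-LM embeddings of a wildtype/mutant sequence pair; real
# vectors would come from e.g. ESM2 (the small esm2_t6_8M model uses 320-dim embeddings).
rng = np.random.default_rng(2)
wt_emb = rng.normal(size=320)
mut_emb = wt_emb + 0.05 * rng.normal(size=320)  # a point mutation perturbs the embedding slightly

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine(wt_emb, mut_emb)
```

In a metric learning setup, a distance like this would be trained so that embedding distance tracks the change in the stability score between wildtype and mutant.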