Projects | Public Health Hackathon 2025

Public Health Hackathon'2025

Projects
Kazakhstan, Almaty | 8 – 10 August 2025

If you haven’t received the email with the project selection form,

please contact the organizers: hack@bioinf.institute

This project invites participants to conduct an exploratory data analysis based on survey data collected by NAC Analytica research center — a cross-sectional monthly survey of Kazakhstani residents conducted from 2016 to 2021. Participants are encouraged to carry out a mini-research project in the field of sociology of health or public health, using the provided dataset. There is no predefined research question. We leave space for creativity and independent inquiry, but the teamwork will be organized by mentor (Vladimir Kozlov, Nazarbayev University). As a final output, we expect a Tilda-based website or a set of presentations that clearly communicate your findings and insights.

Requirements for the team:
1. Basic programming skills in R and / or Python.
2. Skills in statistics.

Sepsis remains a leading cause of morbidity and mortality in intensive care units (ICUs) worldwide. Timely recognition and intervention are critical for improving outcomes, yet the complexity of clinical data and the rapid progression of the condition make early diagnosis challenging. A well-designed decision support system (DSS) can assist healthcare professionals by analyzing patient data in real time and offering evidence-based recommendations for diagnosis and treatment.

The goal of this project is to develop, implement, and validate a clinical decision support system aimed at assisting intensive care unit physicians in the early identification of septic patients at high risk of fatal outcomes. The system will integrate patient monitoring data, laboratory test results, and information on comorbidities to generate timely alerts and support clinical decision-making.

Tasks:
1. Data preprocessing
2. Feature selection
3. Model development
4. Model evaluation and validation

Requirements for the team:
R, Python, GitHub, Machine Learning Alghorithms.

The project aims to develop an AI-powered R/Python package that automatically generates natural-language interpretations of statistical tables (e.g., regression outputs, summary statistics, p-values, confidence intervals). By leveraging large language models (LLMs), the tool will assist researchers, clinicians, and data analysts in understanding and communicating the meaning of statistical results, reducing misinterpretations and increasing accessibility of quantitative findings. The system will be adaptable to various statistical frameworks and support context-sensitive interpretation with options for technical or layman output.

Tasks:
1. Investigate typical table formats of popular statistical tools (e.g., from gtsummary, broom, stargazer etc.)
2. Build preprocessing module to:
- detect statistical test/model type;
- extract key figures (e.g., coefficients, p-values, CI);
- identify variable types (categorical, continuous).
3. Design dynamic prompts tailored to statistical content (descriptive statistics, hypothesis testing, estimations etc.) as well as a domain field.
4. Create R wrappers for seamless integration in notebooks or reports.
5. Combine the result into R package and share via GitHub.

Requirements for the team:
The following competencies should be covered by team members:
1. Good level in R programming.
2. Experience in statistical data analysis.
3. Expertise in at least one biological or medical field
4. Interest to LLM-based AI models.

Prediction of the effect of a genetic variant on the gene (protein) function is one of the most important tasks in genetics (and, especially, medical genetics). While current software tools (including deep learning-based ones) have become extremely good at solving the task of predicting the deleterious effect of mutations for protein-coding genes, there is still a need for a tool that could provide an aggregated, condensed representation of genetic variation patterns for individual genes (for example, showing all known genetic variants and their frequencies on a 2D/3D protein structure, aggregating information from both databases and scientific literature). In this project, the main goal is to create such a tool.

Tasks:
1. Conduct a research into the available tools and datasets that provide aggregate information about genetic variation, information on gene/protein structures;
2. Design an approachf for automatically retrieving information for a given gene from all of the datasets/literature (e.g., using APIs/NLP methods (for literature));
3. Create a function for visualizing the distribution of genetic variants from different sources on the same diagram that shows a detailed 2D structure of a gene (including protein domain boundaries, important motifs, etc.);
4. Develop a method for representing genetic variation patterns on a 3D protein structure;
5. Implement a simple web application that provides a user with 2D/3D pictures of genetic variation patterns for a given gene.

Requirements for the team:
Most team members should have good understanding of genetics and molecular biology. Some team members should have good data visualization skills (in R/Python).

Experience with handling 3D protein structures is required for at least one team member. Similarly, at least one member of the team is required with knowledge of modern natural language processing (NLP) methods.

Some members of the team should have basic skills in setting up simple web application.

Rapid advancement of genomic technologies has led to accumulation of massive amounts of data regarding gene function, evolution, and beyond. While there are many databases that store information about genes, there is no one-stop resource that aggregates all of the available gene-level information, including the vast number of numerical characteristics. In this project, the team will work on aggregation of diverse gene-level features (metrics of evolutionary conservation, within-population variation, expression patterns, and beyond) to construct an interactive service for browsing all of the available information about human genes.

Tasks:
1. Conduct a thorough literature-based research into the available gene-level features, including matrics related to gene and protein function, genetic variation, evolution, expression, phenotype, etc.;
2. Construct an aggregated dataset that combines all of the features for all human genes;
3. Conduct a detailed exploratory analysis of the data to determine clusters of reelated metrics, features that predict the gene function/phenotype, etc.;
4. Construct an interactive web-service that allows browsing the collected data.

Requirements for the team:
Most team members would need good understanding of molecular biology and basics of evolutionary biology/genetics. Most team members should have knowledge of basic statistics, some team members should have machine learning skills. We also need at least one team member that has experience with setting up simple web applications.

Natural selection is the most important driving force of evolution which can both eliminate deleterious alleles from the population and lead to spread or fixation of beneficial ones. A plethora of methods have been proposed for identification of signatures of positive selection using both individual sequences and information about sequence polymorphism in natural populations. For the human genome, a variety of databases have been established that store the information about positive selection (e.g., PopHumanScan, 10.1093/nar/gky959); however, there is no one-stop resource that aggregates information about signals of recent positive selection detected in other species. The main goal of this project is to create such a database and perform an exploratory analysis of its contents.

Tasks:
1. Conduct an extensive literature search to find all publications reporting recent positive selection signals in different species;
2. Retrieve all available data provided in these publications (e.g., files containing results of positive selection scans, tables listing all significant selection signals);
3. Harmonize the data by converting all identified signatures of selection into genomic intervals, annotate intervals with gene names;
4. Perform an exploratory analysis of the aggregated data (prevalence of different groups of genes, their functional features);
5. Establish a simple web-interface for browsing the data.

Requirements for the team:
Most team members should have a sufficiently strong molecular biology background and a basic understanding of evolutionary biology/genetics. Some team members should have data analysis skills (in either R or Python). Skills with developing simple web applications and database management would be beneficial for some team members.

This project introduces the first framework that simultaneously tackles peptide–MHC (P–M), peptide–TCR (P–T) and peptide–MHC–TCR (P–M–T) link-prediction as a single multimodal graph problem. Uniquely, it:
• Integrates pretrained language models to encode textual annotations and protein/DNA sequences,
• Applies graph contrastive learning to align node embeddings with the underlying interaction topology,
• Leverages knowledge-graph embedding methods for end-to-end link prediction.

The goal is to learn universal node embeddings—the same representations for peptides, MHC alleles and TCRs—that achieve state-of-the-art accuracy and robust generalization on entirely unseen entities. By unifying all three interaction types and fusing multiple data modalities in one graph-based model, this approach goes beyond prior work by incorporating rich textual and chemical information via contrastive pretraining, yielding embeddings that are both highly predictive and broadly transferable.

Tasks:
1. Modality Embedding
• Load cached embeddings:
• Text (epitopes, MHC annotations) → BioBERT
• Protein/Gene sequences → ESM/ProtBert
• Validation: check tensor shapes, data types, and absence of NaNs
• Prepare GNN input by assembling a feature dictionary for each node
2. Fusion Module
• Implement simple averaging of modality embeddings for each node
• Ensure the resulting vector has the correct dimensionality and contains no invalid values
• Integrate the fusion operation into the pipeline immediately before the GNN
3. Graph Neural Network
• Configure a single‐layer GNN (e.g. GraphSAGE or equivalent) to update node embeddings using three edge types
• Verify that the updated embeddings are computed correctly for peptides, MHC, and TCR nodes
4. Prediction Heads & Negative Sampling
• Add three prediction “heads” for P–M, P–T, and P–M–T interactions
• Define a negative‐sampling strategy for each task by replacing one component of each true interaction
5. Training & Optimization
• Build a unified training loop that combines fusion, GNN, and prediction heads
• Select and fix hyperparameters (number of epochs, batch size, learning rate)
• Run a quick trial to confirm that loss decreases and AUC metrics improve
6. Evaluation & Visualization
• Evaluate performance using ROC-AUC and PR-AUC for P–M, P–T, and P–M–T
• Produce qualitative embedding visualizations (t-SNE or UMAP)
• Prepare a report with a table of results and 1–2 illustrative plots
7. Optional (time permitting)
• Add a Graph Contrastive Learning step with a contrastive loss between pre‐ and post‐GNN embeddings
• Implement a KGE model (TransE or DistMult) for direct link prediction as an alternative to the simplified BCE approach

Requirements for the team:
Proficient in Python 3.7+, strong experience with PyTorch, comfortable using HuggingFace Transformers (BioBERT, ESM/ProtBert), understanding of Graph Neural Networks (e.g. GraphSAGE, GAT), basic knowledge of contrastive learning methods on graphs, version control with Git/GitHub, environment and dependency management (conda, virtualenv), basic bash/linux command-line skills.

Goal:
Build actionable insights and predictive tools to improve clinical workflows and optimize drug supply planning for oncology patients.

Challenge Overview:
You’ll work with a rich, real-world dataset from hospital records to tackle a critical problem in oncology care planning. The focus is on patients diagnosed with hematologic and lymphoid malignancies (ICD-10 codes C81–C99). Your mission: uncover treatment patterns and build predictive tools to support data-driven decisions in resource allocation and patient care.

Available Data Sources:
- Drug utilization records
- Laboratory test results
- Medical procedures and manipulations
- Hospitalization data (including admissions, discharges, departments, and financial sources)

Tasks:
1. Map treatment trajectories: identify common care pathways by linking diagnoses with drug use, lab testing, and procedures across different departments.
2. Spot timings: determine which departments have the longest delays from admission to first treatment, lab test, or procedure.
3. Predictive modeling: develop a model to forecast drug usage based on diagnosis, length of stay, department, seasonality, and type of funding (e.g. insurance, government, private).

Requirements for the team:
Experience of working with classical statistical models, ML, prediction and forecasting. It is desirable to have at least one clinician on the team.

PregnAIncy is an AI-powered mobile platform that provides personalized prenatal care support for pregnant women. It offers weekly health recommendations, intelligent Q&A with a GPT-based chatbot, offline SMS support for rural areas, and emergency alert features. pregnAIncy aims to make maternal health accessible, reliable, and intelligent — even in low-resource settings.

Tasks:
1. Problem Research & User Interviews
2. ⁠AI Chatbot Development
3. ⁠Weekly Health Recommendation Engine
4. ⁠Mobile App Development
5. ⁠Emergency Alert System Integration
6. ⁠Voice & SMS Support
7. ⁠UI/UX Design
8. ⁠Pilot Testing & Feedback Collection

Requirements for the team:
1. Programming & Technical Development
2. ⁠Artificial Intelligence Machine Learning
3. ⁠UI/UX Design
4. ⁠Mobile & Offline Communication Systems
5. ⁠Data Privacy & Ethics
6. ⁠Healthcare / Public Health Expertise
7. ⁠Testing & Impact Measurement

This project aims to develop an open-source R-based tool (as a Shiny app, package, or just script with functions) for the statistical planning of diagnostic accuracy studies using ROC curve analysis. ROC curves and AUC (Area Under the Curve) are widely used in public health and clinical epidemiology to assess diagnostic test performance. However, many researchers lack accessible tools to properly calculate required sample sizes, particularly when dealing with non-standard hypotheses or multiple testing corrections.

The proposed tool will allow users to:
1. Estimate the sample size needed to detect whether an AUC exceeds a specified null value (not limited to AUC₀ = 0.5).
2. Compare two AUCs with defined statistical power and significance levels.
3. Adjust sample size calculations to control for multiple hypothesis testing (e.g., using FDR or Bonferroni correction).
4. Calculate the sample size required to achieve a target width of 95% confidence intervals for sensitivity, specificity, PPV & NPV.

The tool can combine analytical formulas (Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982 Apr;143(1):29-36. doi: 10.1148/radiology.143.1.7063747) and simulation-based approaches.

It will be valuable to researchers designing diagnostic studies in public health and clinical settings.

Tasks:
Develop sample size calculation functions for testing AUC > AUC₀, where AUC₀ can be any threshold (e.g., 0.5 or 0.7), with user-defined Type I and Type II error rates.

1. Implement functionality for estimating the required sample size for comparing two ROC curves.
2. Enable sample size estimation based on target width of 95% confidence intervals for sensitivity and specificity.
3. Integrate multiple testing adjustment methods (e.g., FDR control) into the planning process.

Additional: Build a user-friendly interface (Shiny app or documented R package) with pre-loaded templates and example datasets.

Requirements for the team:
– Proficiency in R programming, including function development; Shiny experience is a plus.
– Knowledge of ROC analysis and diagnostic accuracy metrics.
– Understanding of statistical power and sample size principles.
– Experience with simulation studies is beneficial.

Problem:
Up to 50% of patients do not adhere to their medication regimen. Reasons: forgetfulness, lack of reminders, lack of external control. This leads to poorer treatment outcomes and increased healthcare costs.

Goal:
To create a prototype digital solution (mobile PWA application) that allows:
- Patients: receive reminders about taking their medication and register the fact using a QR code;
- Doctors: receive an exportable CSV report on medication intake;
- Biologists/analysts: collect standardized data on behavior during therapy.

Definition of Done:
- PWA application (working offline)
- QR code generation based on the treatment regimen (dosage, schedule)
- QR code scanning — appointment registration
- Notifications (via Notification API or Firebase)
- Storage of appointment history in Firestore
- CSV report export for the doctor
- Documentation + demo + CI pipeline (optional)

Tasks:
0–4 HOURS – UX/UI DESIGN, TECHNICAL DIAGRAM, AND MOCKUPS
- Patient interface design: adding medication, appointment list, QR scanning.
- User interaction scenarios.
- Architectural diagram: React + Firebase + QR.
Responsible: Sociologist, Bioinformaticians 1 and 2

4–10 HOURS – FIREBASE AUTH AND FIRESTORE, BASIC STRUCTURE
- Firebase project setup.
- Authentication implementation (email or anonymous).
- Firestore: users, medications, intakes structures.
Responsible: Bioinformatician 2

10–18 HOURS – QR CODE GENERATION/SCANNING
- QR code generation according to the appointment schedule.
- Implementation of a QR scanner.
- Linking scanning to appointment registration.
Responsible: Bioinformatician 1

18–26 HOURS – INTERFACE FOR REGISTRATION AND ADDING MEDICATIONS
- UI for adding medications (name, dose, time).
- Intake history and filtering.
- Manual and QR intake registration.
Responsible: Bioinformatician 1, Biologist 1

26–32 HOURS – SCHEDULED PUSH NOTIFICATIONS
- Notifications via FCM or Notification API.
- Linking notifications to time from Firestore.
- Configuring notifications via UI.
Responsible: Bioinformatician 2

32–38 HOURS – CSV EXPORT, TESTING
- Report export (name, date, status).
- “Download CSV” button.
- Conducting tests: manual and automated tests.
Responsible: Bioinformatician 2, Biologist 2

38–44 HOURS – UI POLISH, BUG FIXES, DOCUMENTATION
- Improving the interface.
- Fixing bugs.
- Writing project documentation.
Responsible: The entire team

44–48 HOURS – DEMO AND FINAL PRESENTATION
- Preparing the demo and presentation script.
- Summarizing the project results.
- Presenting the benefits for patients, doctors, and researchers.
Responsible: The entire team

Requirements for the team
Biologist
- Knowledge of pharmacology principles and treatment regimens
- Ability to formalize medication regimens
- Understanding of adherence/compliance factors
- Participation in testing medical interfaces
- Basic skills in working with Excel or CSV

Bioinformaticians (2)
- Working with tables and structured medical data
- Knowledge of Python or R (for data analysis and preparation)
- Experience working with CSV and JSON formats
- Skills in creating data collection and preprocessing pipelines
- Basics of data visualization (e.g., matplotlib, seaborn, ggplot2)
- Understanding of adherence, persistency, and drop-off concepts
- Experience preparing data for ML (optional)

Sociologist
- Developing and analyzing questionnaires and UX surveys
- Designing onboarding and user scenarios
- Skills in statistical analysis of survey data
- Working with Google Forms / SurveyMonkey / similar tools
- Basic proficiency in Figma or Miro (for mapping the user journey)
- Writing texts for onboarding and instructions

Tertiary Lymphoid Structures are critical immune hubs within tumors, and their presence is a strong predictor of favorable patient prognosis and response to immunotherapy. This project aims to leverage public spatial transcriptomics (ST) datasets to develop and benchmark computational methods for accurately identifying and characterizing TLS in solid tumors. The primary goal is to build a robust pipeline for processing ST data, applying both signature-based scoring and machine learning models to detect TLS regions. By analyzing the gene expression profiles in their native spatial context, students will gain hands-on experience with cutting-edge bioinformatics techniques and contribute to a clinically relevant challenge in immuno-oncology. The project will focus on exploring the biological heterogeneity of TLS and validating findings using computational deconvolution to confirm the underlying immune cell composition.

Tasks:
1. Identify, download, and preprocess public 10x Visium spatial transcriptomics datasets of solid tumors (e.g., lung, kidney, breast cancer) from repositories like GEO and Zenodo. Perform quality control, normalization, and data structuring.
2. Apply unsupervised clustering algorithms to identify spatially distinct tissue domains. Annotate clusters based on the expression of canonical marker genes for immune (e.g., PTPRC, MS4A1, CD3D) and stromal cells to locate potential lymphoid aggregates.
3. Implement and score published TLS-associated gene signatures (e.g., 12-chemokine signature) on a per-spot basis. Visualize the resulting TLS scores as spatial heatmaps overlaid on the corresponding H&E tissue images.
4. Using datasets with expert annotations (e.g., Zenodo: 10.5281/zenodo.14620362 ), train and validate a supervised machine learning model (e.g., Support Vector Classifier) to predict TLS-positive spots from their gene expression profiles.
5. Use computational deconvolution tools (e.g., Kassandra) to estimate the cellular composition of predicted TLS regions. Validate that these regions are enriched in B-cells, T-cells, and dendritic cells, confirming their biological identity as TLS.

Requirements for the team

Proficiency in Python or R.
Experience with bioinformatics analysis packages, specifically Scanpy for Python or Seurat for R.
Familiarity with machine learning libraries (e.g., scikit-learn, PyTorch), data visualization tools (matplotlib, ggplot2), and version control (Git/GitHub).
A foundational understanding of cancer biology, immunology, and genomics. Prior experience analyzing transcriptomics data (scRNA-seq, bulk RNA-seq) is a strong asset.

The reproducibility crisis in biomedical research undermines the reliability of evidence used for screening, treatment, and public health policy. Non-replicable findings waste resources, misguide interventions, and erode public trust. This issue becomes even more pressing with the rising use of large language models (LLMs) — both for generating results and writing research texts. That’s why we’re proposing REPtile — to tame the LLM and make it serve reproducibility, not destroy it.

The goal is to build a tool that takes a public-health article (or a fragment of it) and returns a probability that the described key result can be independently reproduced.

The core model will be based on a fine-tuned scientific language model (e.g., SciBERT or PubMedBERT), trained on open reproducibility datasets such as the DARPA SCORE market scores.

The result will be a proof-of-concept tool that helps editors, funders, and researchers quickly flag fragile evidence — and, in the long term, could be used to strengthen the scientific basis for health decisions and funding allocation at the governmental level.

Goals are as follows:
1. Build a reproducibility-scoring model that outperforms the out-of-the-box baseline.
2. Estimate the model on held-out manually curated set.
3. Provide model interpretation, so domain experts can understand which parts of the text drive the predictions.
4. Document the full pipeline so it can be scaled or reused once more labeled data becomes available.

Tasks:
1. Data preparation and exploratory data analysis (EDA):
- "Know Your Data" is the core principle of this project. The team is expected to perform EDA on all provided datasets.
- Preprocess and tokenize article fragments if needed.
- Perform a clear train–test split strategy to model a realistic out-of-distribution scenario for the test and validation sets, and avoid leakage.

2. Model fine-tuning
Fine-tune a pre-trained scientific LLM (e.g., SciBERT) to predict reproducibility scores based on claim-level annotations. For lower-resource setups, the lower layers can be frozen, training only the top transformer blocks and a regression head.

3. Benhmarking on validation and test data
Train and compare several setups for validation data (part of the dataset from which training data comes from) and test data (manually estimated reproducibility data):
- Out-of-the-box embeddings + simple regressor
- Fine-tuned model embeddings + simple regressor
- Direct predictions from the fine-tuned model

4. Interpretability and reporting
- Use SHAP, attention maps or other approaches to highlight which parts of the text influenced predictions.
- Summarize results and prepare the report

Requirements for the team:
First and foremost, team members should be genuinely enthusiastic about the project.
As for hard skills required, Python programming and basic data analysis skills with experience in machine learning are an absolute must.

Ideally, team members should have experience in the following areas:
1) Natural Language Processing — understanding of tokenization, embeddings, and transformer-based models.
2) Model evaluation — familiarity with regression metrics (e.g., MAE, RMSE, R²) and benchmarking.
3) Interpretability methods — such as SHAP, attention analysis, saliency maps for text, or other text explanation methods.
4) Reproducible research practices — ability to document code clearly for future use and scaling.

Hands-on experience is strongly preferred over purely theoretical knowledge.

Across the life-span many genes follow reproducible age-dependent trajectories: stress-response genes rise early, mitochondrial transcripts fall late, and dozens of splicing factors shift isoform choice. We will study a complete set of RNA-seq data for several mouse tissues sampled at different ages. By merging these layers we can map, for each tissue, which genes and isoforms turn up or down with age, when the key switches occur, and which variants affect those changes.

Tasks:
– Perform gene and transcript level quantification (command-line tools);
– Call age-linked isoform switches (R packages);
– Fit mixed models per tissue to identify early and late expression and splicing inflection points (R packages, basic ML);
– Combine gene, isoform and SNP signals (R, Python, command-line tools);
– Implement a lightweight dashboard showing time-course plots (R, Python).

Requirements for the team:
– Solid knowledge in Linux, command line and working on remote servers;
– Basic knowledge in molecular biology;
– Basic skills in transcriptomic data analysis;
– Basic skills in data analysis and visualization with Python and R.

HIV Info helps in HIV-related emergencies.

It is a website that helps to reduce the time between when someone suspects they may have HIV and when they start treatment. HIV Info provides simple and easy-to-understand recommendations for emergency prevention, HIV testing, and psychological support.

Tasks:
1. Analyze the clinical recommendations of the Ministry of Health of the Russian Federation and the WHO on HIV infection to develop an appropriate flow;
2. Develop the website and backend architecture (UI/UX, Python, Django, FastAPI, PostgreSQL, Redis) along with an LLM-based chatbot using prompt engineering, RAG, and post-processing to ensure accurate and safe medical advice;
3. Evaluate social and economic outcomes of HIV Info;
4. Evaluate the feasible costs related to the project realization.

Requirements for the team:
1. Basic knowledge of healthcare administration and economics;
2. Knowledge of econometric models for estimating outcomes (CEA, QALY, etc.);
3. UI/UX skills;
4. Python for backend development, data processing, ML integration, and implementing RAG pipelines;
5. Django and FastAPI for efficient web service development and deployment;
6. PostgreSQL and Redis for managing structured healthcare data, caching, and task queues;
7. LLMs and prompt engineering for medical chatbots, with RAG integration and post-processing to ensure accurate and safe advice for a sensitive user group.

Irritable Bowel Syndrome (IBS) is a prevalent functional gastrointestinal disorder associated with microbiome imbalance. Recent research suggests that microbial metabolism of dietary components can influence gut health, yet the role of specific sugar-degrading pathways remains understudied.

This project aims to explore the association between the presence of sulfoquinovose (SQ) degraders in the gut microbiome and IBS status, using publicly available whole metagenome sequencing (WGS) data with clinical metadata. SQ is a sulfosugar derived from plant sulfolipids that reaches the colon and is degraded by specialized microbial pathways (e.g., sulfo-EMP, sulfo-TK, sulfo-TAL).

Our goal is to identify and quantify SQ-degrading bacteria in IBS and healthy individuals and assess whether their abundance correlates with disease status. If successful, this study could reveal novel microbial markers or metabolic features relevant to IBS and diet–microbiota–host interactions.

Tasks:
1. Dataset selection
Identify and download suitable public WGS microbiome datasets with IBS-related metadata (e.g., AGP, iHMP, curatedMetagenomicData).
2. SQ Degrader reference preparation
Curate a reference set of SQ degradation loci (e.g., sulfo-EMP genes) from known gut bacteria (e.g., E. coli, Bacteroides, Bilophila).
3. Read mapping and coverage estimation
Map WGS reads to the SQ loci using bwa or minimap2 and quantify locus coverage using CoverM or similar tools.
4. Statistical analysis
Compare the abundance of SQ degradation loci between IBS and healthy controls. Perform statistical tests and visualize results (e.g., boxplots, heatmaps).
5. (Optional): Machine learning
Use supervised learning (e.g., logistic regression, random forest) to evaluate the predictive value of SQ degrader presence for IBS status.

Requirements for the team:
1. Microbiome / metagenomics experience ( familiarity with shotgun microbiome analysis tools (e.g., bwa, CoverM, Kraken2, etc.))
2. Linux / command line proficiency (ability to run pipelines, manage large datasets, work with FASTQ/FASTA files etc)
3. Python or R programming skills (for statistical analysis and visualization (e.g., pandas, seaborn, ggplot2, MaAsLin2))
4. (Optional) Machine learning skills (experience with scikit-learn or similar libraries for basic model training/evaluation)
5. Team communication (ability to work collaboratively and document steps clearly for reproducibility)

The goal of the project is to achieve early detection and prevention of chronic non-communicable diseases (NCDs) through the use of artificial intelligence. The diseases, such as heart attack, stroke, hypertension, and type 2 diabetes, are the number 1 cause of death in Central Asia.

Tasks:
We aim to develop a machine learning model that predicts the risk of developing non-communicable diseases (NCDs) based on complex data, including demographics, laboratory indicators, behavioral indicators, social determinants, and family history.

• Collect data (indicators)
• Build a basic model (Logistic regression → XGBoost).
• Create a dashboard to visualize risks
• Output interpretable recommendations (SHAP values)
• Multilingual UX for the population

Requirements for the team:
Logistic Regression, XGBoost, SHAP values

Goal:
Empower healthcare providers with tools to flag high-risk patients sooner, improving outcomes through earlier diagnosis and targeted follow-up.

Challenge Overview:
Help clinicians detect serious blood disorders earlier by building predictive models based on real-world clinical data. You'll be working with data from outpatient visits and laboratory tests to identify patterns that signal early stages of oncological hematologic conditions such as multiple myeloma and various types of leukemia (ICD-10: C90, C91, C92, C83, D70).

Available Data Sources:
- Laboratory test results
- Outpatient visit records (including symptoms, referrals, and timelines)

Tasks:
– Predictive modeling: develop a model that estimates the likelihood of a patient receiving a specific hematologic cancer diagnosis based on their lab results and visit history.
– Feature exploration: identify key lab indicators and visit patterns that serve as early warning signs.
– Clinical insight: translate your model’s outputs into interpretable risk profiles that could support early intervention.

Requirements for the team
Experience of working with classical statistical models, ML, prediction and forecasting. It is desirable to have at least one clinician on the team.

I take this project as an opportunity to learn some basics of stylometry and text-mining techniques (preferably interpretable and grounded in statistics rather than in ML) in a light-hearted and tongue-in-cheek setting.

In order to set a clear goal and give the project a more applied dimension, I propose analysing a rich corpus of texts that are officially ascribed to a particularly productive Russian author specialising on writing paper-back detective stories (the author also has a couple of cookbooks in their bibliography ;). The author is often accused of being improbably productive with the implication of employing ghost writers rather than writing the texts on their own. I thus propose analysing the texts using the methods of stylometry and text-mining and verifying the authorship of the texts at hand. If we manage to discover something interesting along the way or come up with additional research questions and goals --- all the better!

Tasks:
1. Learn the basics of statistical stylometry and text-mining and get acquainted with the software solutions that can be used to apply the techniques on practice.
2. Apply the gained skills and knowledge to analyse a corpus of texts with the aim of verifying the claimed authorship.
3. Have fun!

Requirements for the team:
1. Basic programming skills in R and / or Python.
2. Skills in statistics, phylogenetics, text-mining are an advantage, but are not strictly necessary.
3. Basic skills and experience with research (as in googling, reading handbooks, manuals, papers and package documentation).

This public health project aims to develop an accessible web-based tool that calculates personalized genetic risk scores for rheumatoid arthritis (RA) using validated polygenic risk scoring methodology. Users will upload their genetic raw data files (VCF format) obtained from direct-to-consumer genetic testing services, and the application will parse known single nucleotide polymorphisms (SNPs) associated with RA risk to compute a Polygenic Risk Score (PRS).

The tool addresses a critical public health need by making genetic risk assessment more accessible to individuals, potentially enabling early intervention strategies and personalized healthcare approaches for rheumatoid arthritis prevention and management.

Tasks:
Backend Development:
VCF file parser for extracting relevant SNP data
Implementation of validated RA PRS calculation algorithms
Database design for storing SNP weights and population statistics
API development for secure data processing

Frontend Development:
Responsive web interface for file upload
Risk visualization dashboard with charts and interpretable metrics
Educational content about genetic risk and RA
User-friendly report generation

Data Security & Privacy:
Secure file handling and temporary storage protocols
Data encryption and anonymization procedures
GDPR-compliant privacy features

Validation & Testing:
Cross-validation with published RA PRS studies
User experience testing and interface optimization
Performance testing for various file sizes

Requirements for the team
– Experience with genomics tools (PLINK, bcftools, or similar)
– Knowledge of statistical genetics and population stratification

This project aims to identify and characterize patterns of thrombotic microangiopathy (TMA) in patients following hematopoietic stem cell transplantation (HSCT). By analyzing clinical, laboratory, and histopathological data, we seek to improve early recognition of TMA subtypes, understand their associations with transplant-related complications, and inform diagnostic and therapeutic strategies.

Tasks:
– Identify cases of thrombotic microangiopathy (TMA) post-HSCT based on established diagnostic criteria.
– Analyze clinical and laboratory profiles to detect distinct patterns or subtypes of TMA.
– Evaluate associations between TMA patterns and risk factors such as conditioning regimens, GVHD, infections, and immunosuppressive therapy.
– Assess patient outcomes in relation to TMA subtype, timing, and treatment.
– Compare existing diagnostic criteria and their applicability in post-HSCT settings.

Requirements for the team
Data analysis, machine learning, unstructured text processing.

Questions?

hack@bioinf.institute