Team
Victoria Fedulova
Nikita Katermin
Daria Klimova
Fedor Logvin
Anton Sivkov
Xenia Sukhanova
This bioinformatics project integrates RNA-seq data analysis and machine learning to identify new potential biomarkers of aging. By applying rigorous computational methods and validation experiments, the project aims to contribute to our understanding of the aging process and potentially provide insights into the development of therapies targeting age-related diseases.
Tasks
Data Collection and Preprocessing:
Acquire RNA-seq data from publicly available aging-related studies, ensuring that diverse tissue types, age-related diseases, and age groups are included.
Conduct quality control measures to remove outliers and low-quality samples, which may involve examining sequencing metrics, removing samples with low read counts, or checking for batch effects.
Perform read alignment against a reference genome and quantify gene expression levels using tools such as STAR, HISAT2, or Salmon.
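The sample-level quality control step could be sketched as follows. This is a minimal illustration, assuming a raw count matrix with genes in rows and samples in columns; the function names, the toy matrix, and the `min_total_reads` threshold are hypothetical and would need to be tuned to the actual datasets.

```python
import numpy as np

def filter_low_count_samples(counts, min_total_reads):
    """Keep samples (columns) whose total read count reaches min_total_reads.

    counts: 2-D array with genes in rows and samples in columns.
    Returns the filtered matrix and a boolean mask of the kept samples.
    """
    counts = np.asarray(counts)
    library_sizes = counts.sum(axis=0)
    keep = library_sizes >= min_total_reads
    return counts[:, keep], keep

def log_cpm(counts, pseudocount=1.0):
    """Convert raw counts to log2 counts per million (plus a pseudocount).

    Assumes every sample has a non-zero library size, so run it after filtering.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)
    return np.log2(counts / library_sizes * 1e6 + pseudocount)

# Toy matrix (assumed data): 4 genes x 3 samples; the third sample has very few reads.
toy_counts = np.array([
    [500, 450, 3],
    [300, 320, 1],
    [150, 180, 0],
    [50, 50, 1],
])
filtered, kept = filter_low_count_samples(toy_counts, min_total_reads=100)
normalized = log_cpm(filtered)
```

Filtering before normalization avoids dividing by a near-zero library size. Batch effects are not handled here and would need a separate check.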
Differential Expression Analysis:
Employ statistical methods like DESeq2, edgeR, or limma to identify genes that exhibit significant differential expression with age.
Apply appropriate statistical tests to determine significant age-related gene expression changes, considering factors such as fold-change and adjusted p-values.
Generate a list of potential aging-associated genes based on the statistical significance of their expression changes.
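The filtering by fold-change and adjusted p-value could look roughly like this. The sketch does not replace DESeq2, edgeR, or limma; it only assumes that one of these tools has already produced per-gene log2 fold-changes and raw p-values, and it applies the Benjamini-Hochberg adjustment by hand. The gene names, values, and thresholds (`alpha`, `min_abs_lfc`) are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def select_genes(gene_ids, log2_fold_changes, p_values, alpha=0.05, min_abs_lfc=1.0):
    """Keep genes that pass both the adjusted p-value and the fold-change cutoff."""
    padj = benjamini_hochberg(p_values)
    lfc = np.asarray(log2_fold_changes, dtype=float)
    mask = (padj < alpha) & (np.abs(lfc) >= min_abs_lfc)
    return [gene for gene, keep in zip(gene_ids, mask) if keep]

# Assumed example results for four genes.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
lfc = [2.1, 0.3, -1.5, 1.2]
pvals = [0.001, 0.002, 0.01, 0.5]
selected = select_genes(genes, lfc, pvals)
```

In this toy input, `gene_b` is significant but changes too little, and `gene_d` changes enough but is not significant, so only `gene_a` and `gene_c` remain.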
Feature Selection and Dimensionality Reduction:
Employ feature selection techniques to reduce the number of genes and focus on the most informative features for aging prediction.
Utilize methods like variance thresholding, mutual information, or recursive feature elimination to identify genes that contribute most to age-related variations.
Perform dimensionality reduction techniques such as PCA or t-SNE to visualize the data in lower-dimensional space and detect underlying patterns or clusters.
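Variance thresholding followed by PCA might be sketched with scikit-learn as below. The expression matrix is simulated (samples in rows, genes in columns), and the variance threshold of 0.1 is an arbitrary assumption chosen for the toy data, not a recommended value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Simulated, already normalized expression matrix: 30 samples x 50 genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 30, 50
expression = rng.normal(size=(n_samples, n_genes))
# Make the first 10 genes nearly constant so the filter has something to remove.
expression[:, :10] = 5.0 + rng.normal(scale=0.01, size=(n_samples, 10))

# Drop genes whose variance across samples does not exceed the threshold.
selector = VarianceThreshold(threshold=0.1)
informative = selector.fit_transform(expression)

# Project the remaining genes onto two principal components for plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(informative)
explained = pca.explained_variance_ratio_
```

The two columns of `coords` can be plotted against each other and colored by age group or tissue to look for clusters.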
Machine Learning Model Development:
Use machine learning to find new, previously unknown biomarkers based on their similarity to the biomarkers identified in the RNA-seq data analysis.
Split the data into training and testing sets to evaluate model performance.
Explore a range of machine learning algorithms, such as random forest, support vector machines, logistic regression, or neural networks.
Train the models using the training set, optimizing hyperparameters through techniques like grid search or Bayesian optimization.
Assess the performance of the models using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, or area under the receiver operating characteristic curve (AUC-ROC).
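The split, tuning, and evaluation steps above could be sketched with scikit-learn as follows. The data are simulated, the binary "old" versus "young" label is a hypothetical framing of the prediction task, and the random forest with a small grid is just one of the listed options, not a chosen design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Simulated data: 120 samples x 20 genes.
rng = np.random.default_rng(42)
n_samples, n_genes = 120, 20
X = rng.normal(size=(n_samples, n_genes))
# Hypothetical label: "old" (1) vs "young" (0), driven by the first two genes.
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

# Hold out a stratified test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Grid search with 3-fold cross-validation on the training set only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
best_model = search.best_estimator_
test_proba = best_model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
test_f1 = f1_score(y_test, best_model.predict(X_test))
```

Keeping the test set out of the grid search is what makes `test_auc` and `test_f1` usable as estimates of performance on unseen samples.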
Biomarker Validation and Interpretation:
Validate the selected machine learning model using independent datasets or through cross-validation techniques to assess its generalization performance.
Interpret the model's feature importance scores or coefficients to identify the most informative genes (biomarkers) for aging prediction.
Conduct gene ontology enrichment analysis or pathway analysis to explore the biological functions and pathways associated with the identified biomarkers, gaining insights into the underlying mechanisms of aging.
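Ranking genes by importance and testing the top genes for enrichment in a gene set could be sketched as below. The enrichment test is a plain one-sided hypergeometric test written with the standard library; dedicated GO or pathway tools would normally be used instead. All gene names, importance scores, and the pathway membership are made up for illustration.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    X is the number of pathway genes among n_selected genes drawn without
    replacement from n_background genes, n_pathway of which are in the pathway.
    """
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def top_features(names, importances, k):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Assumed inputs: 20 background genes, model importances, one pathway gene set.
background = [f"gene_{i}" for i in range(20)]
importances = [0.30, 0.25, 0.20] + [0.25 / 17] * 17
pathway = {"gene_0", "gene_1", "gene_2", "gene_10", "gene_11"}

biomarkers = top_features(background, importances, k=3)
overlap = len(pathway.intersection(biomarkers))
p_value = hypergeom_enrichment_p(len(background), len(pathway), len(biomarkers), overlap)
```

When many gene sets are tested, the resulting p-values would also need a multiple-testing correction.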
Requirements for the team
R or Python programming, experience with NCBI resources (GEO in particular), and RNA-seq data analysis skills.