This project introduces the first framework that simultaneously tackles peptide–MHC (P–M), peptide–TCR (P–T), and peptide–MHC–TCR (P–M–T) link prediction as a single multimodal graph problem. Uniquely, it:
• Integrates pretrained language models to encode textual annotations and protein/DNA sequences,
• Applies graph contrastive learning to align node embeddings with the underlying interaction topology,
• Leverages knowledge-graph embedding methods for end-to-end link prediction.
The goal is to learn universal node embeddings—the same representations for peptides, MHC alleles and TCRs—that achieve state-of-the-art accuracy and robust generalization on entirely unseen entities. By unifying all three interaction types and fusing multiple data modalities in one graph-based model, this approach goes beyond prior work by incorporating rich textual and chemical information via contrastive pretraining, yielding embeddings that are both highly predictive and broadly transferable.
Tasks:
1. Modality Embedding
• Load cached embeddings:
• Text (epitopes, MHC annotations) → BioBERT
• Protein/Gene sequences → ESM/ProtBert
• Validation: check tensor shapes, data types, and absence of NaNs
• Prepare GNN input by assembling a feature dictionary for each node
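A minimal sketch of the loading-and-validation step. The cache file names, node counts, and embedding dimensions here are illustrative assumptions (768 for BioBERT-base, 1280 for ESM-2 650M); they should match whatever encoder checkpoints were actually used to build the cache.

```python
import torch

def validate_embeddings(emb: torch.Tensor, expected_dim: int, name: str) -> torch.Tensor:
    """Validate a cached embedding matrix: 2-D shape, float32 dtype, no NaNs."""
    assert emb.dim() == 2, f"{name}: expected (num_nodes, dim), got {tuple(emb.shape)}"
    assert emb.shape[1] == expected_dim, f"{name}: dim {emb.shape[1]} != {expected_dim}"
    assert emb.dtype == torch.float32, f"{name}: dtype {emb.dtype} is not float32"
    assert not torch.isnan(emb).any(), f"{name}: contains NaNs"
    return emb

# Stand-ins for the cached tensors; in practice these would come from
# e.g. torch.load("cache/peptide_text_biobert.pt") (hypothetical path).
peptide_text = torch.randn(100, 768)   # BioBERT [CLS] embeddings
peptide_seq = torch.randn(100, 1280)   # ESM mean-pooled sequence embeddings

# Feature dictionary assembled per node type, ready to feed into the GNN.
node_features = {
    "peptide": {
        "text": validate_embeddings(peptide_text, 768, "peptide/text"),
        "seq": validate_embeddings(peptide_seq, 1280, "peptide/seq"),
    },
}
```

The same pattern extends to the MHC and TCR node types with their own modality entries.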
2. Fusion Module
• Implement simple averaging of modality embeddings for each node (projecting each modality to a shared dimension first, since BioBERT and ESM/ProtBert outputs differ in size)
• Ensure the resulting vector has the correct dimensionality and contains no invalid values
• Integrate the fusion operation into the pipeline immediately before the GNN
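A possible shape for the fusion module, under the assumption that modalities are first projected to a shared dimension and then averaged (the 768/1280 input dims and the 256 output dim are illustrative):

```python
import torch
import torch.nn as nn

class MeanFusion(nn.Module):
    """Average modality embeddings after projecting each modality to a
    shared dimension; a direct average is not possible when the encoders
    (BioBERT vs. ESM) emit vectors of different sizes."""

    def __init__(self, modality_dims: dict, out_dim: int):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, out_dim) for name, dim in modality_dims.items()}
        )

    def forward(self, feats: dict) -> torch.Tensor:
        projected = [self.proj[name](x) for name, x in feats.items()]
        fused = torch.stack(projected, dim=0).mean(dim=0)
        # Guard required by the task: correct dim, no invalid values.
        assert not torch.isnan(fused).any(), "fusion produced NaNs"
        return fused

fusion = MeanFusion({"text": 768, "seq": 1280}, out_dim=256)
fused = fusion({"text": torch.randn(10, 768), "seq": torch.randn(10, 1280)})
```

This module slots into the pipeline immediately before the GNN, producing one fused vector per node.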
3. Graph Neural Network
• Configure a single‐layer GNN (e.g. GraphSAGE or equivalent) to update node embeddings using three edge types
• Verify that the updated embeddings are computed correctly for peptides, MHC, and TCR nodes
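One way to sketch the single-layer heterogeneous GNN in plain PyTorch; in practice torch_geometric's `HeteroConv` wrapping `SAGEConv` would be the natural implementation. The edge-type names and sizes below are assumptions, and the three-way P–M–T relation is approximated here with a pairwise MHC→TCR edge type for illustration:

```python
import torch
import torch.nn as nn

class HeteroSAGELayer(nn.Module):
    """One GraphSAGE-style layer over typed edges. All node types are
    assumed to share the same embedding dimension `dim` (the output of
    the fusion module)."""

    def __init__(self, dim: int, edge_types):
        super().__init__()
        self.self_lin = nn.Linear(dim, dim)
        self.nbr_lin = nn.ModuleDict(
            {"__".join(et): nn.Linear(dim, dim) for et in edge_types}
        )

    def forward(self, x: dict, edges: dict) -> dict:
        # x: {node_type: (num_nodes, dim)}; edges: {edge_type: (src_idx, dst_idx)}
        out = {nt: self.self_lin(h) for nt, h in x.items()}
        for et, (src, dst) in edges.items():
            src_t, _, dst_t = et
            agg = torch.zeros_like(x[dst_t])
            deg = torch.zeros(x[dst_t].shape[0], 1)
            agg.index_add_(0, dst, x[src_t][src])            # sum neighbour messages
            deg.index_add_(0, dst, torch.ones(len(dst), 1))  # in-degree per dst node
            out[dst_t] = out[dst_t] + self.nbr_lin["__".join(et)](agg / deg.clamp(min=1.0))
        return {nt: torch.relu(h) for nt, h in out.items()}

edge_types = [("peptide", "binds", "mhc"),
              ("peptide", "recognized_by", "tcr"),
              ("mhc", "presents_to", "tcr")]
layer = HeteroSAGELayer(dim=256, edge_types=edge_types)
x = {"peptide": torch.randn(100, 256), "mhc": torch.randn(30, 256), "tcr": torch.randn(60, 256)}
edges = {et: (torch.randint(x[et[0]].shape[0], (200,)),
              torch.randint(x[et[2]].shape[0], (200,))) for et in edge_types}
updated = layer(x, edges)
```

Verifying the update amounts to checking that every node type comes out with the expected shape and finite values.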
4. Prediction Heads & Negative Sampling
• Add three prediction “heads” for P–M, P–T, and P–M–T interactions
• Define a negative‐sampling strategy for each task by replacing one component of each true interaction
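A hedged sketch of both pieces: a generic MLP head that covers the pairwise and triple cases, and a corruption-based negative sampler. Names (`InteractionHead`, `corrupt_column`) and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    """MLP scoring head over concatenated node embeddings:
    n_parts=2 for the P–M and P–T heads, n_parts=3 for P–M–T."""

    def __init__(self, dim: int, n_parts: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim * n_parts, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, *embs: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat(embs, dim=-1)).squeeze(-1)  # raw logits

def corrupt_column(pos: torch.Tensor, col: int, num_entities: int) -> torch.Tensor:
    """Negative sampling: copy the true tuples and replace the entity in
    `col` (e.g. the TCR slot) with a uniformly random entity of that
    type. Sampled negatives may occasionally collide with true tuples;
    filtering those out is a common refinement."""
    neg = pos.clone()
    neg[:, col] = torch.randint(num_entities, (pos.shape[0],))
    return neg

pmt_head = InteractionHead(dim=256, n_parts=3)
pep, mhc, tcr = torch.randn(32, 256), torch.randn(32, 256), torch.randn(32, 256)
logits = pmt_head(pep, mhc, tcr)

pos_triples = torch.randint(50, (32, 3))          # toy (pep, mhc, tcr) index triples
neg_triples = corrupt_column(pos_triples, col=2, num_entities=60)
```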
5. Training & Optimization
• Build a unified training loop that combines fusion, GNN, and prediction heads
• Select and fix hyperparameters (number of epochs, batch size, learning rate)
• Run a quick trial to confirm that loss decreases and AUC metrics improve
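The loop shape and the "loss decreases" sanity check can be sketched with toy stand-ins: learnable embedding tables replace the fusion + GNN output, a bilinear head scores P–M pairs, and all hyperparameters (epochs, learning rate, counts) are placeholder values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_pep, num_mhc, dim, n_pos = 50, 20, 32, 200
pep_emb = nn.Embedding(num_pep, dim)   # stand-in for fused + GNN peptide embeddings
mhc_emb = nn.Embedding(num_mhc, dim)   # stand-in for fused + GNN MHC embeddings
head = nn.Bilinear(dim, dim, 1)        # stand-in P–M prediction head

# Toy "observed" positive P–M pairs.
pos = torch.stack([torch.randint(num_pep, (n_pos,)),
                   torch.randint(num_mhc, (n_pos,))], dim=1)

params = list(pep_emb.parameters()) + list(mhc_emb.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)         # fixed hyperparameters
bce = nn.BCEWithLogitsLoss()

losses = []
for epoch in range(50):
    neg = pos.clone()
    neg[:, 1] = torch.randint(num_mhc, (n_pos,))  # corrupt the MHC slot
    pairs = torch.cat([pos, neg])
    labels = torch.cat([torch.ones(n_pos), torch.zeros(n_pos)])
    logits = head(pep_emb(pairs[:, 0]), mhc_emb(pairs[:, 1])).squeeze(-1)
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

In the real pipeline the embedding tables are replaced by the fusion module and GNN from Tasks 2–3, and one loss term is accumulated per head (P–M, P–T, P–M–T).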
6. Evaluation & Visualization
• Evaluate performance using ROC-AUC and PR-AUC for P–M, P–T, and P–M–T
• Produce qualitative embedding visualizations (t-SNE or UMAP)
• Prepare a report with a table of results and 1–2 illustrative plots
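The metric computation reduces to a small helper run once per task; PR-AUC is computed here as average precision, a standard choice:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def link_prediction_metrics(pos_scores: np.ndarray, neg_scores: np.ndarray) -> dict:
    """ROC-AUC and PR-AUC from model scores on held-out positive tuples
    and sampled negatives; call once each for P–M, P–T, and P–M–T."""
    y_true = np.concatenate([np.ones_like(pos_scores), np.zeros_like(neg_scores)])
    y_score = np.concatenate([pos_scores, neg_scores])
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        "pr_auc": average_precision_score(y_true, y_score),
    }

# Synthetic scores standing in for model output on a test split.
rng = np.random.default_rng(0)
metrics = link_prediction_metrics(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
```

For the qualitative plots, `sklearn.manifold.TSNE` or the `umap-learn` package can be run on the post-GNN node embeddings, colored by node type.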
7. Optional (time permitting)
• Add a Graph Contrastive Learning step with a contrastive loss between pre‐ and post‐GNN embeddings
• Implement a KGE model (TransE or DistMult) for direct link prediction as an alternative to the simplified BCE approach
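Both optional pieces are compact enough to sketch directly. The DistMult score is the standard trilinear form; the contrastive loss shown is an NT-Xent-style objective (one reasonable choice, not the only one) treating a node's pre- and post-GNN embeddings as a positive pair and other nodes in the batch as negatives:

```python
import torch
import torch.nn.functional as F

def distmult_score(h: torch.Tensor, r: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """DistMult triple score: <h, r, t> = sum_i h_i * r_i * t_i,
    with a learnable relation vector r per edge type."""
    return (h * r * t).sum(dim=-1)

def graph_contrastive_loss(z_pre: torch.Tensor, z_post: torch.Tensor,
                           temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent-style loss aligning each node's pre-GNN embedding with
    its own post-GNN embedding; the other nodes in the batch act as
    in-batch negatives."""
    z1 = F.normalize(z_pre, dim=-1)
    z2 = F.normalize(z_post, dim=-1)
    logits = z1 @ z2.t() / temperature   # (batch, batch) similarity matrix
    return F.cross_entropy(logits, torch.arange(z1.shape[0]))

scores = distmult_score(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64))
loss = graph_contrastive_loss(torch.randn(16, 64), torch.randn(16, 64))
```

The three-way P–M–T case does not fit a binary-relation KGE directly; one common workaround is to treat the (peptide, MHC) pair as a composite head entity, or to fall back to the MLP head for that task.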
Requirements for the team:
• Proficiency in Python 3.7+
• Strong experience with PyTorch
• Comfortable using HuggingFace Transformers (BioBERT, ESM/ProtBert)
• Understanding of Graph Neural Networks (e.g. GraphSAGE, GAT)
• Basic knowledge of contrastive learning methods on graphs
• Version control with Git/GitHub
• Environment and dependency management (conda, virtualenv)
• Basic bash/Linux command-line skills