Trupti Gore & Justin Barton: nuTCRacker: Predicting the Recognition of HLA-I–Peptide Complexes by αβTCRs for Unseen Peptides

The ability to predict which antigenic peptide(s) the αβTCR of a given CD8+ T-cell clone can recognise would represent a quantum leap in the understanding of T-cell repertoire selection and in the development of targeted cell-mediated immunotherapies. Current methods fail to make accurate predictions for antigenic peptides not present in the training dataset. Here, we propose a novel deep learning method called nuTCRacker that makes accurate predictions for a subset of unseen peptides, with an AUC > 0.7 for around a third of peptides evaluated using a large dataset compiled from curated public resources. An additional evaluation was undertaken using a small in cellula-validated dataset of αβTCR-peptide pairs associated with cancer. Our analysis suggests that it is possible to make useful predictions for an unseen peptide provided the training dataset contains: many samples with the same HLA class I molecule as that bound to the peptide; at least one peptide that is similar to the target peptide; and a small number of αβTCRs that are similar to those bound to the unseen peptide of interest.
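To make the headline metric concrete, the sketch below shows one way a per-peptide AUC and the "fraction of peptides with AUC > 0.7" could be computed. It is an illustration only, not the authors' evaluation code: the function names, the record layout `(peptide, label, score)`, and all scores are hypothetical toy values.

```python
from collections import defaultdict

def auc(labels, scores):
    """ROC AUC as P(score of a binder > score of a non-binder); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def per_peptide_auc(records):
    """Group (peptide, label, score) records by peptide and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for peptide, label, score in records:
        grouped[peptide][0].append(label)
        grouped[peptide][1].append(score)
    return {pep: auc(labels, scores) for pep, (labels, scores) in grouped.items()}

def fraction_above(aucs, threshold=0.7):
    """Fraction of peptides (with a defined AUC) whose AUC exceeds the threshold."""
    valid = [a for a in aucs.values() if a is not None]
    return sum(a > threshold for a in valid) / len(valid) if valid else 0.0

# Toy predictions: (peptide, true label, predicted binding score). Values are made up.
records = [
    ("GILGFVFTL", 1, 0.9), ("GILGFVFTL", 0, 0.2),
    ("GILGFVFTL", 1, 0.7), ("GILGFVFTL", 0, 0.4),
    ("NLVPMVATV", 1, 0.3), ("NLVPMVATV", 0, 0.6),
    ("NLVPMVATV", 1, 0.8), ("NLVPMVATV", 0, 0.5),
    ("ELAGIGILTV", 1, 0.1), ("ELAGIGILTV", 0, 0.9),
]
aucs = per_peptide_auc(records)
share = fraction_above(aucs)  # one of the three toy peptides exceeds 0.7
```

Reporting AUC per peptide, rather than pooled over all samples, prevents a few heavily sampled peptides from dominating the summary.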

1 Introduction

The human leucocyte antigen class I (HLA-I) antigen presentation pathway, whereby peptides are presented for inspection by αβT cell receptors (αβTCRs) on the surface of CD8+ T cells, is one of the core elements of the adaptive immune system and plays a crucial role in fighting intracellular pathogens and in the eradication of tumour cells [1]. In addition, the avidity of αβTCRs on thymocytes (the precursors of mature CD8+ T cells in the periphery) for self-antigenic peptides presented by HLA-I molecules in the cortex and medulla of human thymi is considered a key factor in the positive and negative selection processes that underpin central tolerance [2].

Advances in high-throughput sequencing techniques have led to an explosion in the number of known αβTCR sequences, with resources such as iReceptor [3] providing public access to billions of TCR β chains and increasing numbers of paired TCR α and β chains. At the same time, the number of peptides known to be presented by HLA-I molecules is increasing rapidly, partly owing to the development of high-throughput elution-based strategies and increasingly precise bioinformatic strategies for the correct identification of both canonical and noncanonical peptides bound to HLA-I molecules [4-10]. Moreover, computational methods for predicting peptide-HLA-I binding are increasingly efficient and widely used, and have become the core of most pipelines for predicting novel epitope candidates and optimizing vaccine development [11-14].

There remains, however, a critical gap in linking the specific recognition by CD8+ T cells to what is presented by HLA-I molecules, and in determining the degree of cross-recognition of multiple peptide-HLA-I complexes: in all but a tiny fraction of cases, we do not know which peptide-HLA-I complexes a given αβTCR can bind with sufficient affinity to facilitate T cell activation in the periphery, or with the right affinity to pass central tolerance selection in the thymus. This knowledge is essential for understanding the basic nature of CD8+ T cell repertoire formation, and supports cancer immunotherapy design by aiding the identification of TCRs that target patient-specific neoantigens, although CD8+ T cell activation involves more than the complex process of αβTCR recognition of peptide-HLA-I complexes [1, 15, 16]. Combined in silico and in cellula methods exist for identifying the peptide-HLA-I sequences recognised by orphan αβTCRs, although they demand a strong immunological background and skill set from laboratories wishing to take that path [17]. Reliable models able to predict the specific binding of a given αβTCR to peptide-HLA-I complexes would act as a driving force for research in many medical fields. However, from a computational perspective, predicting αβTCR-peptide-HLA-I binding affinity has proved much less tractable than predicting peptide-HLA-I binding affinity. The latter is tightly constrained (in terms of peptide conformation and the presence of distinct binding pockets that accommodate specific peptide sidechains), such that a moderate degree of accuracy is possible even with simple matrix-based approaches. In contrast, αβTCR-peptide-HLA-I binding involves at least a subset of the αβTCR's flexible complementarity determining regions (CDRs) engaging with the comparatively flat surface formed by a peptide-HLA-I complex.

Although some progress in αβTCR-peptide-HLA-I binding prediction has been made using classical machine learning methods such as Random Forest, recent research has focused on deep learning, with numerous methods developed since 2020 that vary in terms of the input data they utilize, how those data are encoded and the choice of deep learning architecture [18, 19] (Table S1). Robust performance comparisons based on these methods’ published levels of prediction accuracy are impossible, given that different training and evaluation data have been used in their development. There is broad acceptance that, so far, useful levels of αβTCR-peptide-HLA-I binding prediction accuracy appear possible only for peptides that occur within the training set, that is, when predicting whether an unseen αβTCR is capable of binding to a given peptide-HLA-I complex that occurs within the training set bound to different TCRs. It has been estimated that as many as 50 to 200 αβTCRs bound to the same target peptide are required to make good predictions with current models [20, 21]. In contrast, generalizing αβTCR-peptide-HLA-I binding prediction to unseen peptides (where no αβTCRs capable of binding to that peptide are present in the training set) remains intractable [22-24]. Unfortunately, the latter is exactly what we would need a predictor to do in order to fully understand, for example, which antigenic peptides (if any) drive CD8+ T cell repertoire selection in human thymi, or which (neo)epitopes are recognised by tumour-infiltrating lymphocytes and could serve as targets for vaccinating cancer patients.
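The distinction between "seen" and "unseen" peptides comes down to how the data are split. The sketch below is a minimal illustration of a peptide-held-out split, in which every sample for a held-out peptide goes to the test set so that no TCR known to bind it is available during training. The function name, the sample layout, and the placeholder CDR3 strings are hypothetical and not taken from the paper.

```python
import random

def split_by_peptide(samples, test_fraction=0.2, seed=0):
    """Hold out whole peptides, so test peptides never appear in training."""
    peptides = sorted({s["peptide"] for s in samples})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(peptides)
    n_test = max(1, round(len(peptides) * test_fraction))
    held_out = set(peptides[:n_test])
    train = [s for s in samples if s["peptide"] not in held_out]
    test = [s for s in samples if s["peptide"] in held_out]
    return train, test

# Toy labelled samples; the CDR3 strings are placeholders, not curated data.
samples = [
    {"peptide": "GILGFVFTL", "cdr3b": "CASSAAAEQYF", "label": 1},
    {"peptide": "GILGFVFTL", "cdr3b": "CASSBBBEQYF", "label": 0},
    {"peptide": "NLVPMVATV", "cdr3b": "CASSCCCEQYF", "label": 1},
    {"peptide": "NLVPMVATV", "cdr3b": "CASSDDDEQYF", "label": 0},
    {"peptide": "ELAGIGILTV", "cdr3b": "CASSEEEEQYF", "label": 1},
    {"peptide": "ELAGIGILTV", "cdr3b": "CASSFFFEQYF", "label": 0},
]
train, test = split_by_peptide(samples, test_fraction=0.34)
```

A random split over individual samples would instead place the same peptide on both sides, which measures the easier seen-peptide setting.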

2 Results

2.1 Dataset Preparation for Model Training

To address these issues, we developed a novel αβTCR-peptide-HLA-I binding predictor (Novel Universal Transformer for Cross-epitope Recognition using Advanced Cross-domain Knowledge Extraction and Representation, nuTCRacker), benchmarked it against other models, and applied it to a small set of αβTCR-peptide-HLA-I combinations tested experimentally in our laboratory.

The first key and challenging step for model development and training is dataset preparation. To this end, we prepared two kinds of datasets (Figure 1). The first contained unlabelled data—both αβTCRs of unknown peptide specificity and peptide-HLA-I data with unknown cognate TCRs—whereas the second contained labelled data with αβTCRs matched to peptide-HLA-I complexes.

The former kind of dataset was used to explore the potential benefits of adopting transfer learning approaches in the context of αβTCR-peptide-HLA-I binding prediction. Bulk αβTCR repertoire data were downloaded from iReceptor [3], selecting only productive sequences (i.e., sequences whose gene rearrangements would produce a functional receptor) from the human loci TRA or TRB with specified CDR3, and V and J genes. The CDR1 and CDR2 were not explicitly required as these are completely determined by the V gene allele. Duplicate sequences were removed, resulting in 410,499 paired (i.e., comprising both the α and β chains of a given αβTCR) and 24,429,131 unpaired (i.e., only a single α or β chain from a given αβTCR) sequences. Additional bulk αβTCR data were downloaded from 10x Genomics (10x Genomics, 2023), and filtered to include only αβTCR data with high confidence, resulting in 201,094 paired sequence records. We found that it was sufficient to pre-train only with the paired sequence records, saving substantial pretraining compute. This is consistent with findings in antibody language models [25]. With respect to the peptide-HLA-I component, bulk data were downloaded from the training set provided with NetMHCpan 4.1 [26] and filtered to include only peptide-HLA-I complexes where the binding has been experimentally confirmed.
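The repertoire filtering described above (productive sequences only, TRA or TRB locus, CDR3 plus V and J genes present, duplicates removed) can be sketched as follows. This is an illustration rather than the authors' pipeline: the field names `productive`, `locus`, `v_call`, `j_call` and `cdr3_aa` are modelled on the AIRR Rearrangement schema as an assumption, since the text does not state the record format, and the toy records are invented.

```python
def filter_repertoire(records):
    """Keep productive TRA/TRB records with CDR3, V and J set; drop duplicates."""
    seen = set()
    kept = []
    for r in records:
        if not r.get("productive"):
            continue  # rearrangement would not yield a functional receptor
        if r.get("locus") not in {"TRA", "TRB"}:
            continue  # only alpha and beta chain loci
        if not (r.get("cdr3_aa") and r.get("v_call") and r.get("j_call")):
            continue  # CDR3, V gene and J gene must all be specified
        key = (r["locus"], r["v_call"], r["j_call"], r["cdr3_aa"])
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        kept.append(r)
    return kept

# Toy records (hypothetical values) exercising each filter.
trb = {"productive": True, "locus": "TRB", "v_call": "TRBV19*01",
       "j_call": "TRBJ2-7*01", "cdr3_aa": "CASSAAAEQYF"}
tra = {"productive": True, "locus": "TRA", "v_call": "TRAV27*01",
       "j_call": "TRAJ42*01", "cdr3_aa": "CAGAAAKLIF"}
raw = [
    trb,
    dict(trb),                    # duplicate
    dict(trb, productive=False),  # non-productive
    dict(trb, locus="TRG"),       # wrong locus
    dict(tra, j_call=""),         # missing J gene
    tra,
]
clean = filter_repertoire(raw)
```

CDR1 and CDR2 are not checked, in line with the text: they follow from the V gene allele.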
