James Sweet-Jones: An antibody developability triaging pipeline exploiting protein language models

Monoclonal antibodies (mAbs) are a successful class of biologic drugs with the potential to treat a wide variety of diseases, owing to their ability to target a specific antigen and therefore potentially any step in a disease pathway.[1,2] As of early 2025, at least 130 mAbs have received regulatory approval from the U.S. Food and Drug Administration or the European Medicines Agency (db.antibodysociety.org/), with at least 42 considered ‘fully human’, derived from transgenic mice, phage display libraries, or cloned from recovering patients.[3–5] The sector has grown by between 20% and 30% per year[6,7] and is likely to continue to grow as interest increases in using antibodies against previously undruggable targets.[8] Despite this, there is a high risk of failure throughout the clinical pipeline for new mAbs, causing costly discontinuation from trials.[9]

Simultaneously, single-cell sequencing techniques have been applied to understand how the antibody repertoire functions and changes over time at the level of single B cells.[10–13] This has given researchers the ability to generate dense digital libraries of paired variable heavy (VH) and variable light (VL) human antibody sequences that vastly outnumber previous databases resulting from sequence or structural data (KabatMan,[14] IMGT,[15] SAbDAb,[16] AbDb[17] and EMBLIg (abybank.org/emblig/)). Online repositories including the Observed Antibody Space (OAS),[18] cAb-Rep[19] and BRepertoire[20] give researchers access to these resources.

With the generation of these in silico databases, efforts to develop screening statistics that identify sequences with physical characteristics similar to those of approved therapeutics have become a driver in the field. Usually, these have been based on antibody developability, which is loosely defined as an antibody’s intrinsic ability to be produced on an industrial scale, to maintain reasonable stability in long-term storage and in patients, and to be safely tolerated by the patient.[21,22] Such considerations have now become important in the early stages of drug screening to select the best-quality candidates and avoid costly late-stage failures.[1] However, although developability is important, it does not guarantee success in clinical trials, where candidates may still face discontinuation for safety or efficacy reasons. Identifying the factors that determine success in clinical trials has also eluded researchers.

Physicochemical features, including surface charged patches, surface hydrophobic patches, low thermostability, and post-translational modification sites that introduce heterogeneity, have become associated with poor antibody developability.[23] Features that compromise the stability of the antibody can cause unfolding, increase the propensity to aggregate in solution, and increase immunogenicity.[24,25] At the lead candidate stage, well-defined experimental assays for measuring these features are important in the selection of a final lead.[26,27] However, it has become useful to predict these features at an earlier stage using computational means. To this end, sequence-based statistics have been developed based on these features and are available for use in drug discovery pipelines, including the Developability Index,[28,29] AbPred,[30] and, more recently, the Therapeutic Antibody Profiler (TAP)[31] and Therapeutic Antibody Developability Analysis (TA-DA Score).[32] However, these tools can fall short in identifying leads from large libraries of data, because they require computationally expensive 3D modeling or take only one antibody at a time, which is usually already expected to be a potential lead candidate.

To take advantage of the wealth of data now available, the field has also turned to machine learning as a new avenue of exploration.[1,33,34] For protein sequences to be suitable inputs for machine learning problems, they must be encoded numerically. Previously, this has been done using evolutionary or physicochemical and structural features,[35–37] with simple regression models used to identify features of high importance or to predict features from the sequence, as done in AbPred.[30] Negron et al.[32] expanded on this work and identified the previously mentioned characteristics, including hydrophobicity (assessed by hydrophobic interaction chromatography), thermostability (Tm, assessed by differential scanning fluorimetry) and aggregation (assessed by cross-interaction chromatography), as being associated with the identification of clinically acceptable mAbs. Their work also demonstrated an ability to separate clinical antibody sequences from antibody repertoires and to assign a developability score based on these features as part of their TA-DA score.
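
To make the idea of numerical encoding concrete, the following is a minimal Python sketch of two simple encodings: a per-residue one-hot matrix and a fixed-length vector of amino acid composition plus mean hydropathy. It is an illustration only and is not the encoding used by AbPred, TA-DA or any other tool named above. The function names and the short example fragment are hypothetical; the hydropathy values are the standard Kyte-Doolittle scale.

```python
# Illustrative sketch only: function names are hypothetical, and this is
# not the feature set of any published developability tool.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (standard published values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _clean(sequence):
    """Upper-case the sequence and reject empty input or non-standard residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    for aa in seq:
        if aa not in KYTE_DOOLITTLE:
            raise ValueError(f"non-standard residue: {aa!r}")
    return seq


def one_hot_encode(sequence):
    """Return one 20-element row per residue, with a single 1 per row."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in _clean(sequence):
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows


def physicochemical_features(sequence):
    """Return 20 composition fractions followed by the mean hydropathy."""
    seq = _clean(sequence)
    composition = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    mean_hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
    return composition + [mean_hydropathy]


if __name__ == "__main__":
    fragment = "EVQLVESGGG"  # hypothetical short fragment, for illustration
    print(len(one_hot_encode(fragment)), "rows of", len(AMINO_ACIDS))
    print(len(physicochemical_features(fragment)), "features")
```

The one-hot matrix grows with sequence length, whereas the composition vector always has 21 entries, which is why summary features of this kind are convenient inputs for simple regression models.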

Many studies, including those described above[28–32] and others[38] as well as reviews,[39–41] have described the importance of predicting developability. Most of these approaches rely on the assumption that clinical antibodies (i.e., approved, discontinued and in-development mAbs) share a range of developability-related properties, such that novel antibodies with similar (predicted) properties will also be clinically acceptable. This is not to say that antibodies with very different properties will necessarily fail in the clinic, but that such approaches allow one to focus on the antibodies most likely to succeed. The need to exploit ‘big data’ and artificial intelligence in the development of biologics has been discussed by Fernández-Quintero et al.[39] and Narayanan et al.[42] A newer method of encoding protein sequences is to use ‘protein language models’.[43] These are deep learning encoders trained on the relationships between residues in a sequence using millions of sequences. The resulting dense numerical representations of sequences may then be used as training data for machine learning models and, over the last few years, have revolutionized predictive methods in all areas of bioinformatics (see Lin et al.,[44] for example). Their power comes from their ability to encode more information, including less obvious features and combinatorial or multi-factor features (e.g., from interaction of amino acids).
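
As a rough illustration of how such representations can feed a downstream model, the sketch below averages per-residue vectors into a single fixed-length vector per sequence (mean pooling), a common way of summarizing language model output. The text above does not state which pooling the authors use, so this is an assumption. The function name `mean_pool` is hypothetical, and the small vectors are made-up numbers standing in for the output of a real protein language model, which would have hundreds or thousands of dimensions.

```python
# Illustrative sketch only: the per-residue vectors below are invented
# stand-ins, not the output of any real protein language model.

def mean_pool(per_residue_vectors):
    """Average L vectors of dimension D into one vector of dimension D."""
    if not per_residue_vectors:
        raise ValueError("no residue vectors supplied")
    dim = len(per_residue_vectors[0])
    if any(len(vec) != dim for vec in per_residue_vectors):
        raise ValueError("all residue vectors must share one dimension")
    count = len(per_residue_vectors)
    return [sum(vec[d] for vec in per_residue_vectors) / count
            for d in range(dim)]


if __name__ == "__main__":
    # Two hypothetical sequences of different lengths (3 and 2 residues),
    # each residue represented by a 4-dimensional vector.
    seq_a = [[1.0, 0.0, 2.0, 4.0], [3.0, 0.0, 2.0, 0.0], [2.0, 3.0, 2.0, 2.0]]
    seq_b = [[0.0, 1.0, 0.0, 1.0], [2.0, 1.0, 4.0, 3.0]]
    # Both pooled vectors have length 4 regardless of sequence length.
    print(mean_pool(seq_a))
    print(mean_pool(seq_b))
```

Because every sequence is reduced to a vector of the same length, sequences of different lengths can be stacked into one feature matrix for a conventional classifier or regressor.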
