Maximilian Gantz†a,
Simon V. Mathis†b,
Friederike E. H. Nintzel†a,
Pietro Liob and
Florian Hollfelder*a
aDepartment of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
bDepartment of Computer Science, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
First published on 23rd April 2024
Protein design and directed evolution have separately contributed enormously to protein engineering. Without being mutually exclusive, the former relies on computation from first principles, while the latter is a combinatorial approach based on chance. Advances in ultrahigh throughput (uHT) screening, next generation sequencing and machine learning may create alternative routes to engineered proteins, where functional information linked to specific sequences is interpreted and extrapolated in silico. In particular, the miniaturisation of functional tests in water-in-oil emulsion droplets with picoliter volumes and their rapid generation and analysis (>1 kHz) allow screening of >10⁷-membered libraries in a day. Subsequently, decoding the selected clones by short or long-read sequencing methods leads to large sequence-function datasets that may allow extrapolation from experimental directed evolution to further improved mutants beyond the observed hits. In this work, we explore experimental strategies for how to draw up ‘fitness landscapes’ in sequence space with uHT droplet microfluidics, review the current state of AI/ML in enzyme engineering and discuss how uHT datasets may be combined with AI/ML to make meaningful predictions and accelerate biocatalyst engineering.
Fig. 1 Profiles of low to ultrahigh throughput experimentation systems (tubes, multiwell liquid handling systems, HT-MEK7 and emulsion droplets8) that may be used as data generation tools for machine learning with their specific benefits and limitations.
Three examples show this technology in action, taking library screening experiments in droplets (Fig. 2) all the way to ‘maps’ of sequence space (Fig. 3). Each example illustrates a distinct workflow (Fig. 2) in which large-scale screening allows information on fitness landscapes to be inferred, generating substantial datasets that may be useful for AI/ML interpretation:
(A) Evolution of an amine dehydrogenase (AmDH), a valuable biocatalyst for the synthesis of chiral amines. Zurek et al.20 screened libraries of AmDH variants generated by error-prone PCR mutagenesis. Libraries were transformed into E. coli for expression. Single cells were encapsulated into droplets with substrates for a coupled assay using the dye WST-1 as a turnover sensor (Fig. 2A). Positive variants were selected based on an absorbance measurement (>10⁵ droplets per hour, but faster systems are now available12,21) and DNA was recovered and sequenced using UMI-linked Oxford Nanopore sequencing (UMIC-seq) to achieve high-quality sequences.
(B) Mutational scanning of a protein kinase involved in signaling networks. The human protein kinase MKK1 is an example of a broad class of phosphate transfer enzymes involved in signalling networks. In order to explore how these evolve, MKK1 (which targets ERK2) was randomised with a focus on six residues in its docking domain (D-domain), which mediates interaction with the downstream kinase ERK, activating its kinase activity. Each MKK1 variant was tested for its ability to bind and phosphorylate ERK2 in a coupled assay (Fig. 2B) exploring a scenario of neutral roaming in sequence space (i.e. a non-adaptive evolution experiment). The library was expressed (using a commercial in vitro transcription/translation system) in a polydisperse emulsion containing monoclonal magnetic beads. This cell free approach alleviates issues that frustrated previous in vivo kinase screens such as cellular background and functional redundancy, while it simultaneously benefits from the robust expression of kinases in an in vitro transcription/translation system. Selections were carried out in polydisperse emulsions (not even necessitating the use of microfluidics) and the gene as well as the substrate (giving a GFP readout when a kinase target sequence was protected by successful kinase action against proteolysis) were immobilized on a bead, so that flow cytometric sorting (FACS) could be used to identify active clones. Using next-generation sequencing (NGS) of the D-domain to calculate enrichment scores, functional combinations of D-domain variants were mapped out.
(C) Identifying promiscuous phosphotriesterases in metagenomic libraries. A metagenomic library with 1.25 million genomic inserts of mixed environmental origins (soil, degraded plant material and cow rumen) was screened using a fluorescent assay reporting on phosphotriesterase activity (Fig. 2C). The brightest 0.001% of droplets were sorted, sequenced using Sanger sequencing and characterised to reveal novel “bridgeheads” in sequence space, which is now functionally annotated in areas where homology-based classification would not have predicted phosphotriesterase activity.
The type of label that can be obtained from a large-scale droplet experiment is highly dependent on the chosen library size and design, the microfluidic workflow, and the choice of the sequencing strategy. NGS offers high enough sequencing depth to generate binned quantitative data (granular enrichment scores) for sequence-function mapping: reporting how often a variant occurs in the input vs. the output library. The technology used for sequencing determines the information content further. Short reads with only up to 600 bp read length (with 2 × 300 paired end sequencing) adequately describe mutational patterns in small proteins26 or functionally defined regions of proteins.25 However, long read sequencing technologies are necessary to reveal long-range epistatic effects in larger proteins. Corresponding datasets can be obtained with PacBio or Oxford Nanopore instruments. Oxford Nanopore sequencing is cheap (<1.1¢ per sequence)20 and can be carried out in any lab at low cost, while the capital expenditure for a PacBio (250000$ for PacBio vs. <1000$ for a MinION device) makes this impractical. The two technologies differ in their read quality, with PacBio giving high quality reads at single nucleotide resolution. Oxford Nanopore devices suffer from high error rates and are unable to pin-point single nucleotide mutations, but a workaround – consisting of UMI (unique molecular identifier) labelling followed by clonal amplification and consensus generation from multiple sequences (that are tagged by the same UMI)20 – exists to produce high quality sequences of even single amino acid mutants. 
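The UMI workaround described above (clonal amplification followed by consensus generation from reads sharing one UMI) can be illustrated with a minimal sketch. It assumes that reads have already been grouped by UMI and aligned to equal length; real pipelines have to cluster noisy UMIs and align reads first. The function name, the `min_reads` cut-off and the toy reads are hypothetical and not taken from ref. 20.

```python
from collections import Counter


def umi_consensus(reads_by_umi, min_reads=3):
    """Majority-vote consensus per UMI from pre-aligned, equal-length reads."""
    consensus = {}
    for umi, reads in reads_by_umi.items():
        if len(reads) < min_reads:
            continue  # too few reads to outvote random sequencing errors
        # zip(*reads) walks through the aligned reads column by column
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*reads)
        )
    return consensus


# Toy data (hypothetical): UMI_1 carries two independent single-read errors
toy_reads = {
    "UMI_1": ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAA"],
    "UMI_2": ["TTGACA", "TTGACA"],  # below min_reads, therefore skipped
}
consensus_by_umi = umi_consensus(toy_reads)
```

Because errors in individual nanopore reads are largely independent, they rarely agree at the same position, so the per-column majority recovers the underlying sequence once enough reads per UMI are available.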
While short-read NGS can be used to generate binned quantitative data with granular enrichment scores, long-read sequencing technologies operate at lower scale (90 Gb for PacBio, 50–110 Gb for Oxford Nanopore compared to up to 3000 Gb with Illumina sequencing) and are currently limited to the generation of binary data on variant identification per round of selection in directed evolution (but see ref. 58 for a new long-read approach employing Oxford Nanopore devices).
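An enrichment score of the kind mentioned above compares how often a variant occurs in the output (sorted) library with how often it occurs in the input library. The following sketch shows one common way to compute it, as a log2 ratio of frequencies with a pseudocount; the pseudocount value, the function name and the six-residue toy variants are assumptions for illustration and do not reproduce the analysis of any study cited here.

```python
import math


def enrichment_scores(input_counts, output_counts, pseudocount=0.5):
    """log2 ratio of variant frequency after vs. before selection."""
    variants = set(input_counts) | set(output_counts)
    # The pseudocount keeps variants with zero reads in one library finite
    n_in = sum(input_counts.values()) + pseudocount * len(variants)
    n_out = sum(output_counts.values()) + pseudocount * len(variants)
    scores = {}
    for variant in variants:
        f_in = (input_counts.get(variant, 0) + pseudocount) / n_in
        f_out = (output_counts.get(variant, 0) + pseudocount) / n_out
        scores[variant] = math.log2(f_out / f_in)
    return scores


# Toy read counts (hypothetical) for three variants of a six-residue region
toy_input = {"LKRRPL": 100, "AKRRPL": 100, "LKAAPL": 100}
toy_output = {"LKRRPL": 250, "AKRRPL": 45, "LKAAPL": 5}
toy_scores = enrichment_scores(toy_input, toy_output)
```

Positive scores indicate variants enriched by the selection, negative scores depleted ones. Sorting into several activity bins, as in the kinase example, yields one such score per bin.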
Each of the three studies reviewed here (Fig. 3) uses different experimental designs, so the sequencing strategies are correspondingly different, but all three arrive at representations of hits in sequence space that can be interpreted as fitness landscapes:
(A) AmDH screening (Fig. 3A). In AmDH evolution, long-read Nanopore sequencing (in a commercial MinION flow cell; Oxford Nanopore) was used to sequence 3000 hits with an activity higher than the threshold chosen for screening. A crucial accuracy improvement is achieved by tagging variants with unique molecular identifiers (UMIs): these are then amplified clonally, multiple nanopore sequences are generated and finally evaluated by deriving a consensus from many reads per amplified variant. In this way the sequencing accuracy was dramatically increased to >99.99%. The improved accuracy for cost-efficient long-read nanopore sequencing is crucial for confidently resolving multiple mutations per variant and thus mapping evolutionary trajectories. The resulting dataset gives a fitness landscape shown in Fig. 3A that illustrates the evolution of a functional protein through three generations of ultrahigh throughput screening in directed evolution, in which the 3000 best hits of 250000 variants were sorted and sequenced. The apparent clustering reveals intra-gene cooperativity of mutations (epistasis), for which accurate long-read sequencing was necessary, and provided experimental evidence for sign epistasis. Information from multiple rounds of directed evolution constitutes a dataset conditioned by the combinability of mutations. The analysis of evolutionary trajectories in this way helps to extract features for further labelling and reconstructing or extrapolating functional evolution. Such features will be identified by their acquisition and conservation through rounds of evolution and may include residues with a catalytic function (located near the active site), but also residues enhancing solubility (conferred by residues on the outside of the protein), stability (e.g. residues allowing improved packing or better hydrophobic interactions in the core of a globular protein), introduction of conformational flexibility or disorder (e.g. in order to facilitate recognition of new substrates or remove steric clashes) and finally patterns of the aforementioned epistatic interactions (i.e. long-range interactions between often distant residues).
(B) Kinase screening (Fig. 3B). The narrow focus on the well-known docking domain (D-domain) of kinases made it possible to use the short reads provided by Illumina sequencing to draw up a fitness landscape. A starting library of 500000 mutants was generated from randomising six residues in the MKK1 docking domain (synthesised on beads by split-and-mix assembly, with high quality and equal representation of nucleotides27). Library members were sorted into three bins according to activity. 2.9 × 10⁴ MKK1 variants are functional, providing a rich dataset to explore cooperativity between the different randomised positions. Enrichment analyses identified patterns of interdependence between the randomized positions, highlighting the role of cooperative hydrophobic effects and charge balance. Taken together, the patterns are displayed in a fitness landscape in which transitions from one sequence motif to another are generally possible. Many well-connected variants capable of substrate binding and phosphorylation suggest high evolvability. The extensive well-labeled sequence dataset (Fig. 3B) carries information about implicit positive epistasis and may be further interpretable by ML in the future.
(C) Triesterase screening (Fig. 3C). Screening of a metagenomic library (in binary mode for overcoming a phosphotriesterase activity threshold) yielded 8 hits, the majority of which had not been recognized as phosphotriesterases before. These new enzymes will constitute bridgeheads in sequence space for further annotation, being selected for function rather than found by sequence homology. New functional motifs were recognized, e.g. an α/β hydrolase fold in which a catalytic triad (with a cysteine nucleophile) served as a multiple-turnover catalyst, despite its similarity to the target of phosphotriester toxins: an active-site catalytic triad (containing serine) that is suicide-inhibited by the triester. Newly identified enzymes from this approach will be useful as a binary activity label for ML-based functional annotation to further annotate sequences in large metagenomic databases such as MGnify.28
Fig. 3 Functional annotation of sequence space. (A) Exploring productive trajectories on the fitness landscape of an amine dehydrogenase in three rounds of directed evolution.20 (B) Scanning the fitness landscape of a short kinase docking domain (D-domain) with increasing thresholds for comprehensive epistasis mapping.25 (C) Identifying islands of sulfatase and phosphotriesterase function in an unexplored landscape through functional metagenomics.23
The three campaigns provide examples of sequence space explorations in which the experimental design and selection criterion shape both the area of sequence space that is explored and the functional readout that ultimately completes a fitness landscape by adding a third, functional dimension to sequence space (as represented by two notional dimensions).
(A) The kinase MKK1 campaign produces a dataset (Fig. 3B) focused on the small fraction of sequence space represented by docking domain mutagenesis and functionally annotated with granular enrichment scores that map a smooth fitness landscape with many overlapping functional motifs.
(B) The data on AmDH (Fig. 3A)20 covers mutations across the entire protein (being derived from an epPCR library) and thus samples a larger area of sequence space. The dataset can be interpreted as an exploration of sequence space in all directions, as long as the selection criterion of increasing AmDH activity is fulfilled (measured by a binary assay). The resulting fitness landscape is more complex and shaped by long-range epistatic effects that define founder mutations, with considerable ‘ruggedness’ of the fitness landscape (resulting in some mutational paths closed off due to sign epistasis), but also with evidence for positive epistasis across the protein structure (where the combined effect of two mutations can be larger than the sum of their individual contributions). Ruggedness in the fitness landscape with fewer paths for evolution suggests that transitions are more difficult and the evolvability potentially reduced, due to the intrinsic response of this protein to mutations.
(C) Finally, the sequence context in which new phosphotriesterases are found is much broader (Fig. 3C),23 starting from a diverse metagenomic library (rather than a randomised single protein) and identifying peaks only in a binary screen. Additional surrounding sequences can be derived from sequence repositories, but as their function is inferred rather than tested, no inference about the shape of a fitness landscape can be made: it is simply annotated.
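The terms used for the AmDH landscape can be made concrete with a small worked example. The sketch below classifies a pair of mutations from four fitness values (wild type, two single mutants, double mutant). It assumes fitness is expressed on an additive scale, e.g. log-transformed activity, so that "the sum of individual contributions" is meaningful; the function name and the numbers are hypothetical.

```python
def classify_epistasis(f_wt, f_a, f_b, f_ab, tol=1e-9):
    """Classify a mutation pair from fitness values on an additive (e.g. log) scale."""
    d_a = f_a - f_wt        # effect of mutation A in the wild-type background
    d_b = f_b - f_wt        # effect of mutation B in the wild-type background
    d_a_in_b = f_ab - f_b   # effect of A when B is already present
    d_b_in_a = f_ab - f_a   # effect of B when A is already present
    epsilon = (f_ab - f_wt) - (d_a + d_b)  # deviation from additivity

    # Sign epistasis: a mutation is beneficial in one background, deleterious in the other
    if d_a * d_a_in_b < 0 or d_b * d_b_in_a < 0:
        kind = "sign"
    elif epsilon > tol:
        kind = "positive"   # double mutant better than the sum of the singles
    elif epsilon < -tol:
        kind = "negative"   # double mutant worse than the sum of the singles
    else:
        kind = "additive"
    return epsilon, kind


# Hypothetical log-scale fitness values relative to a wild type at 0
synergy = classify_epistasis(0.0, 1.0, 1.0, 3.0)
founder = classify_epistasis(0.0, -1.0, 1.0, 2.0)
```

In the second case mutation A is deleterious on its own but beneficial once B is present, so the path via A alone is closed off while the path via B remains open. This is the kind of pattern that makes a landscape rugged and that gives a 'founder mutation' its role.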
Interpreting large sequence collections rather than individual single mutants (e.g. a ‘winner’ of a selection or screening experiment) may offer additional insight. It is tempting to hope that the data can be used to reliably extrapolate from experimentally characterized variants and predict new ones with higher fitness. Cooperative epistatic effects define an evolutionary trajectory and may be inferred from information on groups of mutants (either as long-range intra-gene effects in ‘founder mutants’ of AmDH20 or as short-range effects focused on the MKK1 kinase D-domain25), and their analysis may allow predictions.29 Even for metagenomic explorations,23,30 functionally annotated data can be the basis of prediction.
To discuss the interface between ultrahigh-throughput experiments and AI, we must understand the AI enzyme engineering landscape (Tables 1–3).40,44–56 AI models differ in the extent to which they rely on rules derived from prior knowledge or autonomously identify statistical patterns in data without user input. A useful distinction can be made between expert systems, which make decisions based on rules drawn up by a human expert (e.g. GRAVY hydrophobicity33 or BLOSUM substitution34), and machine learning, an umbrella term for techniques that do not rely on such rules but instead derive them from data (the “learning” aspect in machine learning, e.g. linear regression, random forest, etc.). Deep learning is a subclass of machine learning and is loosely distinguished from general machine learning by its large number of learnable parameters, often of a similar or larger order of magnitude than the number of available datapoints. Many contemporary neural network approaches, such as transformers35 (the main component of modern language models36), AlphaFold2 (ref. 31) and convolutional networks,37 belong to this category. The amount of data available is a first criterion in the choice of a model, with deep learning approaches being more data hungry, while general machine learning techniques can cope with fewer data points. The parameters of these models are then tuned in one or more ‘training’ steps.
| Enzyme (reaction) | Library screening | Library type | Data type | Total data points | Data points used for ML | Improvement (top variant) | % success (>wt) | Ref. |
|---|---|---|---|---|---|---|---|---|
| Imine reductase (IRED) | Robotic screening (plate) | 20 random singlesᵃ | Conversion | 11303 | 20* | Specific activity: ∼wt; conversion: 4.6-fold | n.a.ᵇ | 44 |
| | | Mixed (singles + 1 epPCR round)ᵃ | | | ∼5000 | Specific activity: 1.3-fold; conversion: 8.3-fold | | |
| | | Mixed (singles + 2 epPCR rounds)ᵃ | | | ∼8000 | Specific activity: 1.3-fold; conversion: 7.1-fold | | |
| Imine reductase (IRED) | Microfluidics | epPCR | Conversion | 17143 | 10860 | kcat/KM: 16-fold; kcat: 23-fold | 70% | 58 |
| Glucose oxidase | Spectrophotometer (plate) | Focused – from a previous campaign | Michaelis–Menten | 16ᶜ | 16ᶜ | kcat/KM: 12.1-foldᶜ; kcat: 4.8-fold | One variant | 46 |
| Halogenase | LC-MS (plate) | Focused (3 sites) | Conversion | 504 | 504 | Conversion: 16-fold; kcat/KM: 82-fold; kcat: 93-fold | 100% | 47 |
| Hydroxylase XylM | Biosensor (plate) | Focused (5 sites) | Sensor coupled to fluorescent protein | Round 1: 126; Round 2: 126 + 50 | Round 1: 126; Round 2: 126 + 50 | Yield: 15-fold | Sensor: 94% | 48 |
| Nitric oxide dioxygenase | Plate assay | Focused | Enantiomeric excess | Round 1: 124; Round 2: 155/166 | Round 1: 124; Round 2: 155/166 | Lysate activity: 3.2-fold; e.e.: 1.2-fold and reversedᵈ | n.d. 360 predictions | 49 |
| Luciferase | Bioluminescence (plate) | Focused (non-cons regions) | Bioluminescence | 164 | 164 | Specific activity: 7.8-fold | 72% (26/36) | 50 |
| Beta lactamase | Antibiotic resistance | Error-prone PCR | Antibiotic resistance | 96 and 24 | 96 and 24 | Enrichment up to ∼40-fold vs wildtype | 2.5% | 53 |

ᵃ Training data heavily biased towards single mutations. A more sophisticated structure-guided model that is less biased towards single mutation data is also presented and shows similar improvements in conversion, but no specific activity is reported.
ᵇ Possible to express 168/200 ordered double/triple mutants.
ᶜ Training data: (9 single mutants + 7 higher order combinations of those 9 singles from a previous DE campaign); improvement for same pH as training data (claimed 121-fold improvement at different pH).
ᵈ Two variants engineered, (S)- and (R)-specific: 93% ee/79% ee for (S)/(R) respectively, starting from 76% ee (S).
| Enzyme (reaction) | Model type | Training regime | Usage regime | Design space | Target property | Improvement (top variant) | % success (>wt) | Ref. |
|---|---|---|---|---|---|---|---|---|
| Imine reductase (IRED) | Random forest | Supervisedᵈ | Assay supervised | Specific regionᵇ | Specific activity & conversion | Specific activity: ∼wt; conversion: 4.6-fold | n.a.ᶜ | 44 |
| | Random forest | | | Specific protein | | Specific activity: 1.3-fold; conversion: 8.3-fold | | |
| | Random forest & structure-informed | | | Specific proteinᵇ | | Specific activity: 1.3-fold; conversion: 7.1-fold | | |
| Imine reductase (IRED) | Augmented ridge regression & decision tree with rational engineering | Supervised | Assay supervised | Entire protein | kcat and kcat/KM | kcat/KM: 16-fold; kcat: 23-fold | 70% | 58 |
| Glucose oxidase | Machine learning (partial least squares) | Supervised | Assay supervised | Specific region | Michaelis–Menten | kcat/KM: 12.1-foldᵉ; kcat: 4.8-fold | One variant | 46 |
| Halogenase | Machine learning (Gaussian process) | Supervised | Assay supervised | Specific regionᶠ | Conversion | Conversion: 16-fold; kcat/KM: 82-fold; kcat: 93-fold | 100% | 47 |
| Hydroxylase XylM | Machine learning & deep learningᵍ | Supervised & self-supervised | Assay supervised & assay aligned | Specific regionᵍ | Sensor/yield | Yield: 15-fold | Sensor: 94% | 48 |
| Nitric oxide dioxygenase | Machine learningᵃ | Supervised | Assay supervisedⁱ | Specific region | Lysate activity and stereoselectivity | Lysate activity: 3.2-fold; e.e.: 1.2-fold and reversedʰ | n.d. 360 predictions | 49 |
| Luciferase | Machine learning (Gaussian process & self-play reinforcement learning) | Supervised | Assay supervised | Specific region | Bioluminescence | Specific activity: 7.8-fold | 72% (26/36) | 50 |
| Beta lactamase | Deep learning (LSTM language model) | Self-supervised | Assay aligned | Specific region | Enrichment under Amp selection | Enrichment up to ∼40-fold vs wildtype | 2.5% | 53 |

ᵃ Starting with a panel of models from scikit-learn, the top three model types were selected and used to identify the top 1000 sequences in each predicted library.
ᵇ Presumably doubles/triples of the 20 input singles were considered.
ᶜ Possible to express 168/200 ordered double/triple mutants.
ᵈ Random forest on UniRep 1900 descriptors. Note: UniRep1900 is in principle a self-supervised trained language model, so it could be argued the training regime was supervised + self-supervised and the usage regime was assay aligned rather than assay supervised.
ᵉ Training data: (9 single mutants + 7 higher order combinations of those 9 singles from a previous DE campaign); improvement for same pH as training data (claimed 121-fold improvement at different pH).
ᶠ The selection of these 3 sites was based on (1) docking studies with the structure, (2) previously published literature results and (3) previous knowledge of the enzyme. This is not trivial to replicate for any enzyme.
ᵍ This study used 2 models: a more shallow machine learning based one and a deep learning based one. Specific region: 5 sites determined via alanine scan; 50 variants were tested in each round.
ʰ Two variants engineered, (S)- and (R)-specific: 93% ee/79% ee for (S)/(R) respectively, starting from 76% ee (S).
ⁱ Two rounds of evolution performed, while most other studies listed here perform one round.
| Enzyme (reaction) | Model type | Training regime | Usage regime | Design space | Target property | Improvement (top variant) | % success (>wt) | Ref. |
|---|---|---|---|---|---|---|---|---|
| Malate dehydrogenase | Deep learning (protein GAN) | Self-supervised | Zero-shotᵈ | Class of proteins | Specific activity | Wild-type like specific activity | 22%ᵃ | 45 |
| Methyltransferase | Deep learning (MutComputeX) | Self-supervised & supervised | Zero-shot | Specific protein | Product titer | Conversion: 1.6-foldᶜ | n.d. | 51 |
| Beta lactamase | Deep learning (MutCompute) | Self-supervised | Zero-shot | Specific protein | BLA activity | Antibiotic resistance >wt, no quant measurement | 30% | 52 |
| TEV protease | Deep learning (ProteinMPNN) | Self-supervised | Zero-shot | Specific region | Fluorogenic substrate | kcat/KM: 26-fold (but mainly tied to solubility/thermostability) | 3 out of 144 designs | 54 |
| PETase | Deep learning (MutCompute) | Self-supervised | Zero-shot | Specific protein | PET hydrolysis activity | Specific activity: 29-fold | 80% | 40 |
| Endonuclease (Ago proteins – KmAgo) | Deep learning (CPDiffusion) | Self-supervised (on family-focussed dataset) | Zero-shot | Specific proteinᵇ | ssDNA cleavage assay | DNA cleavage activity: up to 8.6-fold | 75% | 55 |
| Lysozyme | Deep learning (ProGen language model) | Self-supervised | Zero-shot | Class of proteins | Michaelis–Menten kinetics | Wildtype-like activity | n.a. | 56 |

ᵃ No wild type comparison available, % active variants used, all ordered (not only soluble) enzymes considered; engineering for pH stability.
ᵇ Based on endonuclease structure and sequence conservation data.
ᶜ Direct AI prediction is a single mutant (A53M) leading to 3-fold reduced side product formation, which was then combined with other predictions (rationally/assuming additivity) to get their 17-fold reduced off product formation.
ᵈ Self-supervised: masked AA prediction in microenvironment; supervised: model selected based on correlation of zero-shot fitness with DeltaTM of single mutants in FireProtDB.
The training steps determine the data used and how it informs the model’s parameters. We distinguish between pre-training steps, which use general data such as the observed sequences on UniProt or general thermostability annotations from FireProtDB,38 and assay-specific training, which uses data from the targeted assay. A pre-training step may be self-supervised or supervised. In the self-supervised mode, only sequence or structure is available, while a functional label, e.g. an activity measurement, is absent. Instead of functional labels, “pseudo-labels” unrelated to function are created by masking parts of the sequence or structure and predicting the amino acids that should occupy the masked positions. This approach is called “self” supervised, because the labels are generated from the datapoint itself, through a masking process. This pre-training mode is used e.g. for protein language models39 and also for methods that take the structural environment into account.40,41 By integrating this information, the model learns to pick up on common sequence or structural motifs. Alternatively, when we have access to experimental mapping of sequence to function or a relevant proxy, a model may be pre-trained in a supervised way given the annotation. In contrast to general pre-training, assay-specific training requires labels from the assay of interest and is therefore only possible in a supervised mode.
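How pseudo-labels arise from the datapoint itself can be shown in a few lines. The sketch below masks random positions of a sequence and keeps the hidden residues as labels. The mask character, the 15% masking fraction, the function name and the toy sequence are assumptions for illustration; individual language models differ in these details.

```python
import random

MASK = "#"  # placeholder; real models use a dedicated mask token


def make_masked_example(sequence, mask_fraction=0.15, rng=None):
    """Return (masked sequence, {position: hidden residue}) for self-supervised training."""
    rng = rng or random.Random()
    n_mask = max(1, round(mask_fraction * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    labels = {}
    for pos in positions:
        labels[pos] = sequence[pos]  # pseudo-label: the residue the model must recover
        masked[pos] = MASK
    return "".join(masked), labels


toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence, not a real enzyme
masked_sequence, pseudo_labels = make_masked_example(toy_sequence, rng=random.Random(0))
```

No activity measurement enters at any point: the training signal is only whether the model recovers the residues in `pseudo_labels` from the remaining context in `masked_sequence`.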
Pre-training steps (self-supervised or supervised) and assay-specific training can be combined in one workflow. The combinations of pre-training and assay-specific training give rise to three broad usage regimes for a model to predict a target property (or generate a sequence with a desired target property value) that is probed by a specific assay run in the lab:
(i) Zero-shot: in this case a model is only pre-trained on general data and is used “as is”, without supervised training on any assay-labelled data, to predict a target property. For example, a language model (such as ESM) might be trained through self-supervision (sequence masking) on all sequences observed in UniProt, and subsequently used in a “zero-shot” way by evaluating the probability that ESM assigns to a sequence containing a given mutation vs. the probability of the wildtype sequence. This assumes that the target property correlates with the self-supervision task that was used during training (e.g. thermostability, because ‘natural’ motifs in UniProt must be at least marginally thermostable to be observed in living organisms). As another example, we might pre-train a linear regression model “supervised” on cDNA display proteolysis data42 from general proteins, and then task the model to predict thermostability of our target protein “as is” (zero-shot).
(ii) Assay aligned (also referred to as ‘transfer learning’ or ‘task-specific fine-tuning’ in the ML community): in the assay aligned regime, a model that was previously trained (=“pre-trained model”) on general data through self-supervision or supervision is “aligned” to the assay-specific data through additional supervised training on a (commonly smaller) assay-specific dataset. For instance, this may be achieved using the same model (e.g. ESM) and updating its parameters slightly based on the assay-labelled sequence-to-function data (‘fine-tuning’). As another example, one may use another model which uses representations or outputs from the pre-trained model as some of its inputs and train it on the assay-labelled data (‘feature extraction’). This process is illustrated for example in Hsu et al.,43 where the output of ESM is used as input to a smaller linear regression. In essence, “assay aligned” usage takes an existing pre-trained model and trains it further with assay-specific data. The loose idea is that this allows “motifs” and “patterns” that can efficiently be represented by the pre-trained model to be “re-mapped” to the assay data and thereby better extract which motifs might improve or decrease the targeted property.
(iii) Assay supervised: in this case the given model is trained in a supervised way directly on assay data without pre-training on other data. Since the amount of available assay data is often very low, the types of models in this approach tend to be general machine learning models (not deep-learning models).
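The zero-shot comparison of mutant and wildtype probabilities in (i) can be sketched as a log-likelihood ratio. The example assumes a table of per-position amino acid probabilities standing in for the output of a pre-trained model, and treats positions as independent so that only mutated positions contribute. The table values, the function name and the three-residue sequence are hypothetical; a real language model such as ESM would be queried through its own interface, which is not shown here.

```python
import math


def zero_shot_score(wildtype, mutant, position_probs):
    """Log-likelihood ratio log P(mutant) - log P(wildtype).

    position_probs[i][aa] is the probability that a (hypothetical) pre-trained
    model assigns to amino acid aa at position i.
    """
    if len(wildtype) != len(mutant):
        raise ValueError("substitutions only: sequences must have equal length")
    score = 0.0
    for i, (wt_aa, mut_aa) in enumerate(zip(wildtype, mutant)):
        if wt_aa != mut_aa:
            # Unmutated positions cancel in the ratio under the independence assumption
            score += math.log(position_probs[i][mut_aa]) - math.log(position_probs[i][wt_aa])
    return score


# Toy probability table (hypothetical) for a three-residue sequence
toy_probs = [
    {"M": 0.90, "L": 0.05, "V": 0.05},
    {"K": 0.40, "R": 0.50, "E": 0.10},
    {"T": 0.60, "S": 0.30, "P": 0.10},
]
favoured = zero_shot_score("MKT", "MRT", toy_probs)
disfavoured = zero_shot_score("MKT", "MKP", toy_probs)
```

A positive score means the model finds the mutant more 'natural' than the wildtype. Whether that translates into a better enzyme depends on how well the target property correlates with the pre-training task, which is exactly the assumption stated in (i). In the assay aligned regime (ii), such scores or the model's internal representations would instead serve as input features for a small supervised model trained on assay data.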
The functional coordinate defined by the assay determines the target property that is to be predicted, e.g. thermostability, solubility and expression, enzyme activity and cumulative characteristics (i.e. a mixed set of properties including general fitness, growth rate in the presence of antibiotic or lysate activity).
Finally, design space restrictions can be incorporated, e.g. by explicitly restricting options based on expert knowledge, such as evolutionary or structural data at the following levels: (a) assignment to a specific class of proteins, e.g. an EC category or a particular fold; (b) sequences derived from a specific protein: starting from the WT sequence, improvements in the target property are sought by mutating any position or combination of positions in the wildtype enzyme; (c) specific regions of a starting protein are considered preferentially, e.g. mutations in a subregion of the wildtype defined by an evolutionary conservation threshold from an MSA, expert knowledge of key positions or an enzyme structure.
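Level (c) can be illustrated with a minimal sketch that derives a restricted design space from an MSA: positions are kept fixed if they are listed as active-site residues or if the most frequent residue in their alignment column exceeds a conservation threshold. The threshold of 0.8, the function name and the toy alignment are assumptions for illustration only.

```python
from collections import Counter


def designable_positions(msa, active_site=(), max_conservation=0.8):
    """Return 0-based positions open to redesign.

    A position stays fixed if it is in active_site or if the most common residue
    in its alignment column occurs in more than max_conservation of all sequences.
    """
    n_seqs = len(msa)
    fixed = set(active_site)
    designable = []
    for i, column in enumerate(zip(*msa)):
        residues = [aa for aa in column if aa != "-"]  # ignore alignment gaps
        if residues:
            conservation = Counter(residues).most_common(1)[0][1] / n_seqs
        else:
            conservation = 0.0  # gap-only column: no evidence of conservation
        if i not in fixed and conservation <= max_conservation:
            designable.append(i)
    return designable


# Toy alignment (hypothetical): columns 0, 1 and 3 are fully conserved
toy_msa = [
    "MHSGA",
    "MHTGV",
    "MHSGL",
    "MHAGI",
    "MHSG-",
]
open_positions = designable_positions(toy_msa)
```

The resulting list of positions is the design space handed to a model; everything else is carried over unchanged from the wildtype sequence.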
Four groups of common workflows have been tested experimentally (see Tables 1–3) and can be characterized by their primary variations in usage regime and design space (Fig. 4B). We classify these as zero-shot approaches with focused (ZSF) or broad design space (ZSB) on the one hand, and, on the other hand, assay-labelled regimes with focused (ALF) or broad design space (ALB). Assay-labelled regimes with focused design space are usually informed by data from focused libraries targeting selected positions or regions in the protein only, in contrast to modes with a broader design space which, among others, include random mutagenesis (e.g. by error-prone PCR) across the entire protein.
(i) MutCompute. MutCompute is a deep learning approach (3D convolutional network) that was pre-trained in a self-supervised way based on structures in the Protein Data Bank (PDB), by masking out amino acids in a given structure and predicting the identity of the masked amino acid based on the local context (a structural microenvironment defined by a 20 Å cube centered around the masked amino acid). MutCompute was successfully applied to the improvement of a plastic-degrading PETase by Lu et al.40 in zero-shot mode, coming up with 159 variants that were experimentally tested. Combinability studies of the best mutations from this panel yielded FAST-PETase, improved by more than an order of magnitude. Enhancements are larger at higher temperatures, suggesting that temperature adaptation is the main source of catalytic improvement. Additionally, MutCompute was successfully applied to a methyltransferase51 and a β-lactamase.52
(ii) ProteinMPNN (Fig. 4A). ProteinMPNN is another deep learning model (graph neural network) originally created for sequence-redesign given a backbone structure. It is pre-trained in a self-supervised mode by ‘deleting’ the side-chain and amino acid information in a given structure and then re-predicting the correct sequence – position by position (autoregressively) – based only on the backbone and Cβ coordinates, as well as the amino acid types that it already predicted.41 At usage time, a wildtype backbone structure, and optionally the amino acid types for a few fixed positions in the sequence, can be used as input and the remaining sequence is re-designed to fold into that target backbone. ProteinMPNN’s pre-training has been shown to correlate with solubility and thermostability41 (Fig. 4A). The rationale is that ProteinMPNN’s pre-training was based on general protein structures in which certain backbone fragments and motifs re-appear with slightly varied amino acids, such that for a given backbone fragment plausible (but diverse) amino acids are inferred at usage time. Since ProteinMPNN has been trained on structures in the PDB, which predominantly come from crystals and therefore need to be at least modestly stable and soluble, it is thought to predict stable and soluble solutions. Existing protein structures are biased towards these properties simply by virtue of being stable enough to be observed.
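The structural microenvironment used for MutCompute's masking task in (i) can be sketched as a simple geometric selection: all atoms inside a 20 Å cube around the masked residue, with the masked residue itself hidden. The sketch uses an axis-aligned cube and hides every atom of the masked residue; both choices, as well as the function name and the toy coordinates, are simplifying assumptions and are not taken from the MutCompute implementation.

```python
def microenvironment(atoms, center, masked_residue, edge=20.0):
    """Select atoms inside an axis-aligned cube around a masked residue.

    atoms: list of (residue_id, atom_name, (x, y, z)) with coordinates in Å.
    center: (x, y, z) of the cube center, e.g. an atom of the masked residue.
    """
    half = edge / 2.0
    selected = []
    for res_id, atom_name, xyz in atoms:
        if res_id == masked_residue:
            continue  # hide the residue whose identity is to be predicted
        if all(abs(c - c0) <= half for c, c0 in zip(xyz, center)):
            selected.append((res_id, atom_name, xyz))
    return selected


# Toy coordinates (hypothetical), cube centered on residue 10
toy_atoms = [
    (10, "CA", (0.0, 0.0, 0.0)),    # masked residue, excluded
    (11, "CA", (3.8, 0.0, 0.0)),    # sequence neighbour, inside
    (42, "CB", (9.5, -9.5, 2.0)),   # inside: within 10 Å along every axis
    (57, "CA", (10.5, 0.0, 0.0)),   # outside along x
    (63, "CA", (8.0, 8.0, 8.0)),    # inside the cube, although about 13.9 Å from the center
]
environment = microenvironment(toy_atoms, center=(0.0, 0.0, 0.0), masked_residue=10)
```

The selected atoms would then be converted into the 3D input grid of the convolutional network, with the identity of the masked amino acid serving as the pseudo-label.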
A successful zero-shot application of ProteinMPNN for enzyme engineering is the work of Sumida et al.,54 who improved the solubility and stability of TEV protease. In order not to disturb the functionally relevant constituents of the protein, evolutionarily conserved and active site residues were exempted from randomization (Fig. 5A). 129/144 designs exhibited higher levels of soluble expression than the starting point and 64/144 designs showed some activity with a model substrate. The top three designs were further characterised on the model substrate and all showed higher catalytic efficiencies than the parent (up to 26-fold improvements), and the top hit (hyperTEV60) shows a 40 °C increase in melting temperature (Tm). At 30 °C, hyperTEV60 retains 90% of its activity over 4 h, while the parent enzyme only retains 15% activity (Fig. 5B). These observations are consistent with the studies involving MutCompute,40 namely that biophysical robustness brings about an increased ability to form product. Observing an effect on reaction kinetics (with the actual native protease substrate) would provide more direct evidence for transition state stabilization (as opposed to improving the availability of a “competent state”, either by increased Tm or backbone rigidification).
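How such a restricted design space might be written down is sketched below. The per-position conservation scores, the cutoff of 0.9 and the toy sequence are hypothetical values chosen for illustration and are not taken from Sumida et al.; only the principle (conserved and active site residues stay fixed, the rest is open to redesign) follows the text.

```python
from typing import Dict, List, Set


def redesignable_positions(
    conservation: List[float],
    active_site: Set[int],
    conservation_cutoff: float = 0.9,  # hypothetical threshold
) -> List[int]:
    """Return 0-based positions that are neither active-site nor conserved."""
    return [
        i
        for i, score in enumerate(conservation)
        if i not in active_site and score < conservation_cutoff
    ]


def fixed_positions(sequence: str, redesignable: List[int]) -> Dict[int, str]:
    """Map every position that must keep its parent residue to that residue."""
    free = set(redesignable)
    return {i: aa for i, aa in enumerate(sequence) if i not in free}


# Toy parent sequence with made-up conservation scores (0 = variable, 1 = invariant).
parent = "MHCDKG"
conservation = [0.2, 0.95, 0.5, 0.99, 0.3, 0.1]
active_site = {2}

free = redesignable_positions(conservation, active_site)
fixed = fixed_positions(parent, free)
print(free)   # positions handed to the sequence design model
print(fixed)  # positions kept as in the parent
```

The resulting `fixed` mapping is the kind of constraint a sequence design model would receive as input, alongside the backbone structure.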
Fig. 5 Machine learning informed engineering of a TEV protease54 and a halogenase.47 (A) Design strategy for TEV protease engineering. Based on structural and evolutionary constraints as input, the design space was defined by fixing the amino acid identities of the active site residues and conserved residues. ProteinMPNN was used to redesign the remaining residues and generate the designed sequences as output. (B) Stability assay. The best design hyperTEV60 shows improved benchtop stability compared to the native TEVd when incubated at 30 °C over time. (C) Identification of engineering sites for WelO5* halogenase. The target substrate soraphen A was docked into WelO5* and three positions were chosen for generating a full randomization library. (D) Activity assays for WelO5* variants. Hits from the combinatorial library (red) and from the ML predictions (green and blue) were tested in biotransformations with cell lysate. Results are displayed as fold increase compared to the parent GAP. The best hit in the combinatorial screen was SLP and the best hit in the ML predictions was VLA. |
The studies provide evidence that, when used in a focussed zero-shot way, models such as MutCompute and ProteinMPNN can yield catalysts able to generate more reaction product. While biophysical characteristics are improved, the current data is less clear on improvements to the catalytic machinery. It is possible that the emphasis on stability in the pre-training data for self-supervision, which is from the general PDB and may not contain much signal on catalytic proficiency, is responsible for generating proteins mainly improved in structural integrity or solubility. If this is so, then initially unstable proteins should benefit most from these approaches and would make promising targets for ZSF machine learning approaches, although other excellent stability-enhancing algorithms already exist.57 However, such an approach will miss out on potentially destabilizing mutations that may nevertheless be crucial for catalytic activation. Mutations at sites in the protein that were often deliberately excluded in these models (first shell residues, conserved residues) will not be suggested. This conservative bias in the designs may decrease the chance of finding designs with improved catalysis, and may be overcome by feeding data on directed evolution trajectories (e.g. from droplet screens) into the algorithms. Higher throughput data from catalytic selections (e.g. in microdroplets) may enhance the value of models currently used in ZSF packages. It remains to be seen whether learning input from comprehensive activity screens (Fig. 3) would give less conservative solutions, overcoming a possible learning bias from the preponderance of stable structures in the training data, and allow better extrapolation towards solutions for catalysis beyond the conditio sine qua non of stability.
‘Smart’ libraries limit the design space to a few randomized residues that can be oversampled, but rely on a reductionist model of protein function that might not reflect reality: mutations far away from the active site and unknown hotspots often play unanticipated roles and proteins are typically cooperative (highlighted by the relevance of intra-gene epistasis). ML approaches will play a key role in uncovering these complex higher order phenomena that are often overlooked in traditional experiments. Rather than deep and focused sampling, broad and unbiased coverage of sequence space may provide more valuable input data for such ML endeavors.
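The trade-off between oversampling a small library and sparsely covering a large one can be made concrete with a short calculation. The sketch assumes that all variants are equally frequent at the amino acid level, which real libraries only approximate (degenerate codons skew the distribution); the function names are illustrative.

```python
import math


def library_size(n_positions: int, alphabet: int = 20) -> int:
    """Number of variants when n_positions are fully randomised."""
    return alphabet ** n_positions


def expected_coverage(n_clones: int, n_variants: int) -> float:
    """Expected fraction of variants sampled at least once,
    assuming every variant is equally likely to be drawn."""
    # 1 - (1 - 1/V)^N, written with log1p/expm1 to stay accurate for large V.
    return -math.expm1(n_clones * math.log1p(-1.0 / n_variants))


def clones_for_coverage(n_variants: int, coverage: float) -> int:
    """Clones needed so that the expected coverage reaches the given fraction."""
    return math.ceil(math.log(1.0 - coverage) / math.log1p(-1.0 / n_variants))


small = library_size(3)    # three fully randomised positions
large = library_size(10)   # ten fully randomised positions

print(small, clones_for_coverage(small, 0.95))
print(large, expected_coverage(10**7, large))
```

Under these assumptions, 95% expected coverage requires roughly three-fold oversampling, which is easily achieved for three randomised positions (as in the halogenase library of Fig. 5C). For ten positions, even a 10^7-membered droplet screen samples only a minute fraction of the possible sequences, so the data are necessarily sparse.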
The experimental approach used for screening determines what type of label can be attached to library members evaluated in a screening experiment. Fully quantitative datasets require cumbersome plate screening or use of high-throughput microfluidic enzyme kinetics (HT-MEK): information on multiple parameters (e.g. activity, specificity, stability) provides excellent input for ML, but the number of library members that can be characterized in such detail is practically limited to a few thousand. Higher throughput may aid better predictions, because the increased coverage of sequence space will give ML interpretation and extrapolation a better grounding. Experimental binning of survivors in ultrahigh throughput screens is practically straightforward (e.g. when using FACS25) and provides a ranking based on ‘quantitative categories’. Experimental noise (e.g. overlap of separate bins) may compromise the data quality, but the high throughput and coverage in a microfluidic screen will mitigate this problem to some extent. Binary data, where survivors are merely measured against a threshold activity, avoids possibly experimentally elusive differences between bins and simply labels survivors based on occurrence. Binned and binary data can be obtained straightforwardly in ultrahigh throughput droplet screening, where multimillion membered libraries can be interrogated to come to grips with the combinatorial explosion of higher order interactions. The nature of the quantitative data plays a role: rankings based on lysate assays vs. expression-normalised assays, long-term conversion vs. initial rates, turnover of (undemanding) model substrates vs. (unreactive) natural substrates etc. will be different, so ML interpretations will be biased accordingly. Interpretations of these datasets need to deconvolute the combined effects of stability and activity that contribute differently to the range of quantitative descriptors outlined above.
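The difference between binned and binary labels can be shown with a small sketch. The signal values, sorting gates and threshold are invented for illustration; in an experiment they would correspond to the fluorescence (or other readout) gates set on the sorter.

```python
from bisect import bisect_right
from typing import List, Sequence


def binned_label(signal: float, gates: Sequence[float]) -> int:
    """Index of the sorting bin for a signal, given ascending gate boundaries.
    0 = below the first gate, len(gates) = at or above the last gate."""
    return bisect_right(gates, signal)


def binary_label(signal: float, threshold: float) -> int:
    """1 if the variant survives the threshold, otherwise 0."""
    return int(signal >= threshold)


# Hypothetical per-variant readouts (arbitrary units) and gate positions.
signals: List[float] = [0.3, 1.2, 4.8, 9.5]
gates = [1.0, 3.0, 8.0]

bins = [binned_label(s, gates) for s in signals]
survivors = [binary_label(s, 3.0) for s in signals]
print(bins)       # ordinal 'quantitative categories'
print(survivors)  # survivor / non-survivor
```

The binned labels retain a ranking between variants, whereas the binary labels record only whether a variant passed the threshold, so the choice between regression, ordinal and classification models follows from the screening format.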
Finally, the experimental approach for sequencing determines the information content further: short reads neglect long-range interactions, but provide deeper information on limited complexity. One objective in this phase of research at the interface of ML and experiment will be to reflect on how these set-up considerations impact interpretations, even though more data must always be best.
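Why short reads lose long-range information can be seen from a simple check of whether two mutated codons fit on a single read. The read length of 150 nucleotides is an assumed, typical short-read value, and the sketch ignores primers, adapters and paired-end designs.

```python
def span_nt(codon_a: int, codon_b: int) -> int:
    """Nucleotides needed to cover both codons (1-based codon numbers)."""
    first, last = sorted((codon_a, codon_b))
    return (last - first + 1) * 3


def linked_on_one_read(codon_a: int, codon_b: int, read_length_nt: int = 150) -> bool:
    """True if both codons can be observed together on one read of this length."""
    return span_nt(codon_a, codon_b) <= read_length_nt


print(linked_on_one_read(10, 40))   # nearby positions
print(linked_on_one_read(10, 250))  # distant positions
```

Mutations that cannot be observed on the same read cannot be assigned to the same clone, so their combined (epistatic) effect is not recorded in short-read data.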
Both approaches discussed here, ultrahigh throughput screening and machine learning, have thus far mainly been used as powerful discovery engines of new and improved proteins. To be more than discovery tools, the current challenge is to coordinate the ability of ultrahigh throughput screening to generate large datasets with ML’s potential to read and interpret complex messages, be it on catalysis, molecular recognition or protein evolution. To be useful in this respect, datasets need to be large, well-labelled, diverse and of good quality. Noisy data needs to be paired with robust ML algorithms, to avoid overfitting the noise inherent in the data. Open access protocols for both ML and uHT screening should be made available, to make data compatible and interpretations comparable.
Once the screening/ML interface becomes more established it will be interesting to probe whether alternative models applied to the same dataset lead to similar molecular conclusions: if current predictions already reliably yield robust and stable proteins (e.g. with higher Tm), will the molecular patterns that lead to higher catalytic efficiency also be revealed? The two properties are intertwined (e.g. stability enables catalytic improvement through epistatic interactions) and may be difficult to disaggregate. However, obtaining multiple datasets under different conditions – at varying temperatures or pH or with different substrates – would lead to sequence–function relationships familiar from traditional lower throughput research (e.g. pH-rate profiles, temperature denaturation curves, physical organic analysis of molecular recognition of substrates with varying reactivity or steric requirements), but apply them to many enzyme mutants in one go. If it becomes possible to isolate and understand the molecular responses to such variations, then ML will have made ultrahigh throughput screening a mechanistic tool, able to deal with the challenge of enormous complexity that thus far has made protein engineering more difficult than the original protein engineers envisaged.
Footnote |
† Equal contribution. |
This journal is © The Royal Society of Chemistry 2024 |