Luca Naef (VantAI)
🔥 What are the biggest advancements in the field you noticed in 2023?
1️⃣ Increasing multi-modality & modularity — as shown by the emergence of initial co-folding methods for both proteins & small molecules, diffusion and non-diffusion-based, extending on the success of AF2: DiffusionProteinLigand in the last days of 2022 and RoseTTAFold All-Atom, AlphaFold-latest and Umol by the end of 2023. We are also seeing models that co-train on sequence & structure (SAProt, ProstT5) and on sequence, structure & surface (ProteinINR). There is a general revival of surface-based methods after a quieter 2021 and 2022: DiffMaSIF, SurfDock, and ShapeProt.
2️⃣ Datasets and benchmarks. Datasets, especially synthetic/computationally derived: ATLAS and the MDDB for protein dynamics. MISATO, SPICE, Splinter for protein-ligand complexes, QM1B for molecular properties. PINDER: large protein-protein docking dataset with matched apo/predicted pairs and benchmark suite with retrained docking models. CryoET data portal for CryoET. And a whole host of welcome benchmarks: PINDER, PoseBusters, and PoseCheck, with a focus on more rigorous and practically relevant settings.
3️⃣ Creative pre-training strategies to get around the sparsity of diverse protein-ligand complexes: Van-der-Mers training (DockGen) and sidechain training strategies in RF-AA, pre-training on ligand-only complexes from the CCD in RF-AA, and multi-task pre-training (Unimol and others).
🏋️ What are the open challenges that researchers might overlook?
1️⃣ Generalization. DockGen showed that current state-of-the-art protein-ligand docking models completely lose predictive power when asked to generalise towards novel protein domains. We see a similar phenomenon in the AlphaFold-latest report, where performance on novel proteins & ligands drops heavily to below biophysics-based baselines (which have access to holo structures), despite very generous definitions of novel protein & ligand. This indicates that existing approaches might still largely rely on memorization, an observation that has been extensively argued over the years.
2️⃣ The curse of (simple) baselines. A recurring topic over the years: 2023 has again shown what industry practitioners have long known, namely that in many practical problems such as molecular generation, property prediction, docking, and conformer prediction, simple baselines or classical approaches often still outperform ML-based approaches in practice. This was documented again throughout 2023 by Tripp et al., Yu et al., and Zhou et al. A concrete instance of such a baseline is sketched below.
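To make the baseline point concrete, here is a minimal sketch of one such simple approach, a k-nearest-neighbour regressor over Morgan fingerprints, assuming RDKit and scikit-learn are available; the SMILES strings and property values are placeholders, not real data.

```python
# Minimal sketch of a "simple baseline" for property prediction:
# k-nearest neighbours over Morgan fingerprints (placeholder data).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.neighbors import KNeighborsRegressor

def morgan_fp(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Encode a molecule as a Morgan fingerprint bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

train_smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]  # placeholder molecules
train_y = np.array([0.1, 0.3, 1.2, 0.5])              # placeholder property values

X_train = np.stack([morgan_fp(s) for s in train_smiles])
baseline = KNeighborsRegressor(n_neighbors=3, metric="jaccard").fit(X_train, train_y)
print(baseline.predict(morgan_fp("CCC")[None, :]))    # predict for a new molecule
```

In many published comparisons, beating this kind of nearest-neighbour-over-fingerprints baseline convincingly is already a nontrivial bar for a deep model.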
🔮 Predictions for 2024!
“In 2024, data sparsity will remain top of mind and we will see a lot of smart ways to use models to generate synthetic training data. Self-distillation in AlphaFold2 served as a big inspiration, as does Confidence Bootstrapping in DockGen, which leverages the insight that we now have sufficiently powerful models that can score poses but not always generate them, first realised in 2022.” — Luca Naef (VantAI)
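As a rough illustration of that score-but-not-generate insight (and not the actual Confidence Bootstrapping algorithm), the recipe can be sketched as a filter-and-retrain loop; `generate_poses`, `confidence`, and `retrain` below are hypothetical stand-ins for a pose generator, a confidence head, and a training step.

```python
# Self-training on model-filtered poses: a sketch of the general recipe,
# not the DockGen implementation. All callables are hypothetical stubs.
from typing import Callable, List, Tuple

Pose = Tuple[str, list]  # hypothetical (complex_id, coordinates) pair

def bootstrap_round(
    targets: List[str],
    generate_poses: Callable[[str, int], List[Pose]],  # hypothetical generator
    confidence: Callable[[Pose], float],               # hypothetical scorer
    retrain: Callable[[List[Pose]], None],             # hypothetical trainer
    n_samples: int = 32,
    threshold: float = 0.9,
) -> List[Pose]:
    """Sample poses, keep only the high-confidence ones, retrain on them."""
    accepted = []
    for target in targets:
        for pose in generate_poses(target, n_samples):
            if confidence(pose) >= threshold:  # scoring is easier than generating
                accepted.append(pose)
    retrain(accepted)  # the model's own filter supplies the synthetic labels
    return accepted
```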
2️⃣ We will see more biological/chemical assays purpose-built for ML or only making sense in a machine learning context (i.e., they might not lead to biological insight by themselves but be primarily useful for training models). An example from 2023 is the large-scale protein folding experiments by Tsuboyama et al. This move might be driven by techbio startups, where we have seen the first foundation models built on such ML-purpose-built assays for structural biology, e.g. ATOM-1.
Andreas Loukas (Prescient Design, part of Genentech)
🔥 What are the biggest advancements in the field you noticed in 2023?
“In 2023, we started to see some of the challenges of equivariant generation and representation for proteins to be resolved through diffusion models.” — Andreas Loukas (Prescient Design)
1️⃣ We also noticed a shift towards approaches that model and generate molecular systems at higher fidelity. For instance, the most recent models adopt a fully end-to-end approach by generating backbone, sequence and side-chains jointly (AbDiffuser, dyMEAN) or at least solve the problem in two steps but with a partially joint model (Chroma), as compared to backbone generation followed by inverse folding as in RFDiffusion and FrameDiff. Other attempts to improve the modelling fidelity can be found in the latest co-folding tools (AlphaFold-latest, RoseTTAFold All-Atom), which render them sensitive to non-protein components (ligands, prosthetic groups, cofactors), as well as in papers that attempt to account for conformational dynamics (see discussion above). In my view, this line of work is essential because the binding behaviour of molecular systems can be very sensitive to how atoms are placed, move, and interact.
2️⃣ In 2023, many works also attempted to get a handle on binding affinity by learning to predict the effect of mutations of a known crystal structure, pre-training on large corpora such as computationally predicted mutations (Graphinity) and on side-tasks such as rotamer density estimation. The obtained results are encouraging, as they can significantly outperform semi-empirical baselines like Rosetta and FoldX. However, there is still significant work to be done to render these models reliable for binding affinity prediction.
3️⃣ I have further observed a growing recognition of protein language models (pLMs), and specifically ESM, as valuable tools, even among those who primarily favour geometric deep learning. These embeddings are used to help docking models, allow the construction of simple yet competitive predictive models for binding affinity prediction (Li et al. 2023), and can generally offer an efficient method to create residue representations for GNNs that are informed by the extensive proteome data without the need for extensive pretraining (Jamasb et al. 2023). However, I do maintain a concern regarding the use of pLMs: it is unclear whether their effectiveness is due to data leakage or genuine generalisation. This is particularly pertinent when evaluating models on tasks like amino-acid recovery in inverse folding and conditional CDR design, where distinguishing between these two factors is crucial.
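As an illustration of that residue-representation recipe, the sketch below pulls per-residue embeddings from a small ESM-2 checkpoint via the fair-esm package (an assumption: `pip install fair-esm`) and treats them as ready-made node features; the downstream GNN is left abstract.

```python
# Per-residue pLM embeddings as GNN node features (sketch, assuming the
# fair-esm package; a small ESM-2 checkpoint keeps it lightweight).
import torch
import esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder sequence
_, _, tokens = batch_converter([("query", sequence)])

with torch.no_grad():
    out = model(tokens, repr_layers=[12])  # layer 12 is the final layer here

# Drop the BOS/EOS tokens: one 480-dim embedding per residue, ready to be
# used as node features in a structure-based GNN.
node_features = out["representations"][12][0, 1 : len(sequence) + 1]
print(node_features.shape)  # torch.Size([33, 480])
```

The leakage concern applies unchanged here: any evaluation on sequence-recovery-style tasks should check whether the test proteins overlap with the pLM's pretraining corpus.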
🏋️ What are the open challenges that researchers might overlook?
1️⃣ Working with energetically relaxed crystal structures (and, even worse, folded structures) can significantly affect the performance of downstream predictive models. This is especially true for the prediction of protein-protein interactions (PPIs). In my experience, the performance of PPI predictors severely deteriorates when they are given a relaxed structure as opposed to the bound (holo) crystallised structure.
2️⃣ Though successful in silico antibody design has the capacity to revolutionise drug design, general protein models are not (yet?) as good at folding, docking or generating antibodies as antibody-specific models are. This is perhaps due to the low conformational variability of the antibody fold and the distinct binding mode between antibodies and antigens (loop-mediated interactions that can involve a non-negligible entropic component). Perhaps for the same reasons, the de novo design of antibody binders (that I define as 0-shot generation of an antibody that binds to a previously unseen epitope) remains an open problem. Currently, experimentally confirmed cases of de novo binders involve mostly stable proteins, like alpha-helical bundles, that are common in the PDB and harbour interfaces that differ substantially from epitope-paratope interactions.
3️⃣ We are still lacking a general-purpose proxy for binding free energy. The main issue here is the lack of high-quality data of sufficient size and diversity (esp. co-crystal structures). We should therefore be cognizant of the limitations of any such learned proxy during model evaluation: though predicted binding scores that fall outside the distribution of known binders are a clear signal that something is off, we should avoid the typical pitfall of trying to demonstrate the superiority of our model in an empirical evaluation by showing how it leads to even higher scores.
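One cheap guard that follows from this point: treat a designed binder's proxy score as suspicious when it falls far outside the score distribution of known binders, rather than as evidence of a stronger design. A minimal sketch, where `known_binder_scores` stands in for scores from any learned proxy:

```python
# Flag out-of-distribution proxy scores instead of rewarding them (sketch;
# the scores below are synthetic placeholders for any learned proxy).
import numpy as np

def score_sanity_check(design_score: float, known_binder_scores: np.ndarray,
                       z_max: float = 3.0) -> bool:
    """Return True if the design's score is within the known-binder range."""
    mu, sigma = known_binder_scores.mean(), known_binder_scores.std()
    z = (design_score - mu) / sigma
    # A score far beyond anything the proxy was trained on is a red flag,
    # not evidence of a better binder.
    return abs(z) <= z_max

known = np.random.normal(loc=-8.0, scale=1.5, size=500)  # placeholder scores
print(score_sanity_check(-9.1, known))   # plausible -> True
print(score_sanity_check(-25.0, known))  # suspiciously "good" -> False
```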
Dominique Beaini (Valence Labs, part of Recursion)
“I’m excited to see a very large community being built around the problem of drug discovery, and I feel we are on the brink of a new revolution in the speed and efficiency of discovering drugs.” — Dominique Beaini (Valence Labs)
What work got me excited in 2023?
I am confident that machine learning will allow us to tackle rare diseases quickly, stop the next COVID-X pandemic before it can spread, and live longer and healthier. But there's a lot of work to be done and there are a lot of challenges ahead, some bumps in the road, and some canyons on the way. Speaking of communities, you can visit the Valence Portal to keep up-to-date with the 🔥 news in ML for drug discovery.
What are the hard questions for 2024?
⚛️ A new generation of quantum mechanics. Machine learning force-fields, often based on equivariant and invariant GNNs, have been promising us a treasure: the precision of density functional theory, but thousands of times faster and at the scale of entire proteins. Although some steps were made in this direction with Allegro and MACE-MP, current models do not generalize well to unseen settings and very large molecules, and they are still too slow to be applicable on the timescales that are needed 🐢. For generalization, I believe that bigger and more diverse datasets are the most important stepping stones. For computation time, I believe we will see models that enforce equivariance less strictly, such as FAENet. But efficient sampling methods will play a bigger role: spatial sampling, such as using DiffDock to get more interesting starting points, and time sampling, such as TimeWarp to avoid simulating every frame. I'm really excited by the big STEBS 👣 awaiting us in 2024: Spatio-temporal equivariant Boltzmann samplers.
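For readers who want to get a feel for these foundation force fields, here is a minimal sketch of a single-point energy evaluation with MACE-MP through ASE; it assumes the mace-torch and ase packages are installed, and model/keyword names may differ across versions.

```python
# Single-point energy with the MACE-MP foundation force field (sketch;
# assumes `pip install mace-torch ase`; API details may vary by version).
from ase.build import molecule
from mace.calculators import mace_mp

atoms = molecule("H2O")                          # toy system from ASE's library
atoms.calc = mace_mp(model="medium", device="cpu")
print(atoms.get_potential_energy())              # energy in eV, DFT-like accuracy
```

The generalization caveat above applies: the further a system strays from the training distribution (size, chemistry, conditions), the less these numbers can be trusted.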
🕸️ Everything is connected. Biology is inherently multimodal 🙋🐁 🧫🧬🧪. One cannot simply decouple the molecule from the rest of the biological system. Of course, that's how ML for drug discovery was done in the past: simply build a model of the molecular graph and fit it to experimental data. But we have reached a critical point 🛑, no matter how many trillion parameters the GNN model has, how much data it is trained on, or how many experts are mixtured together. It is time to bring biology into the mix, and the most straightforward way is with multi-modal models. One method is to condition the output of the GNN on the target protein sequence, as in MocFormer. Another is to use microscopy images or transcriptomics to better inform the model of the biological signature of molecules, as in TranSiGen. Yet another is to use LLMs to embed contextual information about the tasks, as in TwinBooster. Or even better, combining all of these together 🤯, but this could take years. The main issue for the broader community seems to be the availability of large amounts of quality, standardized data, but fortunately, this is not an issue for Valence.
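A minimal late-fusion version of that conditioning idea, with the molecule and context encoders abstracted away into precomputed embeddings and plain PyTorch standing in for the actual MocFormer/TranSiGen-style architectures (all sizes are placeholder choices):

```python
# Late-fusion conditioning of a molecule representation on biological
# context (sketch; embedding sizes and encoders are placeholder choices).
import torch
import torch.nn as nn

class ConditionedPredictor(nn.Module):
    def __init__(self, mol_dim: int = 256, ctx_dim: int = 1280, hidden: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(mol_dim + ctx_dim, hidden),  # concat molecule + context
            nn.ReLU(),
            nn.Linear(hidden, 1),                  # e.g. activity for this target
        )

    def forward(self, mol_emb: torch.Tensor, ctx_emb: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([mol_emb, ctx_emb], dim=-1))

# mol_emb could come from a molecular GNN; ctx_emb from a pLM, a microscopy
# image encoder, or a transcriptomics encoder. The fusion step stays the same.
model = ConditionedPredictor()
mol_emb, ctx_emb = torch.randn(4, 256), torch.randn(4, 1280)
print(model(mol_emb, ctx_emb).shape)  # torch.Size([4, 1])
```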
🔬 Relating biological knowledge and observables. Humans have been trying to map biology for a long time, building relational maps for genes 🧬, protein-protein interactions 🔄, metabolic pathways 🔀, etc. I invite you to read this review of knowledge graphs for drug discovery. But all this knowledge often sits unused and ignored by the ML community. I feel that this is an area where GNNs for knowledge graphs could prove very useful, especially in 2024, and it could provide another modality for the 🕸️ point above. Considering that human knowledge is incomplete, we can instead recover relational maps from foundation models. This is the route taken by Phenom1 when trying to recall known genetic relationships. However, having to deal with various knowledge databases is an extremely complex task that we can't expect most ML scientists to tackle alone. But with the help of artificial assistants like LOWE, this can be done in a matter of seconds.
🏆 Benchmarks, benchmarks, benchmarks. I can't repeat the word benchmark enough. Alas, benchmarks will stay the unloved kid on the ML block 🫥. But if the word benchmark is uncool, its cousin competition is way cooler 😎! Just as the OGB-LSC competition and the Open Catalyst challenge played a major role for the GNN community, it is now time for a new series of competitions 🥇. We even got TGB (the Temporal Graph Benchmark) recently. If you were at NeurIPS'23, then you probably heard of Polaris coming up in early 2024 ✨. Polaris is a consortium of multiple pharma companies and academic groups trying to improve the quality of available molecular benchmarks to better represent real drug discovery. Perhaps we'll even see a benchmark suitable for molecular graph generation instead of optimizing QED and cLogP, but I wouldn't hold my breath; I have been waiting for years. What kind of new, crazy competition will light up the GDL community this year 🤔?