Accelerating Protein Structure Prediction at Proteome-Scale

Source
Accelerating Protein Structure Prediction at Proteome-Scale

Proteins rarely function in isolation as individual monomers. Most biological processes are governed by proteins interacting with other proteins, forming protein complexes whose structures are described in the hierarchy of protein structure as the quaternary representation. This represents one level of complexity up from tertiary representations, the 3D structure of monomers, which are commonly known since the emergence of AlphaFold2 and the creation of the Protein Data Bank. However, structural information for the vast majority of complexes remains unavailable.

The AlphaFold Protein Structure Database (AFDB), jointly developed by Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI), transformed access to monomeric protein structures, but interaction-aware structural biology at the proteome scale has remained a bottleneck with unique challenges. These include massive combinatorial interaction space, high computational cost for multiple sequence alignment (MSA) generation, and protein folding, as well as inference scaling across millions of complexes.

In recent work, we extended the AFDB with large-scale predictions of homomeric protein complexes generated by a high-throughput pipeline based on AlphaFold-Multimer, made possible by NVIDIA accelerated computing. We also predicted heteromeric complexes to compare the accuracy of different complex prediction modalities. For these predictions, we leveraged kernel-level accelerations from MMseqs2-GPU for MSA generation and NVIDIA TensorRT for deep-learning-based protein folding.

This article describes the major principles we adopted to increase protein folding throughput, from adopting libraries and SDKs to optimizations that reduce the computational complexity of the workload. These principles can help you set up a similar pipeline by borrowing from the techniques we used to create this new dataset.

If you are a computational biologist, AI researcher, HPC engineer, or bioinformatician, you will learn how to design a proteome-scale complex prediction strategy, separate MSA generation from structure inference for efficiency, and scale AlphaFold-Multimer workflows across GPU clusters.

Related articles