How Generative AI Accelerates Protein Science Research

1. How Generative AI Accelerates Protein Research. ByteDance Research / Zaixiang Zheng (郑在翔)
3. We’re doing AI for Science at ByteDance Research
AI Protein Modeling & Design
◼ Learning Harmonic Molecular Representations on Riemannian Manifold. In ICLR 2023
◼ On Pre-training Language Model for Antibody. In ICLR 2023
◼ Structure-informed Language Models Are Protein Designers. In ICML 2023 (oral)
◼ Diffusion Language Models Are Versatile Protein Learners. In ICML 2024.
◼ Protein Conformation Generation via Force-Guided SE(3) Diffusion Models. In ICML 2024.
◼ Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization. preprint. 2024
Small Molecule Design
◼ Regularized Molecular Conformation Fields. In NeurIPS 2022
◼ Zero-Shot 3D Drug Design by Sketching and Generating. In NeurIPS 2022
◼ Diffusion Models with Decomposed Priors for Structure-Based Drug Design. In ICML 2023
◼ DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization. In ICLR 2024
Cryo-EM
◼ CryoSTAR: Leveraging Structural Prior and Constraints for Cryo-EM
4. LM-Design: steering large protein LMs to design protein sequences as structure-conditioned sequence generative models _____________________ Structure-informed Language Models Are Protein Designers. In ICML 2023 (oral)
5. DPLM: A Versatile Protein Foundation Model _____________________ Diffusion Language Models Are Versatile Protein Learners. In ICML 2024.
6. AbDPO: designing antibodies with energy-based DPO _____________________ Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization. 2024 (under review)
7. Small Molecule Drug Design: DecompDiff _____________________ Diffusion Models with Decomposed Priors for Structure-Based Drug Design. In ICML 2023
8. ConDiff: Protein Dynamic Conformation Generation with Physics-guided SE(3) Diffusion Model _____________________ Protein Conformation Generation via Force-Guided SE(3) Diffusion Models. In ICML 2024.
9. Cryo-EM Heterogeneous Reconstruction with CryoSTAR _____________________ CryoSTAR: Leveraging Structural Prior and Constraints for Cryo-EM Heterogeneous Reconstruction. 2023 (under review)
10. Outline
Background
◼ Basics of Generative AI, LLM & Diffusion
◼ Basics of Protein
Generative AI x Protein
◼ LLM & Diffusion in AI for Protein, AlphaFold & Protein Language Model
Large-scale Generative Protein Modeling & Design in ByteDance Research
◼ LM-Design: Sequence design for given structure w/ protein LLMs
◼ DPLM: A versatile protein foundation model w/ LLM + Diffusion
◼ One more thing: Towards next-gen multimodal protein foundation model?
12. Amazing things that generative AI can do: vision AIs create art; large LMs speak; AlphaFold learns protein folding
13. Deep generative modeling: Learning to generate data “Creating noise from data is easy; Creating data from noise is generative modeling.” — Dr. Yang Song [Score-based SDEs]
14. Deep Generative Models
◼ Implicit generative models: non-probabilistic
◼ Likelihood-based models: probabilistic, w/ or w/o latent variables
◼ Autoregressive models
15. Autoregressive Language Models. Learning goal: Maximum Likelihood Estimation (MLE)
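The data, model, and objective on this slide were figures in the original deck and did not survive extraction. In the standard textbook formulation (an assumption about what the slide showed; the symbols $x_t$, $\theta$, and $\mathcal{D}$ are notation chosen here), they would read roughly as follows:

```latex
\begin{aligned}
\text{Data:}\quad & x = (x_1, \dots, x_T), \quad x \sim \mathcal{D} \\
\text{Model:}\quad & p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}) \\
\text{Learning goal (MLE):}\quad & \max_\theta \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \right]
\end{aligned}
```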
16. AR-LMs generate data element by element
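To make "element by element" concrete, here is a minimal sketch of autoregressive sampling and chain-rule scoring. The bigram table `BIGRAM` and its probabilities are hypothetical toy values rather than anything from the talk, and a real AR-LM would condition on the whole prefix with a Transformer rather than on the previous token only.

```python
import math
import random

# Hypothetical toy bigram model: next-token probabilities given the previous token.
BIGRAM = {
    "<s>": {"M": 1.0},
    "M": {"K": 0.7, "A": 0.3},
    "K": {"T": 0.6, "</s>": 0.4},
    "A": {"K": 1.0},
    "T": {"</s>": 1.0},
}

def sample_autoregressive(rng, max_len=20):
    """Generate tokens one at a time, each conditioned on what came before."""
    seq, prev = [], "<s>"
    for _ in range(max_len):
        tokens, probs = zip(*BIGRAM[prev].items())
        nxt = rng.choices(tokens, weights=probs, k=1)[0]
        if nxt == "</s>":  # end-of-sequence token stops generation
            break
        seq.append(nxt)
        prev = nxt
    return "".join(seq)

def log_likelihood(seq):
    """Chain rule: sum the log-probability of each token given its predecessor."""
    prev, total = "<s>", 0.0
    for tok in list(seq) + ["</s>"]:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total

sample = sample_autoregressive(random.Random(0))
```

Maximum likelihood training would adjust the table entries (or network weights) so that `log_likelihood` is as large as possible on the training sequences.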
17. Transformer: Attention (over pairs) is all you need. Learning goal: Maximum Likelihood Estimation (MLE)
18. Diffusion Models: Learning to generate by iterative denoising
19. Notable milestones of generative AI → Multimodal ALL-IN-ONE
21. AI is revolutionizing structural biology
22. Protein: The central dogma of molecular biology _____________________ Credit to Ellen Zhong: The content of the following slides for introduction to structural biology is mostly modified from Ellen Zhong’s keynote speech at
23. Structural biology: The study of proteins and other biomolecules through their 3D structure
25. All essential biological processes are carried out by proteins and protein complexes
26. Many proteins are enzymes that catalyze chemical reactions
27. Protein: data modalities
sequence ⇔ structure ⇔ function; protein folding (seq ⇒ struct)
◼ A sequence over 20 amino acids (AAs)
◼ In solvent, it folds into a unique 3D spatial structure with minimal free energy
◼ Structure determines protein function
[Figure labels: amino acids; primary sequence/chain; secondary structures (α-helix, β-pleated sheet); tertiary structures/folds; quaternary structures (2+ chains that interact); conformation energy landscape; protein-protein interaction]
28. Sequence of amino acids
Sequence: 20 types of amino acids
Structure (mainly backbone)
◼ 3D XYZ coordinates
◼ Local reference frames (AF2 style)
• Cα coords + orientation
◼ Torsion angles
◼ Contact/distance map
29. Atomic coordinates of protein 3D structures
Structure (mainly backbone)
◼ 3D XYZ coordinates
◼ Local reference frames (AF2 style)
• Cα coords + orientation
◼ Pair-wise contact/distance map
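As an illustration of how the coordinate and map representations relate, the following sketch derives a pairwise distance map and a binary contact map from Cα coordinates. The four coordinates and the 8 Å cutoff are assumptions made for this example (8 Å is a commonly used contact threshold); neither comes from the slides.

```python
import math

def distance_map(ca_coords):
    """Pairwise Euclidean distances between C-alpha atoms (same units as the input)."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)] for i in range(n)]

def contact_map(ca_coords, cutoff=8.0):
    """Binary map: 1 where two residues lie within `cutoff` of each other, else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in distance_map(ca_coords)]

# Hypothetical C-alpha coordinates (angstroms) for four residues on a straight line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

dm = distance_map(coords)
cm = contact_map(coords)
```

Unlike raw XYZ coordinates, both maps are unchanged when the whole structure is rotated or translated, which is one reason they are a convenient structure representation.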
30. Protein modalities: sequence, structure, and in between (folding and inverse folding)
◼ Folding (amino acid sequence ⇒ protein 3D structure): AlphaFold, RoseTTAFold, ESMFold, etc.
◼ (Structure-based) sequence design, aka inverse folding (protein 3D structure ⇒ amino acid sequence)
Example amino acid sequence: DIVLTQSPSSLSASLGDTITITCHASQNINVWLSWYQQKPGNIPKLLIYKASNLHTGVPSRFSGSGSGTGFTLTISSLQPEDIATYYCQQGQSYPLTFGGGT…….
_____________________ pdb id: 1IGT. from
31. Designing protein sequence and structure as generative modeling problems
◼ Folding = conditional structure generation (amino acid sequence ⇒ protein 3D structure)
◼ Inverse folding = conditional sequence generation (protein 3D structure ⇒ amino acid sequence)
◼ Sequence design, structure design, and sequence-structure co-design
Example amino acid sequence: DIVLTQSPSSLSASLGDTITITCHASQNINVWLSWYQQKPGNIPKLLIYKASNLHTGVPSRFSGSGSGTGFTLTISSLQPEDIATYYCQQGQSYPLTFGGGT…….
_____________________ pdb id: 1IGT. from
32. Discreteness in NLP and biology: protein languages. Goal: learning the joint probability of a sequence of discrete tokens. Factorization (w.r.t. the structure of the data) needed! _____________________ Noelia Ferruz & Birte Höcker. 2022. Controllable protein design with language models. Nature Machine Intelligence.
34. Diffusion(-like) modeling for Protein Structure: AlphaFold
35. AlphaFold 2: A solution to a 50-year-old grand challenge in biology _____________________ Highly accurate protein structure prediction with AlphaFold. John Jumper, et al. Nature. 2021
38. Delve into AF: sequence-conditional structure generation ◼ A solution to a 50-year-old grand challenge in biology _____________________ Highly accurate protein structure prediction with AlphaFold. John Jumper, et al. Nature. 2021
39. Delve into AF: homologous retrieval-augmented generation (RAG) ◼ A solution to a 50-year-old grand challenge in biology _____________________ Highly accurate protein structure prediction with AlphaFold. John Jumper, et al. Nature. 2021
40. Delve into AF: encoder-decoder & Transformers _____________________ Highly accurate protein structure prediction with AlphaFold. John Jumper, et al. Nature. 2021
41. Delve into AF: diffusion(-like) recycling / iterative refinement
42. AlphaFold: summary. AF ≈ conditional structure generation w/ Transformer + RAG + Diffusion
43. Protein Sequence Modeling with LLMs
44. Recap: LLMs for natural language Simple & universal law of the scale: the larger the merrier _____________________ Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama et al. "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research. 2022 _____________________ Yao Fu, Hao Peng and Tushar Khot. How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources. 2022. On Yao Fu’s Notion
45. Protein language models (pLMs)
Two types of commonly used protein LMs
◼ Protein sequence encoder: predictive models for classification and regression
• formulation: pseudo-likelihood ∏_i p(a_i | {a_{j≠i} ∈ seq}) by MLM, DAE, etc.
• instance (BERT-like): ESM-1b, ESM-2 series
◼ Protein sequence decoder: generative models for learning distributions and synthesizing sequences
• formulation: likelihood ∏_i p(a_i | {a_{j<i} ∈ seq}) by autoregressive/causal LM
• instance (GPT-like): ProGen2, ProGPT
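Written out in LaTeX, the two formulations on this slide look as follows. The sequence length $L$ is notation introduced here, not taken from the slide.

```latex
\begin{aligned}
\text{Encoder (pseudo-likelihood):}\quad & \prod_{i=1}^{L} p\left(a_i \mid \{a_j\}_{j \neq i}\right) \\
\text{Decoder (likelihood):}\quad & p(a_1, \dots, a_L) = \prod_{i=1}^{L} p\left(a_i \mid \{a_j\}_{j < i}\right)
\end{aligned}
```

Only the second product is an exact chain-rule factorization of the joint distribution. The first conditions every residue on all the others, so in general it is an approximation rather than a normalized joint probability.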
46. ESM: Evolutionary Scale Modeling. BERT analog for proteins. Learned with MLM (15% of tokens selected; of those, 80%/10%/10% masked/randomized/unchanged) on 250M sequences. ESM-1b: 650M params. _____________________ Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., Fergus, R., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million
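The "15% 8/1/1" shorthand can be sketched in a few lines, reading it as the BERT-style corruption recipe: select 15% of positions, then mask 80% of them, randomize 10%, and leave 10% unchanged. The function name `mlm_corrupt` and the `<mask>` token string are illustrative choices, not the ESM codebase's API.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # placeholder mask token (hypothetical spelling)

def mlm_corrupt(seq, rng, select_rate=0.15):
    """Return (corrupted tokens, target positions) under the 80/10/10 rule."""
    tokens = list(seq)
    targets = []
    for i in range(len(seq)):
        if rng.random() >= select_rate:
            continue  # position not selected: no loss is computed here
        targets.append(i)
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK  # 80%: replace with the mask token
        elif r < 0.9:
            tokens[i] = rng.choice(AMINO_ACIDS)  # 10%: replace with a random residue
        # remaining 10%: keep the original residue unchanged
    return tokens, targets

example_seq = "MKTVRQERLKSIVRILER" * 20
corrupted, target_positions = mlm_corrupt(example_seq, random.Random(0))
```

The model is then trained to recover the original residue at each position in `target_positions`, given the corrupted sequence as input.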
47. ProGen: next ChatGPT for proteins? GPT-like autoregressive model on sequences _____________________ Madani, A., Krause, B., Greene, E.R., Subramanian, S., Mohr, B.P., Holton, J.M., Olmos Jr, J.L., Xiong, C., Sun, Z.Z., Socher, R. and Fraser, J.S. Large language models generate functional protein sequences
48. ESM-2 series: scaling makes a difference. Scaling is all you need, just as in LLMs for natural languages ◼ emergent abilities: structural awareness ◼ phase transition at a certain scale threshold _____________________ Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos Santos Costa, A., 2022. Evolutionary-scale prediction of atomic level protein structure with a
49. ESMFold: Protein Folding using pLMs at scale ESMFold: pLMs at scale enable single-sequence structure prediction ◼ ESM-2 + structural module ◼ Comparable to AF2, but needing no homologs and 60x faster _____________________ Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and
50. (LLM + Diffusion) x Protein: Large-scale Generative Protein Modeling & Design
53. Zaixiang Zheng¹*, Yifan Deng²*, Dongyu Xue¹, Yi Zhou¹, Fei Ye¹ and Quanquan Gu¹. ¹ByteDance Research & ²UW-Madison
54. “NMT moment” for Structure-based Sequence Design
◼ Folding = conditional structure generation (amino acid sequence ⇒ protein 3D structure)
◼ Inverse folding = conditional sequence generation (protein 3D structure ⇒ amino acid sequence)
◼ Sequence design, structure design, and sequence-structure co-design
55. Progress of DL-based protein sequence design
[Figure: accuracy vs. year (2019 to 2023), with circle radius proportional to the model scale. Models shown: Structure Transformer, DenseCPD, GVP-GNN, GCA, ProteinMPNN, ESM-IF, PiFold, LM-Design. Institutions labeled: ByteDance, Meta, U. Washington (David Baker), Westlake, MIT, Stanford, NYU.]
56. Structure-based protein sequence design/inverse folding
Definition: to find an amino acid sequence that can fold into a desired protein backbone structure, by learning a probabilistic model over a certain amount of protein structure-sequence data.
Existing work: graph-to-sequence autoregressive modeling (StructTransformer [1], ProteinMPNN [2], ESM-IF [3], etc.)
_____________________ [1] Ingraham, et al. Generative models for graph-based protein design. In NeurIPS 2019. [2] Dauparas, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science
57. Challenges and motivations
◼ Limited experimentally determined structures: <0.1% known structures vs. massive protein sequences (millions to billions)
◼ Structurally non-deterministic regions are less informative and much harder
◼ Left-to-right autoregressive models are not necessarily the best fit for spatially structured data like proteins
◼ Sequential evolutionary knowledge should be better considered
58. Large-scale protein language models can help
LANGUAGE MODELS KNOW SEQUENCES THE BEST!
◼ Protein LMs (pLMs, e.g., ESM-1b/ESM-2), learned from the universe of massive protein sequences, have demonstrated emergent evolutionary knowledge to enable amazing capabilities.
◼ pLMs are such strong sequence learners.
Q: can pLMs be better structure-based protein designers?
_____________________ [ESM] Rives, et al. Biological structure and function emerge from
59. LM-Design: reprogramming pLMs as structure-conditioned sequence generative models
Structural surgery: implanting a lightweight structural adapter into a strong pretrained pLM
◼ We focus on BERT-like MLMs (e.g., ESM-1b/ESM2) for bidirectional receptive fields
◼ LM-Design = pretrained pLM as sequence decoder (strong sequence generative capability) + structure encoder (protein structure understanding) + structural adapter (structure-sequence aligner / translator)
_____________________ protein fig credit: RFDiffusion.
60. LM-Design: reprogramming pLMs as structure-conditioned sequence generative models
◼ Training: conditional masked language modeling (CMLM) with pLMs frozen
◼ Diffusion-like inference: full-sequence iterative refinement/denoising, recycling for T times (~5 cycles)
[Figure: a structure encoder (GNNs, ProteinMPNN, GVP, IPA, etc.) feeds a structural adapter (multihead attention + FFN) attached to the sequence decoder pLM (ESM series, etc.; N Transformer layers, each with multihead attention + FFN). Example sequences in the figure: "[cls] YKTVRAGRLGSISRSLER [eos]" and "[cls] MKTVRQERLKSIVRILER [eos]"; iteratively refine × T.]
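The recycling loop described under "diffusion-like inference" can be sketched as below. This is only a schematic under strong assumptions: `toy_denoiser` is a hypothetical stand-in that corrects at most two residues per call against a hidden reference, whereas the real model would predict every position from the structure encoder and pLM. The two strings are the example sequences from the slide's figure, and treating the first as the noisy draft and the second as the refined result is an assumption.

```python
NATIVE = "MKTVRQERLKSIVRILER"  # assumed refined sequence from the slide's figure
DRAFT = "YKTVRAGRLGSISRSLER"   # assumed noisier starting sequence from the figure

def toy_denoiser(structure_hint, tokens, max_fixes=2):
    """Hypothetical stand-in for the structure-conditioned pLM.

    `structure_hint` is simply the hidden reference sequence here; each call
    corrects at most `max_fixes` mismatched positions.
    """
    out = list(tokens)
    fixes = 0
    for i, (cur, ref) in enumerate(zip(tokens, structure_hint)):
        if cur != ref and fixes < max_fixes:
            out[i] = ref
            fixes += 1
    return "".join(out)

def iterative_refine(structure_hint, draft, denoiser, num_cycles=5):
    """Feed the full predicted sequence back into the model for `num_cycles` rounds."""
    history = [draft]
    for _ in range(num_cycles):
        history.append(denoiser(structure_hint, history[-1]))
    return history

def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

history = iterative_refine(NATIVE, DRAFT, toy_denoiser)
```

The point of the sketch is the control flow: every cycle re-reads the entire current sequence, so an early mistake at any position can still be revised later, unlike left-to-right decoding.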
61-65. LM-Design improves SoTA results by a large margin (4%-12%)
◼ Non-AR modeling is a more proper probabilistic model for protein data
◼ Data- & parameter-efficient: outperforming without any additional data (<2% trainable)
◼ Modularizable: further benefit from pretrained structure encoders
◼ Standing on the shoulders of giants: structure encoders ↑, performance ↑
◼ Beyond single-chain: can also design protein complexes
66. Delve deep into LM-Design: structural validation using folding models ◼ Higher self-consistency TM-score using ESMFold: the designs can fold into the given structure ◼ AlphaFold2 predicts that our redesigns adopt the given structures more confidently than the native sequences do.
67. Dive deep into LM-Design: studies on inference/sampling ◼ Iterative refinement gives rise to accurate sequence design. ◼ LM-Design yields diverse yet more accurate designs.
68. Dive deep into LM-Design: scaling helps ◼ LM-Design works well with scaling data, augmented via predicted structures from AF2 ◼ The larger the merrier: LM-Design is scalable w.r.t. model sizes yet parameter-efficient
69. How does LM-Design improve protein design? Studies on structures ◼ LM-Design exploits the potential of both structural and sequential capabilities. ◼ LM-Design is structurally sensitive, which lets it determine functionally specific sequences.
70. Immediate zero-shot generalization to other proteins ◼ Independent held-out sets of TS50 and TS500. ◼ De novo proteins. ◼ Antibody CDRs inpainting
71. Takeaways
◼ We introduce LM-Design, a generic approach that reprograms pLMs into strong structure-based sequence designers.
◼ LM-Design is a model-agnostic, modularizable, parameter- and data-efficient framework.
◼ Take-home messages: (1) LMs are gold mines for biological sequences; (2) ❌ autoregression; ✅ iterative refinement; (3) We hope that LM-Design can serve as a powerful, universal, and easy-to-use tool as a “wrapper” that
Links: paper, code
72. “GPT-3.5 moment” for Sequence Modeling & Design
◼ Sequence modeling & design (amino acid sequence)
◼ Folding = conditional structure generation (amino acid sequence ⇒ protein 3D structure)
◼ Inverse folding = conditional sequence generation (protein 3D structure ⇒ amino acid sequence)
◼ Structure design and sequence-structure co-design
73. [ ICML 2024 ] Xinyou Wang* ♢ ♡ , Zaixiang Zheng* ♡ , Fei Ye ♡ , Dongyu Xue ♡ , Shujian Huang ♢ , and Quan ♡ ByteDance Research & ♢ Nanjing University
74. Revisit current paradigms of protein language modeling
Autoregression (AR-LM, GPT-equivalent) v.s. Masked Prediction (Masked-LM, BERT-equivalent)
◼ probabilistic model / learning objective: AR-LM uses sequential factorization (chain rule); Masked-LM assumes position-wise conditional independence
◼ receptive field: AR-LM is uni-directional / one-sided (not suited for protein); Masked-LM is bi-directional / global (well-suited for protein)
◼ generation: AR-LM uses left-to-right sequential decoding; Masked-LM is not well-defined, not a generative model
◼ understanding: AR-LM only has access to one-sided context; Masked-LM has access to bi-directional context
◼ overall: AR-LM has weak predictive capability and sub-optimal generative capability; Masked-LM has strong predictive capability and no generative capability
◼ examples: AR-LM: ProGen, ProGPT; Masked-LM: ESM-1b / ESM2 family
75. What do we really need for the next-gen protein LMs?
The analysis highlights the demand for a general-purpose and versatile protein LM that combines predictive and generative capabilities. “What you cannot create, you do not understand.” (Richard Feynman)
Key ingredients:
(1) a strong & scalable generative modeling framework to best digest the universe of massive protein sequences; and
(2) a bi-directional receptive field for better modeling residue-wise global interactions.
⇒ Protein language modeling w/ a (discrete) diffusion framework combines the best of both worlds, i.e., blending the scalable expressiveness of language models and the strong generative power of diffusion models, approaching a versatile
76. DPLM: a versatile protein LM w/ discrete diffusion
77. DPLM: a versatile protein LM w/ discrete diffusion
◼ The learning objective of discrete diffusion generalizes language modeling, covering Masked-LMs & AR-LMs
◼ Pre-trained on the universe of evolutionary-scale protein sequences
DPLM can do:
1. unconditional generation
2. learning effective representations for downstream predictive tasks
3. conditional generation spanning
• sequence conditioning
• cross-modal conditioning
• plug-&-play controllable generation towards desired preference w/ classifier guidance
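One way to picture a discrete diffusion process over sequences is the absorbing-state (masking) form sketched below, in which each residue is independently replaced by a mask symbol with probability t. The slide does not spell out the noising process, so the linear schedule, the `#` mask symbol, and the function names are all assumptions made for illustration.

```python
import random

MASK = "#"  # hypothetical mask symbol for this sketch

def forward_mask(seq, t, rng):
    """Absorbing-state forward step: mask each token independently with probability t."""
    return "".join(MASK if rng.random() < t else aa for aa in seq)

def masked_positions(noised):
    """Positions a denoising model would be trained to reconstruct."""
    return [i for i, tok in enumerate(noised) if tok == MASK]

rng = random.Random(0)
seq = "MKTVRQERLKSIVRILER"
clean = forward_mask(seq, 0.0, rng)      # t = 0: nothing is masked
half = forward_mask(seq, 0.5, rng)       # intermediate t: partially masked
all_masked = forward_mask(seq, 1.0, rng) # t = 1: every position is masked
```

Under this reading, an intermediate t yields training examples that resemble masked-LM inputs, while generation starts from the fully masked sequence and fills positions in over several denoising steps.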
78. Unconditional generation: DPLM generates structurally reasonable proteins. In unconditional protein sequence generation, DPLM produces sequences that are highly structurally plausible (i.e., average pLDDT > 85), novel, and diverse, suggesting that DPLM captures the underlying distribution of protein sequence data well.
79. Evaluation of Protein Representation Learning on Predictive Tasks. DPLM is a superior protein sequence representation learner, outperforming Masked-LM (ESM2) and AR-LM, and its performance can improve with scaling.
80. Conditional generation of DPLM for various needs
◼ Sequence conditioning (motif-scaffolding): DPLM can generate reasonable scaffolds for given functional motifs at a high success rate.
◼ Cross-modal conditioning (inverse folding): DPLM yields sequences that can accurately fold into the given backbone structure.
◼ Controllable generation towards desired preference (secondary structure guided protein sampling): DPLM enjoys plug-and-play programmability, steered to synthesize proteins that satisfy arbitrary user-defined secondary structure.
81. Takeaways - DPLM
◼ We introduce diffusion protein LM (DPLM), a versatile protein LM that is capable of both protein sequence generation and representation learning, as well as various needs of conditional generation, including sequence conditioning, cross-modal conditioning, and programmable generation with plug-and-play discrete classifier guidance.
◼ Potential future directions:
(1) Exploring DPLM’s conditional generation for wider applications.
(2) DPLM can further benefit from best practices of cutting-edge technical advancement in the vastness of large language models (LLMs).
(3) It is imperative to integrate protein structure modeling into DPLM. Developing a universal protein language model with the next-generation DPLM, which accounts for both sequence and structure, is a particularly promising avenue.
Link: paper
82. What’s next? “GPT-4 moment” for multimodal protein foundation models?
83. Towards Unified Multimodal Protein Foundation Models
◼ Folding = conditional structure generation (amino acid sequence ⇒ protein 3D structure)
◼ Inverse folding = conditional sequence generation (protein 3D structure ⇒ amino acid sequence)
◼ Sequence generation, structure generation, and sequence-structure co-design
_____________________ pdb id: 1IGT. from
84. Multimodal-DPLM: One Model Can Do Whatever You Need for Proteins
◼ unconditional structure design: noise → structure
◼ co-design: noise → (sequence, structure)
◼ folding: sequence ⇒ structure
◼ applications: e.g., designing symmetric oligomers
87. LM-Design, DPLM
