Bilingual Language Model for Protein Sequence and Structure
Abstract: Adapting large language models (LLMs) to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in a single model. We encode protein structures as token sequences using the 3Di alphabet introduced by the 3D-alignment method Foldseek. This new foundation pLM extracts the features and patterns of the resulting “structure-sequence” representation. Toward this end, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein structure-sequence T5 (ProstT5), we showed improved performance for subsequent prediction tasks and for “inverse folding”, namely the generation of novel protein sequences adopting a given structural scaffold (“fold”). Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions, and opens new research avenues in the post-AlphaFold2 era. Our model is freely available for all at https://github.com/mheinzinger/ProstT5.
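The abstract describes translating between amino-acid (AA) sequences and 3Di structure-token sequences. A minimal sketch of how inputs are commonly prepared for such a bilingual T5 model is shown below; the `<AA2fold>`/`<fold2AA>` prefix tokens and the casing convention (AA upper-case, 3Di lower-case) are assumptions taken from the ProstT5 repository README, and the helper name `prepare_input` is illustrative, not part of any published API:

```python
# Hypothetical sketch of input formatting for a bilingual sequence/structure
# model in the style of ProstT5. Assumptions (see lead-in): direction is
# signaled by a prefix token, amino acids are upper-case tokens, 3Di states
# are lower-case tokens, and tokens are single characters separated by spaces.
import re

def prepare_input(sequence: str, direction: str) -> str:
    """Format a sequence for translation.

    direction: "AA2fold" (sequence -> structure) or
               "fold2AA" (structure -> sequence).
    """
    # Map rare/ambiguous amino acids to X, as is common for pLM inputs.
    sequence = re.sub(r"[UZOB]", "X", sequence)
    if direction == "AA2fold":
        sequence = sequence.upper()   # amino-acid tokens are upper-case
    elif direction == "fold2AA":
        sequence = sequence.lower()   # 3Di tokens are lower-case
    else:
        raise ValueError(f"unknown direction: {direction}")
    # Prefix token tells the model which "language" to emit.
    return f"<{direction}> " + " ".join(sequence)

print(prepare_input("MKTAYIAKQR", "AA2fold"))
# -> <AA2fold> M K T A Y I A K Q R
```

The prefix-token design lets one shared encoder-decoder serve both translation directions, analogous to target-language tags in multilingual machine translation.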
Media type: Preprint
Year of publication: 2024
Published: 2024
Contained in: bioRxiv.org - (2024), 27 March - year:2024
Language: English
Contributors: Heinzinger, Michael [Author]
Links: Full text [free access]
Subjects:
DOI: 10.1101/2023.07.23.550085
Funding:
Funding institution / project title:
PPN (catalog ID): XBI040317277
LEADER 01000caa a22002652 4500
001 XBI040317277
003 DE-627
005 20240328090508.0
007 cr uuu---uuuuu
008 230726s2024 xx |||||o 00| ||eng c
024 7 |a 10.1101/2023.07.23.550085 |2 doi
035 |a (DE-627)XBI040317277
035 |a (biorXiv)10.1101/2023.07.23.550085
040 |a DE-627 |b ger |c DE-627 |e rakwb
041 |a eng
100 1 |a Heinzinger, Michael |e verfasserin |0 (orcid)0000-0002-9601-3580 |4 aut
245 1 0 |a Bilingual Language Model for Protein Sequence and Structure
264 1 |c 2024
336 |a Text |b txt |2 rdacontent
337 |a Computermedien |b c |2 rdamedia
338 |a Online-Ressource |b cr |2 rdacarrier
520 |a Abstract: Adapting large language models (LLMs) to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in a single model. We encode protein structures as token sequences using the 3Di alphabet introduced by the 3D-alignment method Foldseek. This new foundation pLM extracts the features and patterns of the resulting “structure-sequence” representation. Toward this end, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein structure-sequence T5 (ProstT5), we showed improved performance for subsequent prediction tasks and for “inverse folding”, namely the generation of novel protein sequences adopting a given structural scaffold (“fold”). Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions, and opens new research avenues in the post-AlphaFold2 era. Our model is freely available for all at https://github.com/mheinzinger/ProstT5.
650 4 |a Biology |7 (dpeaa)DE-84
650 4 |a 570 |7 (dpeaa)DE-84
700 1 |a Weissenow, Konstantin |0 (orcid)0000-0002-2205-1795 |4 aut
700 1 |a Sanchez, Joaquin Gomez |0 (orcid)0000-0001-8876-8660 |4 aut
700 1 |a Henkel, Adrian |4 aut
700 1 |a Mirdita, Milot |0 (orcid)0000-0001-8637-6719 |4 aut
700 1 |a Steinegger, Martin |0 (orcid)0000-0001-8781-9753 |4 aut
700 1 |a Rost, Burkhard |0 (orcid)0000-0003-0179-8424 |4 aut
773 0 8 |i Enthalten in |t bioRxiv.org |g (2024) vom: 27. März
773 1 8 |g year:2024 |g day:27 |g month:03
856 4 0 |u http://dx.doi.org/10.1101/2023.07.23.550085 |z kostenfrei |3 Volltext
912 |a GBV_XBI
951 |a AR
952 |j 2024 |b 27 |c 03