Evolla: Decoding Protein Molecular Language

Date:

Language: Chinese

Invited by the AI4Protein community.

Abstract: Sequence determines structure, and structure determines function. Recent advances in sequencing and structure prediction (e.g., AlphaFold) have made it possible to obtain protein sequences and structures at scale. A natural next question is how to derive textual, functional descriptions directly from protein sequence and structure. Evolla is a protein-to-text multimodal large language model trained on 546 million (protein, question, answer) triplets, comprising over 150 billion word tokens. After pretraining, Evolla is further refined with Direct Preference Optimization (DPO) and Retrieval-Augmented Generation (RAG). We also introduce Instructional Response Space (IRS) for evaluation; on EC number classification, Evolla-10B performs comparably to CLEAN, demonstrating strong functional inference capability.

Related resources: GitHub · Demo

Recording: Bilibili

Direct Link