Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Posts

From hypothesis to experiment

22 minute read

Published:

概念层与中间层之后,物理层负责把已确定的 assay 转换成可执行的 procedure、transport 与 resources。本文系统梳理 Procedure、Transport、Transferable Resource、按 I/O 签名分类的 Device 及 Operator 等 Resource 的定义、关系与输出结构。

From hypothesis to experiment

17 minute read

Published:

在概念层的 hypothesis 与 claims 之后,中间层负责把 claim 拆解为可实验判定的 sub-claims,并为每个 sub-claim 匹配能够产生证据的 assay。本文系统梳理中间层的设计原则、对象定义与输出结构。

From hypothesis to experiment

4 minute read

Published:

随着越来越多的蛋白质被计算方法自动标注,社区仍存在明显 gap:功能假设生成很快,验证很慢。本文从概念层的 hypothesis 与 claims 开始,梳理从假设到实验验证的分层框架。

portfolio

publications

SaProt: Protein language modeling with structure-aware vocabulary

Published in International Conference on Learning Representations (ICLR), 2023

Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsupervised training on residue sequences. They have become essential tools for researchers and practitioners in biology. However, a limitation of vanilla PLMs is their lack of explicit consideration for protein structure information, which suggests the potential for further improvement. Motivated by this, we introduce the concept of a “structure-aware vocabulary” that integrates residue tokens with structure tokens. The structure tokens are derived by encoding the 3D structure of proteins using Foldseek. We then propose SaProt, a large-scale general-purpose PLM trained on an extensive dataset comprising approximately 40 million protein sequences and structures. Through extensive evaluation, our SaProt model surpasses well-established and renowned baselines across 10 significant downstream tasks, demonstrating its exceptional capacity and broad applicability.

Recommended citation: J Su, C Han, Y Zhou, J Shan, X Zhou, F Yuan. (2024). "SaProt: Protein language modeling with structure-aware vocabulary." ICLR.
Download Paper | Download Slides

Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing

Published in Molecular Cell, 2024

Current base editors (BEs) use DNA deaminases, including cytidine deaminase in cytidine BE (CBE) or adenine deaminase in adenine BE (ABE), to facilitate transition nucleotide substitutions. Combining CBE or ABE with glycosylase enzymes can induce limited transversion mutations. Nonetheless, a critical demand remains for BEs capable of generating alternative mutation types, such as T>G corrections. In this study, we leveraged pre-trained protein language models to optimize a uracil-N-glycosylase (UNG) variant with altered specificity for thymines (eTDG). Notably, after two rounds of testing fewer than 50 top-ranking variants, more than 50% exhibited over 1.5-fold enhancement in enzymatic activities. When eTDG was fused with nCas9, it induced programmable T-to-S (G/C) substitutions and corrected db/db diabetic mutation in mice (up to 55%). Our findings not only establish orthogonal strategies for developing novel BEs but also demonstrate the capacities of protein language models for optimizing enzymes without extensive task-specific training data.

Recommended citation: Y He, X Zhou, C Chang, G Chen, W Liu, G Li, X Fan, M Sun, C Miao, et al. (2024). "Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing." Molecular Cell 84(7), 1257-1270.
Download Paper

Protocol to use protein language models predicting and following experimental validation of function-enhancing variants of thymine-N-glycosylase

Published in STAR Protocols, 2024

A step-by-step protocol for optimizing thymine-DNA-glycosylase (TDG) using protein language models, including zero-shot enzyme optimization, plasmid construction, transfection, and high-throughput sequencing—streamlining engineering of gene editing tools.

Recommended citation: Y He, X Zhou, F Yuan, X Chang. (2024). "Protocol to use protein language models predicting and following experimental validation of function-enhancing variants of thymine-N-glycosylase." STAR Protocols 5(3), 103188.
Download Paper

Toward De Novo Protein Design from Natural Language

Published in bioRxiv (under review at Nature), 2024

De novo protein design represents a fundamental pursuit in protein engineering, yet current deep learning approaches remain constrained by their narrow design scope. Here we present Pinal, a large-scale frontier framework comprising 16 billion parameters and trained on 1.7 billion protein-text pairs, that bridges natural language understanding with protein design space, translating human design intent into novel protein sequences. Instead of a straightforward end-to-end text-to-sequence generation, Pinal implements a two-stage process: first generating protein structures based on language instructions, then designing sequences conditioned on both the generated structure and the language input. This strategy effectively constrains the search space by operating in the more tractable structural domain. Through comprehensive experiments, we demonstrate that Pinal achieves superior performance compared to existing approaches, including the concurrent work ESM3, while exhibiting robust generalization to novel protein structures beyond the PDB database.

Recommended citation: F Dai, S You, Y Zhu, Y Gao, L Fu, X Zhou, J Su, C Wang, Y Fan, X Ma, et al. (2024). "Toward De Novo Protein Design from Natural Language." bioRxiv (under review at Nature).
Download Paper

Decoding the Molecular Language of Proteins with Evolla

Published in bioRxiv (under review at Nature), 2025

Proteins, nature’s intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological functions - remains a cornerstone challenge in modern biology. Here, we introduce Evolla, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evolla generates precise and contextually nuanced insights into protein function. A key innovation of Evolla lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evolla integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evolla delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins.

Recommended citation: X Zhou, C Han, Y Zhang, H Du, J Tian, J Su, R Liu, K Zhuang, S Jiang, et al. (2025). "Decoding the molecular language of proteins with Evolla." bioRxiv (under review at Nature).
Download Paper

ESM-Ezy: a deep learning strategy for the mining of novel multicopper oxidases with superior properties

Published in Nature Communications, 2025

ESM-Ezy combines the ESM-1b protein language model with semantic space similarity to mine novel multicopper oxidases from UniProt, achieving a 44% success rate in identifying enzymes that outperform query enzymes in catalytic properties, with applications in environmental remediation.

Recommended citation: H Qian, Y Wang, X Zhou, T Gu, H Wang, H Lyu, Z Li, X Li, H Zhou, C Guo, et al. (2025). "ESM-Ezy: a deep learning strategy for the mining of novel multicopper oxidases with superior properties." Nature Communications 16, 3274.
Download Paper

PRIME: A Multi-Agent Environment for Orchestrating Dynamic Computational Workflows in Protein Engineering

Published in bioRxiv, 2025

PRIME is an autonomous multi-agent system that interprets high-level protein engineering objectives and constructs custom computational pathways from a library of 65+ validated protein tools, reducing hallucination by grounding each step in verifiable tool execution.

Recommended citation: Y Zhou, J Su, J Zhang, W Hu, T Tao, G Li, X Zhou, L Fan, F Yuan. (2025). "PRIME: A Multi-Agent Environment for Orchestrating Dynamic Computational Workflows in Protein Engineering." bioRxiv.
Download Paper

Democratizing protein language model training, sharing and collaboration

Published in Nature Biotechnology, 2025

SaprotHub offers an intuitive platform that facilitates training, prediction, storage, and sharing of protein language models. Built on Google Colab, the ColabSaprot framework powers hundreds of protein applications, enabling researchers without deep ML expertise to collaboratively build and share customized models.

Recommended citation: J Su, Z Li, T Tao, C Han, Y He, F Dai, Q Yuan, Y Gao, T Si, X Zhang, X Zhou, et al. (2025). "Democratizing protein language model training, sharing and collaboration." Nature Biotechnology.
Download Paper

A trimodal protein language model enables advanced protein searches

Published in Nature Biotechnology, 2025

ProTrek unifies protein sequence, structure, and natural language function in a trimodal language model through contrastive learning, enabling comprehensive searches between any two modalities. ProTrek surpasses alignment tools such as Foldseek and MMseqs2 in speed and accuracy for identifying functionally related proteins. The ProTrek server (search-protrek.com) provides precomputed embeddings for over 5 billion proteins.

Recommended citation: J Su, Y He, S You, S Jiang, X Zhou, X Zhang, Y Wang, et al. (2025). "A trimodal protein language model enables advanced protein searches." Nature Biotechnology.
Download Paper

talks

Introduction to Protein Language Models

Published:

Research seminar at Bournemouth University introducing how large language models are applied to proteins and their role in computational biology. 伯恩茅斯大学 NCCA 研讨会:蛋白质大语言模型原理与生物学应用入门。

Evolla: Decoding Protein Molecular Language Permalink

Published:

Invited talk for the AI4Protein community on Evolla, a protein-to-text multimodal LLM trained on 546M protein Q&A pairs. AI4Protein 社区邀请报告:Evolla 蛋白质多模态大语言模型与功能推理。

teaching