My research builds protein foundation models and multimodal systems that connect amino-acid sequences, 3D structures, and human-readable functional language—from pretraining and platforms to search, dialogue, and design.

Protein Language Models

Structure-aware pretraining, open platforms, and PLM-driven protein engineering

ICLR 2024

SaProt

General-purpose PLM with a structure-aware vocabulary—residue tokens fused with Foldseek structure tokens—trained on ~40M sequence–structure pairs.

Nature Biotechnology 2025

SaprotHub

No-code platform on Google Colab for training, sharing, and collaborating on protein ML models—democratizing PLM access for biologists.

Molecular Cell 2024

PLM-guided eTDG

PLM-guided optimization of uracil-N-glycosylase enables programmable T→G/C base editing with few-shot experimental validation.

Nature Communications 2025

ESM-Ezy

ESM-1b semantic mining of UniProt discovers multicopper oxidases with superior catalytic and environmental-remediation properties.

Protein–Text Multimodal Intelligence

Aligning sequence, structure, and natural language for search, dialogue, and de novo design

Nature Biotechnology 2025

ProTrek

Trimodal contrastive model unifying sequence, structure, and function text, advancing cross-modal protein search at billion-protein scale.

bioRxiv · under review at Nature

Evolla

Interactive protein-language model that answers natural-language queries over sequence and structure—generative functional discovery beyond static annotation.

bioRxiv · under review at Nature

Pinal

16B-parameter framework: natural-language instructions → structure generation → sequence design for de novo proteins beyond the PDB.