My research builds protein foundation models and multimodal systems that connect amino-acid sequences, 3D structures, and human-readable functional language—from pretraining and platforms to search, dialogue, and design.
Protein Language Models
Structure-aware pretraining, open platforms, and PLM-driven protein engineering
SaProt
General-purpose PLM with a structure-aware vocabulary—residue tokens fused with Foldseek structure tokens—trained on ~40M sequence–structure pairs.
SaprotHub
No-code platform on Google Colab for training, sharing, and collaborating on protein ML models—democratizing PLM access for biologists.
PLM-guided eTDG
PLM-guided optimization of uracil-N-glycosylase enables programmable T→G/C base editing with few-shot experimental validation.
ESM-Ezy
ESM-1b semantic mining of UniProt discovers multicopper oxidases with superior catalytic and environmental-remediation properties.
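As an illustration of the SaProt entry above, a structure-aware vocabulary can be pictured as pairing each residue with the structure token at the same position. The sketch below is a minimal toy version, not SaProt's actual tokenizer: the function name `fuse_tokens`, the uppercase-residue plus lowercase-structure convention, the `#` placeholder for a missing structure token, and the example strings are all assumptions made for illustration.

```python
from typing import List


def fuse_tokens(residues: str, structure: str, mask: str = "#") -> List[str]:
    """Pair each residue with the structure token at the same position.

    Assumed convention (illustrative only): residue letters are uppercase,
    structure letters are lowercase, and `mask` marks a position whose
    structure token is unknown.
    """
    if len(residues) != len(structure):
        raise ValueError("residue and structure strings must have equal length")

    tokens = []
    for aa, st in zip(residues, structure):
        # Keep the mask symbol unchanged; lowercase real structure letters.
        st_tok = mask if st == mask else st.lower()
        tokens.append(aa.upper() + st_tok)
    return tokens


# Toy input: three residues and a made-up structure string of equal length.
tokens = fuse_tokens("MKV", "dvp")
print(tokens)
```

Under these assumptions each fused token carries both the residue identity and its local structural state, so a vocabulary over such pairs lets a single embedding table cover sequence and structure together.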
Protein–Text Multimodal Intelligence
Aligning sequence, structure, and natural language for search, dialogue, and de novo design
ProTrek
Trimodal contrastive model unifying sequence, structure, and function text for cross-modal protein search at billion-protein scale.
Evolla
Interactive protein-language model that answers natural-language queries over sequence and structure—generative functional discovery beyond static annotation.
Pinal
16B-parameter framework: natural-language instructions → structure generation → sequence design for de novo proteins beyond the PDB.
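To make the cross-modal search idea in the ProTrek entry concrete: once text and proteins are embedded in one shared space, search reduces to ranking stored embeddings by similarity to the query embedding. The sketch below assumes cosine similarity as the score and uses hand-written three-dimensional toy vectors. The names `cosine`, `search`, `index`, and the protein labels are hypothetical, and nothing here reflects ProTrek's actual encoders or indexing.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank indexed embeddings by similarity to the query and keep the top_k."""
    scored = [(name, cosine(query, emb)) for name, emb in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy protein embeddings; in practice these would come from a trained encoder.
index = {
    "protein_a": [1.0, 0.0, 0.0],
    "protein_b": [0.0, 1.0, 0.0],
    "protein_c": [0.7, 0.7, 0.0],
}

# Toy embedding standing in for a natural-language function query.
text_query = [0.9, 0.1, 0.0]

hits = search(text_query, index)
print([name for name, _ in hits])
```

A linear scan like this is only workable for small collections. Search at billion-protein scale would need an approximate nearest-neighbor index, which this sketch does not attempt.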