Joshua Price
Brigham Young University
Talk Title
Protein Redesign Using Fine-Tuned ProtBert Masked Language Models
Presentation Time
SESSION 13: NEW FRONTIERS IN COMPUTATIONAL PEPTIDE DESIGN PART 2
Thursday, June 29, 2023, at 10:50 am - 11:10 am
Machine learning language models have mastered human language; models like GPT-2, GPT-3, and ChatGPT can write sophisticated, realistic responses to simple prompts. Other models, like BERT, can classify text in ways that are useful for computational and machine-learning work.
Based on the hypothesis that a sequence of amino acids (AAs) can be interpreted like written language, specialized models have been developed for protein-specific tasks. For example, the ProtBERT masked language model (MLM) was trained on protein sequences in the UniProt and BFD databases for "fill-in-the-blank" tasks. Given an input AA sequence, ProtBERT will deterministically replace one AA with one of the other canonical AAs, based on features extracted from its training data and on context clues from the surrounding sequence. Its current strength is the ad lib "hallucination" of new protein sequences, which are predicted to share many features of known proteins and to explore new areas of sequence space. However, ProtBERT is not currently able to generate a protein with a specified secondary/tertiary structure or function.
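The "fill-in-the-blank" step can be illustrated with a minimal Python sketch. It only shows how one position might be masked and how a replacement might be chosen from per-residue scores; it does not load or call ProtBERT. The space-separated residue format and the `[MASK]` token follow the convention of BERT-style protein tokenizers, while the function names and the score values are hypothetical placeholders standing in for real model output.

```python
from __future__ import annotations

CANONICAL_AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def mask_position(sequence: str, index: int, mask_token: str = "[MASK]") -> str:
    """Return the sequence as space-separated residues with one position masked."""
    if not 0 <= index < len(sequence):
        raise IndexError("index outside sequence")
    tokens = list(sequence.upper())
    tokens[index] = mask_token
    return " ".join(tokens)


def pick_replacement(scores: dict[str, float], original: str) -> str:
    """Deterministically pick the best-scoring canonical AA other than the original."""
    candidates = {
        aa: score
        for aa, score in scores.items()
        if aa in CANONICAL_AAS and aa != original
    }
    if not candidates:
        raise ValueError("no candidate residues to choose from")
    # Highest score wins; ties are broken alphabetically so the result is reproducible.
    return min(candidates, key=lambda aa: (-candidates[aa], aa))


# Hypothetical scores for the masked position (assumption: a real model would
# supply a score for every token in its vocabulary).
toy_scores = {"A": 0.10, "L": 0.55, "V": 0.20, "K": 0.15}

masked = mask_position("MKTLLV", 3)
replacement = pick_replacement(toy_scores, original="L")
print(masked)
print(replacement)
```

In this sketch the original residue is excluded before the best-scoring candidate is taken, which is one simple way to guarantee that the position actually changes.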
We have fine-tuned ProtBERT by training it on selected subsections of the PDB that share a characteristic secondary structure. The resulting specialized models, αProtBERT and βProtBERT, reliably generate α-helices and β-sheets, respectively. They should be useful for redesigning selected regions of a protein while accounting for specific context clues within the surrounding constant sequences. This skill would be essential for redesigning proteins that have active sites, binding interfaces, or consensus sequences that a user does not wish to disturb.
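One way to picture region-restricted redesign is a loop that rewrites only the residues inside a chosen window and leaves the flanking sequence untouched. The Python sketch below is an illustration under stated assumptions, not the procedure used in this work: `redesign_region` and `toy_helix_predictor` are hypothetical names, and the toy predictor simply cycles through residues commonly regarded as helix-favoring, where a fine-tuned model would instead score each masked position in context.

```python
from typing import Callable


def redesign_region(
    sequence: str,
    start: int,
    end: int,
    predict: Callable[[str, int], str],
) -> str:
    """Replace residues in [start, end) one at a time; flanking residues stay constant."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("region outside sequence")
    residues = list(sequence)
    for i in range(start, end):
        # The predictor sees the current full sequence, so each choice can
        # depend on the constant flanks and on earlier replacements.
        residues[i] = predict("".join(residues), i)
    return "".join(residues)


def toy_helix_predictor(sequence: str, index: int) -> str:
    """Hypothetical stand-in for a fine-tuned model: cycles through A, E, L, K."""
    helix_formers = "AELK"
    return helix_formers[index % len(helix_formers)]


original = "GSPGNGTPGSHHHH"
redesigned = redesign_region(original, 3, 9, toy_helix_predictor)
print(redesigned)
```

Passing the predictor in as a callable keeps the loop independent of any particular model, so the toy function could be swapped for a call into a real masked language model.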
The interests of the Price Research Lab include bioorganic chemistry, protein folding and structure, the energetic and structural effects of protein glycosylation, and the design of novel proteins using unnatural amino acids.