
Proteins, the natural molecules that carry out key cellular functions within the body, are the building blocks of all diseases. Characterizing proteins can reveal the mechanisms of a disease, including ways to slow or potentially reverse it, while creating proteins may lead to entirely new classes of drugs and therapies.
But the current process for designing proteins in the laboratory is expensive, in terms of both computation and human resources. It involves devising a protein structure that can plausibly perform a specific task within the body and then finding a protein sequence (the sequence of amino acids that make up a protein) that is likely to “fold” into that structure. (Proteins must fold correctly into three-dimensional shapes to carry out their intended function.)
It doesn’t necessarily have to be that complicated.
This week, Microsoft introduced a general-purpose framework, EvoDiff, which the company says can generate “high-fidelity,” “diverse” proteins given a protein sequence. Unlike other protein-generating frameworks, EvoDiff does not require any structural information about the target protein, eliminating what is often the most labor-intensive step.
Available open source, EvoDiff could be used to create enzymes for new therapies and drug delivery methods, as well as new enzymes for industrial chemical reactions, says Microsoft principal investigator Kevin Yang.
“We envision EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design,” Yang, one of EvoDiff’s co-creators, told TechCrunch in an email interview. “With EvoDiff, we are showing that we may not actually need structure, but rather ‘the protein sequence is all we need’ to design new proteins in a controllable way.”
The core of the EvoDiff framework is a 640-million-parameter model trained on data from all the different species and functional classes of proteins. (“Parameters” are the parts of an AI model learned from training data and essentially define the model’s skill on a problem, in this case generating proteins.) The training data came from the OpenFold dataset for sequence alignments and from UniRef50, a subset of data from UniProt, the database of protein sequences and functional information maintained by the UniProt consortium.
EvoDiff is a diffusion model, similar in architecture to many modern image-generating models such as Stable Diffusion and DALL-E 2. It learns how to gradually subtract noise from a starting protein made almost entirely of noise, moving it step by step closer to a protein sequence.
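To make the step-by-step idea concrete, here is a minimal toy sketch in Python. It is not EvoDiff's code or API. It assumes that "noise" is represented as masked positions and that the sequence is denoised by filling in one position per step, which is only one simple way to denoise a discrete sequence. The names `toy_denoiser` and `generate` are hypothetical, and the "model" here just picks amino acids at random where a trained network would make a learned prediction.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
MASK = "#"  # marks a position that is still pure "noise" (an assumption of this sketch)


def toy_denoiser(sequence, position, rng):
    """Hypothetical stand-in for a trained model.

    A real model would look at the partially denoised sequence and predict a
    likely residue for `position`. This toy version ignores the context and
    picks a residue uniformly at random.
    """
    return rng.choice(AMINO_ACIDS)


def generate(length, seed=0):
    """Start from an all-noise sequence and denoise it one position per step."""
    rng = random.Random(seed)
    sequence = [MASK] * length      # fully noised starting point
    order = list(range(length))
    rng.shuffle(order)              # visit positions in a random order
    for position in order:
        sequence[position] = toy_denoiser(sequence, position, rng)
    return "".join(sequence)


if __name__ == "__main__":
    print(generate(30))
```

Swapping the random choice for a trained network's prediction is what would turn a loop like this from random strings into plausible protein sequences.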

The process by which EvoDiff generates proteins.
Diffusion models have increasingly been applied to domains outside of image generation, from designing novel proteins, as EvoDiff does, to creating music and even synthesizing speech.
“If there is one thing to take away [from EvoDiff], I think it would be this idea that we can (and should) generate proteins over sequence because of the generality, scale and modularity that we can achieve,” Ava Amini, a Microsoft principal investigator and another EvoDiff contributor, said via email. “Our diffusion framework gives us the ability to do that and also control how we design these proteins to meet specific functional goals.”
As Amini points out, EvoDiff can not only create new proteins but also fill the “gaps” in an existing protein design, so to speak. Given a part of a protein that binds to another protein, for example, the model can generate an amino acid sequence around that part that meets a set of criteria.
Because EvoDiff designs proteins in “sequence space” rather than in terms of protein structure, it can also synthesize “disordered proteins” that do not end up folding into a final three-dimensional structure. Like normally functioning proteins, disordered proteins play important roles in biology and disease, such as enhancing or decreasing the activity of other proteins.
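The gap-filling idea can be illustrated with a small Python sketch. Again, this is a toy under stated assumptions and not EvoDiff's implementation: the fixed part of the protein is kept untouched, every other position starts out masked, and a hypothetical `toy_denoiser` (random here, learned in a real system) fills in the rest. The function name `scaffold_motif` and the example motif are made up for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
MASK = "#"  # marks a position that still has to be generated (an assumption of this sketch)


def toy_denoiser(sequence, position, rng):
    """Hypothetical stand-in for a trained model; picks a residue at random."""
    return rng.choice(AMINO_ACIDS)


def scaffold_motif(motif, start, total_length, seed=0):
    """Keep `motif` fixed at `start` and generate the residues around it."""
    if start < 0 or start + len(motif) > total_length:
        raise ValueError("motif does not fit inside the requested length")
    rng = random.Random(seed)
    sequence = [MASK] * total_length
    sequence[start:start + len(motif)] = list(motif)  # the fixed part is never changed
    open_positions = [i for i, residue in enumerate(sequence) if residue == MASK]
    rng.shuffle(open_positions)
    for position in open_positions:                   # only the gaps get filled
        sequence[position] = toy_denoiser(sequence, position, rng)
    return "".join(sequence)


if __name__ == "__main__":
    # "HEAGH" is an arbitrary example motif, not taken from the article.
    print(scaffold_motif("HEAGH", start=10, total_length=40))
```

This sketch does not check the generated sequence against any design criteria; in a real workflow that filtering or conditioning would be an additional step.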
Now, it’s worth noting that the research behind EvoDiff has not been peer-reviewed, at least not yet. Sarah Alamdari, a Microsoft data scientist who contributed to the project, admits that there is “a lot more scaling work” to be done before the framework can be used commercially.
“This is only a 640 million parameter model, and we may see better generation quality if we scale up to billions of parameters,” Alamdari said by email. “While we demonstrate some general strategies, for even more granular control, we would like to condition EvoDiff on text, chemical information, or other ways to specify the desired function.”
As a next step, the EvoDiff team plans to test the model-generated proteins in the lab to determine whether they are viable. If they are, the team will start working on the next generation of the framework.