How to Turn Biology into a Language

In March 1960, J.C.R. Licklider published “Man-Computer Symbiosis,” a paper that would prove to be both foundational and groundbreaking for computer science. In it, he distinguished symbiosis from the “mechanically extended man,” a much more familiar paradigm at the time. In a future symbiosis, he writes, “Men [sic] will set the goals, formulate the hypotheses, determine the criteria, and perform the evaluations. Computing machines will do the routinizable work that must be done to prepare the way for insights and decisions in technical and scientific thinking.” This relationship, he predicted, would run laps around human-only efforts.

It didn’t take long for the computer revolution to ripple into biology and chemistry. Five years after Licklider’s paper, scientists at Stanford created Dendral, the first so-called expert system meant to routinize organic chemistry. Ten years after that, researchers created MYCIN, software to identify bacteria causing severe blood-borne infections.

Artificial intelligence has matured in parallel with biology for decades. If you ask Ali Madani and Viswa Colluru, the two fields together will transform how we discover drugs and treat disease.

Madani founded Profluent, a company that uses machine learning to understand and develop new functional proteins; Colluru founded Enveda Biosciences, which uses AI to find new drugs within “nature’s chemistry” of undiscovered metabolites.

Madani was a senior research scientist at Salesforce when he first experimented with large language models (predecessors to software like OpenAI’s GPT-3.5 and 4, which power ChatGPT), and started to see how computers were becoming increasingly fluent in human languages. He wondered what other data one could feed these systems.

“We can’t just read amino acids strung together,” he says. AI could. Earlier this year, his team reported “writing” proteins with AI in the journal Nature Biotechnology. “Over 50% of them were functional. And the functionality rivaled that of industry standard proteins that everyone uses, that have had millions of years of evolution to evolve and become really great,” he says. “That was the point for me where I thought there’s something powerful here.”

If Madani’s LLM speaks protein biology, Colluru’s speaks organic chemistry. “I grew up in India, steeped in the culture where alternative medicine is not alternative, it’s just medicine. As a kid I had many bouts of nausea, headaches, and other symptoms cured by some plant that my grandmother gave me,” he says. Enveda’s algorithms translate the output from mass spectrometers into 2D chemical structure in order to discover the therapeutic molecules in plants.

Madani and Colluru recently spoke to Grow about their work, and about how biologists should think about the AI revolution in drug development. “The fundamental problem in drug discovery is that things that work in the lab don’t work in people,” Colluru says. “If you take the most naive, childlike approach to that problem statement, you can ask: Well, are there things that do work in people that we’ve not paid attention to?”

The two founders discuss innovation, leadership, and challenges, both technical and ethical. Madani and Colluru believe LLMs will catapult our scientific reach to unprecedented heights. We may only be beginning to see this play out, like echoes of Licklider’s man-computer symbiosis.

In that 1960 paper, Licklider refers to a 15-year window before a sort of machine supremacy. “There will nevertheless be a fairly long interim during which the main intellectual advances will be made by men and computers working together in intimate association,” he wrote. “The 15 may be 10 or 500, but those years should be intellectually the most creative and exciting in the history of mankind.”

So much of how we actually understand biology stems from the metaphor of DNA as a language or a code—a genome is the “book” of life, cells perform “information processing” to “express” genes. At the same time, computer science pulls metaphors from biology—neural networks, learning, intelligence. What metaphors help you understand the work you do?

Colluru: I work in chemistry and use biology as an analogy. We know less than 1% of the chemical structures that exist in nature and their functions, and when we started this work, we didn’t have the technology to fully characterize them. While it’s become easier and easier to sequence the genetic code of an organism at the scale of whole genomes, the only way that we knew how to “sequence” the chemical code of any organism was to do it one “base pair” at a time: isolate a single compound, do an NMR and some low-throughput biology experiments, isolate a single compound again and do it all over again. In other words, you had to actually put each compound in a different test tube and test what its structure and function are.

When we do things at the “genome” scale for the chemistry of a cell, we can use mass spectrometry to look at the whole metabolome: every chemical in the cell. Mass spectrometry is a fancy way of taking a molecule and shattering it to capture the mass and charge of the fragments that it breaks into. Every molecule has a unique signature of those fragments. In terms of analogies, that signature of masses and ionic charges of the fragments represents a chemical molecule much like words arranged to form a sentence represent meaning in a human language. Just as context matters for grammar and meaning in language, it matters in mass spectrometry as well, for understanding how those fragments translate back to the structures that had to be painstakingly decoded.

Madani: I’m not a biologist or chemist by training (I’m actually a machine learning scientist), but I’ve always been fond of the complexity that biology has, and the rich tapestry that nature has provided us. I think of the information within functional protein sequences, and the order and structure of the amino acids, as “letters.” You really need sophisticated deep learning techniques to be able to capture the full scale of richness that the information provides us, information we’ve gotten through evolution.

I actually thought it would be just that initial analogy that inspired me to look at these complex data sets. But we ended up publishing multiple papers here on this topic of BERTology—where we look into the inner workings of these natural language processing models and uncover that the models are specifically learning grammatical structure. They’re learning to reference these concepts that we know to be true within the latent structure of the English language. Similarly, when we look at these protein language models, we also uncover fundamental biophysical principles as well—a grammar for proteins. So the analogy is not just inspiration.

Just like the previous models trained on human language and learned to actually write English, the question now is how do we actually steer biology? How do we actually go beyond discovery and move into design? How do we actually design the protein sequences that are going to be functional in the real world?

You both mention grammar and context. Why are those so important to capture?

Colluru: Context is so important for understanding human language. If I give you the sentence:

The animal didn’t cross the street because it was tired.

I have no idea what the word “it” means unless I read the word tired. And if I change the last word to crowded, now the word “it” means street—the animal didn’t cross the street, because it was crowded. And if I change the last word to raining, now the word “it” means weather.

It turns out that this context dependence is very similar to how molecules break apart in a mass spectrometer—the same peak in a spectrum means a different chemical structure depending on the peaks that surround it. So just like you can train an algorithm to learn English grammar and predict missing words in a sentence complete with context, or learn protein “grammar” and predict missing amino acids in a protein sequence, you can also train it to predict missing peaks in a mass spectrometry output and learn chemical grammar.
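To make the analogy concrete, here is a minimal Python sketch of how a spectrum could be turned into a “sentence” of peak tokens with one peak hidden for a model to guess. It is an illustration only, not Enveda’s pipeline: the function names, the rounding to one decimal place, the `[MASK]` token, and the peak values are all assumptions made up for this example.

```python
import random

def spectrum_to_tokens(peaks):
    """Turn (m/z, intensity) peaks into discrete 'words', ordered by m/z.

    Rounding m/z to one decimal place is an arbitrary choice for this sketch.
    """
    ordered = sorted(peaks, key=lambda peak: peak[0])
    return [f"mz_{mz:.1f}" for mz, _intensity in ordered]

def mask_one_peak(tokens, rng):
    """Hide one randomly chosen peak; return the masked 'sentence' and the answer."""
    idx = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[idx]
    masked[idx] = "[MASK]"
    return masked, target

# Made-up peaks (m/z, relative intensity); not real measurements.
toy_peaks = [(147.044, 0.35), (91.054, 1.00), (165.055, 0.60), (119.049, 0.20)]

tokens = spectrum_to_tokens(toy_peaks)
masked, target = mask_one_peak(tokens, random.Random(0))
print(tokens)   # the spectrum as a "sentence"
print(masked)   # the same sentence with one peak hidden
print(target)   # the peak a model would be trained to predict from context
```

No labels are needed to build such training pairs: the hidden peak itself is the answer, which is what makes this kind of training self-supervised.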

Ali [Madani] is using that kind of learning to create “grammatically correct” proteins that don’t yet exist. We’re doing that slightly differently. We want to translate the mass spectrum into a representation of chemical structure that humans can understand. We’re using LLMs to build the Google Translate for chemistry.

You spoke about teaching an AI model how to write proteins and how to translate chemical structures: can you explain a bit more on how they learn? What does it actually look like to teach a computer how to speak a language?

Madani: What really enabled these advances is unsupervised learning and self supervision. How do we learn as babies? I have a one-and-a-half year old, and she’s learning how to speak. We don’t learn by going to school, necessarily, and learning all the grammatical rules that enable us to speak—we learn through examples. We hear our mothers and fathers, our caregivers. We hear people all around the world speaking these languages and then we uncover underlying associations and patterns.

Colluru: These models are effectively a next word prediction algorithm. They get trained by taking hundreds of millions of sentences — in some cases, billions of sentences — and masking a word. So, using the same example:

The animal didn’t cross the street because _____ was crowded.

And the model is trained to fill in all grammatically correct words like “it.” And the beauty of this is that it doesn’t require a labeled dataset, where you know what the answer is—you only need hundreds of millions of sentences that are grammatically correct.

So now you have a model that just knows what words occur next to other words. So “bank” occurs next to “money” and next to “river” but never next to “Viswa.” It just learns that these words co-occur. And its ability to predict the word based on not just the word before it, but based on lots of words before it is what makes it context-aware.
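The two ideas above, masking a word and counting which words appear near each other, can be sketched in a few lines of Python. This is a toy illustration and not how production language models are built (they learn neural representations rather than raw counts); the function names, the two-word window, and the toy corpus are assumptions for the example.

```python
from collections import Counter

def masked_examples(sentence):
    """Yield (sentence with one word blanked out, hidden word) pairs.

    The sentence supplies its own answers, so no labeled dataset is needed.
    """
    words = sentence.split()
    for i, word in enumerate(words):
        blanked = words[:i] + ["_____"] + words[i + 1:]
        yield " ".join(blanked), word

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` words of each other."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for other in words[i + 1:i + 1 + window]:
                counts[tuple(sorted((word, other)))] += 1
    return counts

examples = list(masked_examples(
    "The animal didn't cross the street because it was crowded."
))
print(examples[7])  # the example in which the word "it" is hidden

# Toy corpus, invented for this sketch.
corpus = ["the bank holds money", "the river bank flooded"]
counts = cooccurrence_counts(corpus)
print(counts[("bank", "money")], counts[("bank", "river")], counts[("bank", "viswa")])
```

In this toy corpus “bank” co-occurs with both “money” and “river” but never with “viswa,” mirroring the example in the interview. Looking at a wider window of surrounding words is a crude stand-in for the context-awareness described above.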

Madani: What other researchers in the field quickly realized is that for a given task, we don’t weigh all of the information equally. This is the concept of attention; we don’t pay attention equally to everything we encounter.

Selective “attention” and context matter. And that’s a powerful way to learn effectively. That mechanism enabled the rise of the transformer.
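For readers who want to see what “not weighing all information equally” looks like in practice, here is a minimal sketch of scaled dot-product attention, the standard building block of transformers, written in plain Python. It is a single attention step without the learned projections or multiple heads that real models use, and the vectors are made-up numbers chosen only to show the behavior.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Each key is scored against the query, the scores become weights,
    and the output is the weighted average of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Made-up vectors: the first key points the same way as the query,
# the second is unrelated, the third points the opposite way.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

weights, output = attention(query, keys, values)
print(weights)  # largest weight goes to the key most similar to the query
print(output)   # a blend of the values, dominated by the first one
```

The weights are the “attention”: the item most relevant to the query contributes most to the result, while less relevant items are still heard, only more quietly.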

And this is not just in natural language processing. It’s also in computer vision. In terms of what we are able to identify as objects, these cars all kind of look the same. These chairs have specific features, for example. Then whenever someone actually points at one and says “car,” I’m able to extrapolate, because I’ve already uncovered some associations of all the data that was presented in front of me previously. I don’t need to go to school and see those samples over and over and over again to uncover what makes something a car. I’ve already kind of uncovered that in an unsupervised and self supervised manner, and that’s the scalable way in which we can learn really powerful representations of data.

So this unsupervised learning without labeled data is the foundation that drove this whole field forward.

Models like GPT are trained on these huge libraries of text that humans generated–stuff that people wrote. Humans didn’t write the data that your models are trained on, but you must curate and shape that data to work for your model—what does that process look like?

Madani: If I train a massive model on just YouTube comments, I’m going to get a terrible model. You want to curate it effectively. There is a human aspect to building these AI systems.

A former colleague of mine referred me once to a really great essay by Peter Norvig, [Alon Halevy, and Fernando Pereira at Google] called “The Unreasonable Effectiveness of Data.” The TLDR is that to effectively use these powerful algorithms, we should really be following the information-rich data sources that exist. And when I was thinking about proteins, I knew that with the dramatic reduction of DNA sequencing costs, we have orders of magnitude more protein sequences that we’ve collected and observed, and it’s all just waiting to be learned. There’s this pile of data that’s just sitting there for us to really capture and uncover these underlying principles.

And similar to grammatical structure, with protein sequences we can learn co-evolutionary information, like how residues are in contact with each other, and predict binding sites. There are all of these underlying biophysical principles that decide, like, why an amino acid will be next to another amino acid, or the allosteric interactions that exist: for example, how a change in one amino acid would influence the greater conformational states of the protein as a whole.
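One simple, classical way to see a co-evolutionary signal is to measure the mutual information between two positions in a set of aligned sequences: positions that change together score high, positions that vary independently score near zero. The sketch below illustrates that statistic only. It is not how Profluent’s models work (language models pick up such signals implicitly), and the four short “sequences” are invented for the example.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of aligned sequences."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_pair = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_pair.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Invented toy alignment: position 0 (A/G) always changes together with
# position 2 (L/V), while position 1 (K/R) varies independently of position 0.
toy_alignment = ["AKLE", "ARLD", "GKVE", "GRVD"]

print(mutual_information(toy_alignment, 0, 2))  # covarying positions
print(mutual_information(toy_alignment, 0, 1))  # independent positions
```

In real proteins, pairs of positions that covary like this are often the ones in physical contact, which is one reason sequence data alone can hint at structure.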

So the sequence datasets are just sitting there. It’s noisy. It’s difficult to extract. There’s a lot of bioinformatics work and heavy lifting that’s required to curate it. And it may be biased—a lot of genomic databases may be biased toward bacteria. Whether you’re operating within programming languages like code, or natural languages where you’re just scraping the internet, or proteins, or chemistry or otherwise, you always have to be thinking about how you’re curating that data.

Colluru: Unlike tools for DNA sequences, there weren’t even databases for mass spec in the cloud. People used to just treat mass spec as lines on a PDF graph and try to figure out what it was. So we had to take one step backward. We basically had to build the world’s first cloud-scale databases. Pieter Dorrestein, our scientific co-founder at UC San Diego, built a social network for natural products that he calls GNPS, Global Natural Products Social Molecular Networking, which is the world’s largest repository for raw mass spec data. People can throw these chemical sequences or fragment sequences on the internet, and actually see what they’re related to. So this was the first store of data that we could create.

Now, what makes LLMs work? You need a lot of “sentences.” If you want to translate mass spec to chemical structure, then you need some examples in both languages that someone has already translated.

What does this look like in chemistry? On top is a mass spectrum, the pattern in which a molecule breaks apart in a mass spectrometer. And the bottom is essentially the meaning, or the molecule that it refers to. Animation courtesy of Enveda.

The beauty of this is that—just like in a sentence where one word can change the meaning of the other words and the meaning of the sentence—if you add a fragment or set of fragments, it changes the meaning of all the other fragments around it, including the meaning of the set itself. You’re not just making a bigger molecule. As you add more fragments, you’re changing the type of molecule.

Our training data is effectively hundreds of millions of fragments (the sentences of chemistry), hundreds of millions of SMILES strings, and about 50,000 matched pairs between the two in the form of about 1 million spectra.

Where ChatGPT can expertly translate an English poem into Hindi, what Enveda is doing translates the mass spectrum into a language that human chemists can understand.

On top of that curation of data to train the foundation model, you also need labeled data around particular applications when you want to ask certain types of questions. What sorts of application data are you looking at?

Colluru: We use labeled data to learn and prioritize the “drug-like” molecules within a sample. The problem historically has been that you take a sample, you do painstaking isolation, you do the NMR experiment that sequences one unit of the chemical code, you find a molecule, and 99% of the time, you don’t like what you’ve found from a medicinal chemistry perspective. There’s lots of needles in nature’s haystack. But the haystack is huge.

So think of this as a magnet. If we can actually look chemically at thousands of molecules at once and pinpoint the ones that are drug-like, our chemists can focus all their efforts on actually finding that and working on that one molecule to turn it into a medicine. Our approach allows us to shorten that drug discovery timeline by working on the right molecule in terms of chemical structure and properties. It took about 170 years to go from the willow bark extract to acetylsalicylic acid, which is aspirin. If we knew to look for salicylic acid instantly, we’d probably do that in a week.

Madani: It comes down to the data that we extract from the wet lab. So we test a sequence. And we have a property that we measure, whether that’s a binding affinity, a catalytic efficiency, or a thermostability value. Inevitably, that’s going to pale in comparison to the billions of protein sequences that we train our models on. Our unlabeled models were trained on 5 trillion tokens; as a comparison, GPT-3 was trained on 500 billion. The scale of data that we have, and the diversity of data that we train our models with, is mind-bogglingly large, which is a credit to nature. That will always outcompete what is possible for labeled data. But that’s the pressing challenge that we’d like to continue to push on: being able to use as little labeled data as possible to really drive home results.

This issue of Grow is about scale, a topic that is obviously very important in developing these models since they both require a massive scale of data for training, and they’ve suddenly made it possible to perform many tasks at a bigger scale. What limitations to scale do you think about today?

Madani: We are very excited to go beyond the haystack that nature has provided us. We’re creating novel haystacks and novel needles within the haystack. Nature has really great samples that we can learn from. We can interpolate between those samples, but really we’re trying to extrapolate beyond, for functions that are further and further away. The further away you extrapolate, the more challenging it becomes.

The biggest challenge and need for scale now is having a tight integration with the wet lab. It’s not just a practical consideration of ‘we want more data from the wet lab.’ You need good alignment on the team in order to get the model integration that you’re after. A lot of time can be wasted as a startup, when there’s a lack of alignment. As a founder, you can never spend too much time on getting people aligned on what the mission is, what the goal is, and being able to speak the same language.

Something I’m thinking about very deeply, as we also scale the team, is how every person we bring to the team specifically influences the shape and trajectory of the company, and the impact that we can have. The partners we work with shape the identity of our mission as well. There’s only so much we can do by ourselves, and finding the right partners, ones that are as ambitious and that are excited about really leveraging these novel tools and capabilities to tackle the most pressing problems, is something that’s front and center for us.

Another good analogy—alignment.

Colluru: From Enveda’s perspective, similar to the quality of extrapolation is the quality of translation. If you have “words” that the model never saw, it’s probably going to be bad at translating into those or out of those. If 99.9% of chemistry is unknown, then predicting those molecules is going to be hard.

That brings me to the second point (which is also very similar to Ali’s second point): whatever algorithmic advantage that Enveda or maybe Profluent has will disappear. It will be commoditized. And the people who have the best-quality data, the greatest quantity of data, and the most relevant context data are going to win. And so ultimately, I think we have to use our algorithmic advantage to get a data collection advantage.

When it comes to AI and ethics, that question of ‘what does it mean for the people who historically do the work we’re asking of the AI?’ has been front and center, but there are of course many other questions that emerge: about intentions, unintended consequences, biases in data, and even existential threats, such as a superintelligence that isn’t aligned with humanity and knows how to make killer viruses. What kind of ethical questions do you think about in your work?

Colluru: I’m going to take the complete opposite position here. I think it would be unethical to not pursue AI. Being able to interpret and harness the world better because of technology has only led to better outcomes at the individual level, societal level, and species level.

Smarter is better. And this is a tool to make us smarter, and perhaps at a scale that we didn’t quite imagine with any one linear model technology. And that’s why it’s more scary.

So, how do you prevent bad stuff from happening? One: being aware that there could be unintended consequences, which by definition cannot be neutralized by you. So then that comes down to building in the open. Talking about your work. Being transparent about what the data is, where it comes from, what your models can do or can’t do. Actually having an open discussion about it so that we can go in sort of eyes wide open.

On the flip side, if you tried to stop this technology, there would be segments or actors that won’t stop working on it. And then you end up with an asymmetry of knowledge and risks.

Madani: It’s very important for us as creators of new algorithms, and as disseminators of novel molecules, to really be thinking deeply about these questions. It’s really important to demystify what other folks are building and what’s happening here—and the implications. As opposed to chat bots that you can release out into the wild on the internet, drug discovery has regulatory bodies in place. So in some ways I feel almost more at ease because of these safeguards, but it doesn’t mean that it’s all solved.

We spoke about language models, learning the language of biology or chemistry. There’s another language that has been emerging—just being able to speak an internal language, interdisciplinary language. This is what really excites me to come to work. We bring folks from so many different backgrounds together. Machine learning scientists, bioinformaticians, cell biologists—each one comes with its own jargon and mode of thinking and really language. What we’re trying to create at Profluent is a shared language across these multiple disciplines so we can push the field forward. It relates to ethics, because instead of kind of clamping down and closing up, it’s about opening up, basically, and being driven by curiosity and good faith.

On X the other day, someone was meditating on the concept of AI superintelligence, basically suggesting that in some sense there is a plateau of intelligence if the models are trained on words that humans have written—you can only learn as much as humans have created if what you’re doing is ingesting language written by humans.

On the other hand, at Ginkgo we say we didn’t invent biology, biology invented us. You’re training models on biological “language” that humans curated but don’t necessarily know how to fully “speak.”

So people today can suggest perhaps AI won’t be as creative or effective as the very best human experts along some dimensions for many types of human work when it comes to speaking English, but perhaps these models will or already have outpaced our ability as scientists to speak biology or chemistry—to steer biology beyond evolution, or to uncover what has evolved but hasn’t been accessible to human scientists? What does that mean for scientists?

Madani: I’m not a proponent of AGI. I don’t feed into the overhyping of AI taking over, because that’s not a productive use of my time, honestly. And I’m just so excited about the application of AI for real tangible value within biology and affecting our lives in a meaningful way.

Some of that is fear associated with “Is it gonna replace my job? Is it gonna replace humans as a whole?” I think it’s about augmentation. Super-powering our workflows to get 10x gains in terms of speed and efficiency or greater.

Colluru: It doesn’t matter whether something is AI-made or not. It really matters what doing something better, faster, bigger, or cheaper enables you to do and why you believe that doing that is the problem to solve in drug discovery.

If you can use AI to create a molecule to target an undruggable protein, your drug is only as good as two conditions: whether you can make it (most of the AI-generated chemistry stuff you can’t actually make in the lab to test) and whether your hypothesis about your target is right.

At Enveda, we use AI to search what we believe is the most powerful chemical library on the planet. We believe that that library will lead to better medicines faster, because it’s spent 4 billion years in convergent evolution: every one of these molecules was made in a cell, by a protein, for a protein. This is effectively life’s chemistry, and perhaps a very rich hunting ground for molecules. That’s why we believe AI will change the success rate of medicines and have an impact, not because LLMs are cool.

The more tools we have, the more maps of biology and chemistry you can build.

What do you mean by maps?

Colluru: We’ve built one large map of biology by sequencing the genome and we’ve sort of forgotten everything else, including biochemistry, as if all of biology is genetics and genetics is biology. But that’s like saying, I’m going to measure the heights of buildings in New York City, and call that New York City.

One map that we’ve completely ignored is the metabolism. Since we’ve all started this conversation, every one of our nearly trillion cells has done a billion chemical reactions to drive the thermodynamics of bond making and bond breaking, to actually use that energy to grow and stay alive. The most provocative visual I can paint for you is you can take a cell and kill it, and in the instant that it has died, it has the same genome, transcriptome, and proteome. What it has stopped doing is the dance of life: millions of reactions pushing chemicals through to other forms and living off the energy.

So I think that the map of metabolite chemistry and metabolism that we will unlock using our tools will lead to one of the most fundamental reimaginings of how we think about biology and chemistry, redefining health and disease. So that is a contrarian opinion that we hold, that this will herald a more successful wave of therapeutics than the genomic era.

I think 10 years from now, we’ll look back at what’s happening today and feel that we grossly underestimated how our life and our world would change.

AI is helping us create languages out of proteins and other molecules. Grow’s Christina Agapakis speaks with two founders about how that will shape the medicine of the future.
