Cell2Sentence: Understanding Single-Cell Biology With LLMs


C2S-Scale

Imagine being able to ask a cell about its condition, its activities, or its potential reaction to a medication, and getting a response in plain English. Cell2Sentence-Scale (C2S-Scale), a family of open-source large language models trained to interpret and understand biology at the single-cell level, is being introduced today in collaboration with Yale University.

By bridging the gap between biology and artificial intelligence, C2S-Scale transforms intricate cellular data into easily understood “cell sentences.” Researchers can then ask questions about particular cells, such as “Is this cell cancerous?” or “How will this cell respond to Drug X?”, and get concise, biologically informed responses in natural language.

C2S-Scale may be able to:

  • Accelerate the discovery and development of new drugs
  • Personalize treatment to improve outcomes
  • Democratize science through open-source release
  • Help researchers better understand, prevent, and treat illness

C2S-Scale’s thorough investigation of the best ways to represent cells and biological information as text opens interesting new applications for language-driven single-cell analysis with large language models.

Single-cell RNA sequencing

Trillions of cells make up each human, and each one has a specific purpose, such as building organs, fighting infections, or transporting oxygen. No two cells are exactly alike, even within the same tissue. Single-cell RNA sequencing (scRNA-seq) measures gene expression to determine what each cell is doing.

However, single-cell data are enormous, complex, and difficult to evaluate. Each cell’s gene expression measurements are represented by thousands of numbers and are typically analyzed with specialized tools and models. As a result, single-cell analysis is slow, hard to scale, and accessible mainly to experienced users.
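To make this concrete, here is a minimal sketch of what scRNA-seq data looks like in practice. It uses the scanpy library and one of its bundled public datasets, which are standard tools in the field but not part of C2S-Scale itself:

```python
# A quick look at a single-cell expression matrix with scanpy
# (pip install scanpy). The dataset is a public 10x Genomics run
# of ~2,700 peripheral blood mononuclear cells.
import scanpy as sc

adata = sc.datasets.pbmc3k()

print(adata.shape)          # (cells, genes), roughly (2700, 32738)
print(adata.var_names[:5])  # gene identifiers labeling the columns

# Each row is one cell: a vector of tens of thousands of expression
# counts, most of which are zero.
print(adata.X[0].shape)
```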

Imagine being able to translate those thousands of numbers into language that both language models and people can understand. In other words, what if you could find out in plain English how a cell is doing, what it is working on, or how it might react to a medication or disease? Understanding biological systems at this scale, from individual cells to vast tissues, could revolutionize the way we research, identify, and treat illness.

Google is thrilled to present Cell2Sentence-Scale (C2S-Scale), a collection of robust, open-source large language models (LLMs) designed to “read” and “write” biological data at the single-cell level, in the paper “Scaling Large Language Models for Next-Generation Single-Cell Analysis.” This piece covers the fundamentals of single-cell biology, how cells are converted into word sequences, and how C2S-Scale creates new avenues for biological research.

From cells to sentences

C2S-Scale converts each cell’s gene expression profile into a text string known as a “cell sentence”: a list of the most active genes in that cell, ordered by their expression level. This makes scRNA-seq data amenable to natural language models such as Google’s Gemini and Gemma models.

By using language as the interface, this approach makes single-cell data more accessible, interpretable, and flexible. Moreover, a great deal of biology is already expressed in text, such as gene names, cell types, and experimental metadata, so LLMs are well suited to process and understand this data.

C2S-Scale organizes gene names by expression and creates natural language “cell sentences”
Image credit to Google
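
The conversion itself is straightforward to sketch. The snippet below is a minimal illustration of the idea rather than the exact C2S-Scale pipeline (which also involves normalization and other preprocessing); it simply ranks genes by expression and joins their names:

```python
import numpy as np

def to_cell_sentence(expression: np.ndarray, gene_names: list[str], top_k: int = 100) -> str:
    """Rank genes by expression (highest first) and join their names.

    A minimal sketch of the cell-sentence idea; the real pipeline
    includes normalization and other preprocessing steps.
    """
    order = np.argsort(expression)[::-1]  # most-expressed genes first
    expressed = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(expressed[:top_k])

# Toy example with made-up counts for five genes
genes = ["CD3E", "MS4A1", "NKG7", "LYZ", "GNLY"]
counts = np.array([12.0, 0.0, 3.5, 30.0, 1.2])
print(to_cell_sentence(counts, genes))  # -> "LYZ CD3E NKG7 GNLY"
```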

Meet the C2S-Scale model family

C2S-Scale builds on Google’s Gemma family of open models and adapts it for biological reasoning through data engineering and carefully crafted prompts that incorporate cell sentences, metadata, and other pertinent biological context. Because the underlying LLM architecture is unchanged, C2S-Scale can take full advantage of the rich ecosystem, scalability, and infrastructure developed around general-purpose language models. The end product is a suite of LLMs trained on more than 1 billion tokens from scholarly literature, biological metadata, and real-world transcriptome datasets.

The C2S-Scale family of models, which spans from 410 million to 27 billion parameters, was created to meet the varied demands of the scientific community. Smaller models are more accessible and efficient; they can be deployed or fine-tuned with less computing power, which makes them well suited for exploratory studies or resource-constrained settings. Larger models perform better across a variety of biological tasks but require more compute. This range of model sizes lets researchers select the model that best suits their use case, balancing computation, speed, and performance requirements. Every model will be released as open source, allowing for further development and use.
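
Because the models keep the standard LLM architecture, they can be loaded with the usual Hugging Face tooling. The repository ID below is a placeholder to illustrate the pattern; check the project’s Hugging Face page for the actual released checkpoints:

```python
# Loading a C2S-Scale checkpoint with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository ID -- substitute one of the real released models.
model_id = "vandijklab/C2S-Scale-Gemma-2-2B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```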

What can C2S-Scale do?

Chat with biology: Question answering from single-cell data

Imagine someone asking, “How will this T cell react to anti-PD-1 therapy, a common cancer immunotherapy?”

Using both the cellular data and the biological knowledge absorbed during pre-training, C2S-Scale models can respond in natural language, as demonstrated on the left below. As seen on the right below, this makes conversational analysis possible, allowing researchers to engage with their data through natural language in a way that was previously impossible.

Biological insights with Conversational AI
Image credit to Google
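
Continuing from the loading sketch above, a question like this can be posed by embedding a cell sentence in a plain-language prompt. The prompt wording here is illustrative; the released examples document the exact formats the models were trained on:

```python
# Hypothetical cell sentence: top expressed genes of a T cell, truncated.
cell_sentence = "CD8A GZMB PDCD1 LAG3 NKG7"

prompt = (
    f"The following cell sentence lists a cell's most expressed genes "
    f"in descending order: {cell_sentence}\n"
    f"Question: How will this T cell respond to anti-PD-1 therapy?\n"
    f"Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```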

Interpret data with natural language

From characterizing the cell types of individual cells to producing summaries of entire tissues or experiments, Cell2Sentence-Scale can automatically generate biological summaries of scRNA-seq data at various levels of complexity. This enables researchers to understand new datasets more quickly and confidently, without having to write complicated analysis code.
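
A dataset-level summary can be requested the same way, again reusing the model and tokenizer loaded above with an illustrative prompt format:

```python
# Hypothetical cell sentences sampled from one experiment (truncated).
sampled_sentences = [
    "LYZ S100A8 S100A9 FCN1 CST3",   # monocyte-like profile
    "CD3E IL7R CCR7 LTB TCF7",       # naive T-cell-like profile
    "MS4A1 CD79A CD79B HLA-DRA",     # B-cell-like profile
]

prompt = (
    "Here are cell sentences sampled from a single-cell RNA-seq experiment:\n"
    + "\n".join(sampled_sentences)
    + "\n\nSummarize the cell types present and the likely tissue context."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
summary_ids = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(summary_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```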

Scaling laws in biology

A key conclusion of this work is that biological language models exhibit well-defined scaling laws, with performance improving predictably as model size increases. From generating cells and tissues to classifying cell types, larger C2S-Scale models typically outperform smaller ones across a variety of biological tasks.

In the parameter-efficient fine-tuning regime, increasing model size consistently improved semantic similarity scores for dataset interpretation. With full fine-tuning, the fraction of gene overlap in tissue generation improved dramatically as model capacity grew to 27 billion parameters. This pattern mirrors what has been observed in general-purpose LLMs and highlights a significant finding: biological LLMs will continue to improve with more data and computation, leading to more advanced and broadly applicable tools for biological discovery.

Scaling Laws for Single-cell Analysis
Image credit to Google
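
Scaling laws of this kind are usually summarized as a power law, performance ≈ a · N^b for model size N, which appears as a straight line in log-log space. The snippet below fits such a curve to made-up placeholder points purely for illustration; the measured C2S-Scale numbers are in the paper:

```python
import numpy as np

# Placeholder (model size, score) points -- NOT real C2S-Scale results.
params = np.array([4.1e8, 1e9, 2e9, 9e9, 2.7e10])  # parameter counts
scores = np.array([0.61, 0.65, 0.68, 0.73, 0.77])  # hypothetical benchmark scores

# Fit log(score) = b * log(N) + log(a), i.e. score = a * N^b.
b, log_a = np.polyfit(np.log(params), np.log(scores), 1)
print(f"performance ~ {np.exp(log_a):.3f} * N^{b:.3f}")
```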

Predicting the future of cells

One of the most fascinating uses of Cell2Sentence-Scale is predicting a cell’s response to a perturbation, such as a drug, a gene knockout, or cytokine exposure. Given a baseline cell sentence and a description of the treatment, the model can produce a new sentence that reflects the anticipated changes in gene expression.
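
Reusing the setup from the earlier sketches, such a prediction might be prompted as follows (the prompt format is again illustrative):

```python
# Hypothetical baseline cell sentence and treatment description.
baseline = "CD3E IL7R CCR7 LTB TCF7"
treatment = "24-hour exposure to interferon-gamma"

prompt = (
    f"Baseline cell sentence: {baseline}\n"
    f"Perturbation: {treatment}\n"
    f"Predicted cell sentence after perturbation:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
perturbed_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(perturbed_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```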

This capacity to simulate cellular behavior in silico can accelerate drug discovery, tailored therapy, and the prioritization of studies before they are conducted in the lab. C2S-Scale is also a major step toward realistic “virtual cells,” which have been proposed as the next generation of model systems and may provide faster, less expensive, and more ethical alternatives to conventional cell lines and animal models.

Optimizing with reinforcement learning

Just as reinforcement learning is used to fine-tune large language models like Gemini to follow instructions and respond in useful, human-aligned ways, comparable strategies are used to improve Cell2Sentence-Scale models for biological reasoning. By applying reward functions designed for semantic text evaluation (such as BERTScore), C2S-Scale is trained to produce informative, biologically accurate responses that are closer to the reference answers in the dataset. In complex tasks like simulating therapeutic treatments, this steers the model toward answers that are helpful for scientific discovery.
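
As a rough illustration of the reward signal, the bert-score package can turn the semantic similarity between a generated answer and a reference answer into a scalar. How that scalar is wired into the actual training loop is not shown here:

```python
# pip install bert-score
from bert_score import score

generated = ["The cell upregulates interferon-stimulated genes after treatment."]
reference = ["Treatment induces an interferon response program in the cell."]

# BERTScore returns precision, recall, and F1 tensors; F1 serves as the reward.
_, _, f1 = score(generated, reference, lang="en")
reward = f1.item()  # higher when the generation is semantically close to the reference
print(reward)
```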

Try it yourself

Cell2Sentence materials and models are now available on GitHub and Hugging Face. You are encouraged to explore these tools, experiment with your own single-cell data, and discover the potential of teaching machines to understand the language of life, one cell at a time.