LLM Augmented Bayesian Optimisation for Enzyme Buffer Formulation

Project Details

Program: BioEngineering
Field of Study: Bioengineering, Biochemistry, Computational Biology, or Computer Science
Division: Biomedical Sciences

Project Description

Commercial one step RT master mixes are the workhorse of RNA diagnostics, yet they remain expensive, proprietary, and supply chain dependent. Prior work by Graham et al. 2021 (1) demonstrated that open source one step RT mixes assembled from in-house expressed enzymes are viable alternatives to commercial reagents. This project aims to optimise an open source one step formulation for a novel RT enzyme, using an LLM augmented Bayesian optimisation agent to navigate the buffer space intelligently.

Moreover, a key cost driver in RT reactions is the recombinant peptide RNase inhibitor. Earl et al. 2017 (2) showed that polyvinylsulfonic acid (PVSA), a small molecule polyanion, inhibits a broad spectrum of RNases at approximately 1,700 times lower cost than commercial protein based inhibitors. This project therefore also evaluates PVSA and related chemical alternatives as cost effective drop in replacements within the buffer screen.

The central question is: does incorporating LLM extracted biological and biochemical reasoning into the Bayesian optimisation prior meaningfully reduce the number of experiments needed to identify an optimal open source RT buffer, either in a single screen or in as few iterative cycles as possible, compared with standard Bayesian optimisation or one factor at a time screening alone?

Reactions are miniaturised to 2 µL in a 96 well plate format using nanolitre dispensing and benchmarked by multiplex digital PCR (dPCR), enabling more than 90 formulations per screen across three temperatures.
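The plate arithmetic can be made concrete with a short sketch. It assumes one 96 well plate per temperature, 92 formulations plus four control wells, and placeholder temperatures; the control names, the formulation count, and the temperature values are all hypothetical and only illustrate how such a layout might be enumerated.

```python
from itertools import product
import string

ROWS = string.ascii_uppercase[:8]  # plate rows A..H
COLS = range(1, 13)                # plate columns 1..12


def plate_layout(n_formulations, control_names):
    """Map formulation indices and named controls onto 96 well positions."""
    wells = [f"{r}{c}" for r, c in product(ROWS, COLS)]  # A1, A2, ..., H12
    if n_formulations + len(control_names) > len(wells):
        raise ValueError("layout exceeds 96 wells")
    layout = {}
    # Formulations fill the plate row by row from A1.
    for i in range(n_formulations):
        layout[wells[i]] = f"formulation_{i + 1:02d}"
    # Controls occupy the last wells of the plate.
    for well, name in zip(wells[-len(control_names):], control_names):
        layout[well] = name
    return layout


# Hypothetical controls; the real control set is not specified in this description.
controls = ["commercial_mix", "no_template", "no_enzyme", "no_inhibitor"]
layout = plate_layout(92, controls)

# Placeholder temperatures (assumption): one identical plate per temperature.
temperatures_c = [42, 50, 55]
reactions = [(t, well, content)
             for t in temperatures_c
             for well, content in layout.items()]

total_volume_ul = 2 * len(layout)  # 2 µL per reaction, per plate
print(len(reactions), "reactions,", total_volume_ul, "µL reaction volume per plate")
```

Under these assumptions a single screen comprises 288 reactions, with 192 µL of total reaction volume per plate.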

Compositional Search Space

The optimisation agent maps a multidimensional compositional landscape by probing buffer components drawn from seven additive and supplement categories. Rather than testing categories sequentially as in conventional one factor at a time (OFAT) screening, the LLM augmented prior guides the agent to propose informative combinations across categories from the outset, reducing the number of cycles needed to locate the optimal region of the formulation space.
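A rough count shows why sequential screening and exhaustive screening are both unattractive in a space of this shape. The sketch below assumes four discrete levels in each of the seven categories; the category names and level counts are placeholders, since the actual components are not listed here.

```python
import random
from math import prod

# Hypothetical design: seven categories, four levels each (assumption).
levels_per_category = {f"category_{i}": 4 for i in range(1, 8)}

# Every possible combination of levels across the seven categories.
full_factorial = prod(levels_per_category.values())

# OFAT: one baseline run, then vary each category alone through its other levels.
ofat_runs = 1 + sum(n - 1 for n in levels_per_category.values())

print(full_factorial)  # 16384
print(ofat_runs)       # 22


def random_combinations(levels, n, seed=0):
    """Draw n formulations that vary all categories at once (level indices)."""
    rng = random.Random(seed)
    return [{name: rng.randrange(k) for name, k in levels.items()}
            for _ in range(n)]


# One plate's worth of cross-category combinations, here chosen at random.
batch = random_combinations(levels_per_category, 90)
```

OFAT needs few runs but only visits points along the axes through the baseline, so it cannot observe interactions between categories. A 90 formulation plate covers well under 1% of the full grid in this example, which is why the choice of which combinations to test matters; the random draw above is only a stand-in for the agent's guided proposals.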

Computational Architecture

  • LLM agent with tool calling (Claude API). Reads RT enzymology and buffer optimisation literature via RAG, queries protocol databases and journals (protocols.io, bio-protocol.com, STAR Protocols), and produces prior estimates for untested conditions with mechanistic rationales.
  • RAG pipeline (LlamaIndex + ChromaDB). A curated corpus of RT mechanism, polyanion RNase inhibition, and buffer design literature, embedded and indexed for semantic retrieval.
  • Multiobjective Bayesian optimiser (BoTorch, qEHVI acquisition). Gaussian Process surrogate with prior mean informed by LLM predictions; selects maximally informative batches per cycle.
  • Data pipeline. Prepares input files for the iDOT liquid dispenser, parses dPCR and plate reader output, and ingests results automatically into an experimental history database, closing the loop back to the agent after each plate.
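The role of an LLM informed prior mean in the surrogate can be illustrated with a minimal single objective Gaussian Process written in plain Python. This is a sketch under stated assumptions, not the project's BoTorch model: `llm_prior_mean` is a hypothetical placeholder for estimates the agent would supply, the RBF kernel and its lengthscale are arbitrary choices, and the two observations are invented for illustration.

```python
import math


def rbf(a, b, lengthscale):
    """Squared exponential kernel with unit signal variance."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / lengthscale ** 2)


def cholesky(A):
    """Lower triangular L with L L^T = A for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(X, y, x_star, prior_mean, lengthscale=0.2, noise=1e-4):
    """Posterior mean and variance at x_star for a GP with a non-zero prior mean."""
    K = [[rbf(a, b, lengthscale) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    L = cholesky(K)
    # The GP models the residual between observations and the prior mean.
    resid = [yi - prior_mean(xi) for xi, yi in zip(X, y)]
    alpha = solve_cholesky(L, resid)
    k_star = [rbf(x_star, xi, lengthscale) for xi in X]
    mean = prior_mean(x_star) + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_cholesky(L, k_star)
    var = rbf(x_star, x_star, lengthscale) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)


def llm_prior_mean(x):
    """Hypothetical stand-in for an LLM estimate of normalised cDNA yield."""
    return 0.6 - 0.5 * (x[0] - 0.5) ** 2


# Invented observations: two normalised formulations and their yields.
X = [(0.2, 0.3), (0.8, 0.1)]
y = [0.4, 0.9]

near_mean, near_var = gp_posterior(X, y, X[0], llm_prior_mean)
far_mean, far_var = gp_posterior(X, y, (5.0, 5.0), llm_prior_mean)
```

Near a measured formulation the posterior follows the data; far from any measurement it falls back to the prior mean with full prior variance. That fallback is the mechanism by which LLM predictions can steer early batches, and also the reason a poor prior can mislead the optimiser until data accumulate.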

Key References

1. Graham TGW, Dugast-Darzacq C, Dailey GM, Darzacq X, Tjian R. Simple, Inexpensive RNA Isolation and One-Step RT-qPCR Methods for SARS-CoV-2 Detection and General Use. Curr Protoc. 2021 Apr;1(4):e130. doi: 10.1002/cpz1.130. PMID: 33905620; PMCID: PMC8206771.

2. Earl CC, Smith MT, Lease RA, Bundy BC. Polyvinylsulfonic acid: A Low-cost RNase inhibitor for enhanced RNA preservation and cell-free protein translation. Bioengineered. 2018 Jan 1;9(1):90-97. doi: 10.1080/21655979.2017.1313648. Epub 2017 Jun 29. PMID: 28662363; PMCID: PMC5972934.


About the Researcher

Fabian Schmidt

Desired Project Deliverables

Surrogate landscape and benchmarking suite. A Gaussian Process model fitted to published RT buffer data, with a reproducible comparison of random search, OFAT, standard BO, and LLM augmented BO across convergence speed and cDNA yield.

 LLM augmented Bayesian optimisation pipeline. A working Python codebase integrating the Claude API (RAG over a curated enzymology corpus), BoTorch multiobjective acquisition (qEHVI), and an automated data ingestion layer that closes the experimental loop after each plate read.

 Experimentally validated open source buffer formulation. At least three candidate 2 µL RT buffer formulations benchmarked by multiplex dPCR across a minimum of two screening cycles, including at least one formulation incorporating a cost effective RNase inhibitor substitute.

 RNase inhibitor cost effectiveness report. A side by side evaluation of PVSA and related polyanions versus a standard recombinant peptide inhibitor, quantifying inhibitory efficacy, impact on cDNA yield, and estimated reagent cost per reaction.

Convergence report. A quantitative comparison of how many experiments each strategy requires to reach the cDNA yield target, with statistical uncertainty bands from repeated simulated trials and a direct comparison against the BEARmix open source reference (1).

 Reproducibility package. Electronic lab notebook, plate layout files, and a README sufficient for any laboratory to rerun the optimisation loop and reproduce the experimental screen.
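The experiments-to-target metric behind the convergence report could be computed along the following lines. This is a sketch only: the synthetic yield landscape, the target of 0.7, the 90 experiment budget, and the use of random search as the example strategy are all assumptions, and the numbers it produces say nothing about real RT buffers.

```python
import random
import statistics


def experiments_to_target(yields, target):
    """1-based index of the first experiment reaching the target, else None."""
    for i, value in enumerate(yields, start=1):
        if value >= target:
            return i
    return None


def summarise(counts, budget):
    """Median and 10th to 90th percentile band over repeated trials.

    Trials that never reach the target are counted as budget + 1 (censored).
    """
    filled = sorted(budget + 1 if c is None else c for c in counts)
    deciles = statistics.quantiles(filled, n=10)
    return {
        "median": statistics.median(filled),
        "p10": deciles[0],
        "p90": deciles[-1],
        "success_rate": sum(c is not None for c in counts) / len(counts),
    }


def toy_yield(x):
    """Hypothetical synthetic landscape with its optimum at 0.6 on every axis."""
    return 1.0 - sum((xi - 0.6) ** 2 for xi in x)


def random_search_trace(rng, budget, dim=7):
    """Yields observed by uniform random sampling of a dim-dimensional unit cube."""
    return [toy_yield([rng.random() for _ in range(dim)]) for _ in range(budget)]


budget, target = 90, 0.7
counts = [experiments_to_target(random_search_trace(random.Random(seed), budget), target)
          for seed in range(20)]
report = summarise(counts, budget)
print(report)
```

The same two helpers would apply unchanged to traces from OFAT, standard BO, and LLM augmented BO, so that all strategies are scored against an identical target and budget.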


Recommended Student Background

Python scientific stack (NumPy, Pandas, Matplotlib); PyTorch or GPyTorch experience is advantageous but not required.
Basic understanding of Bayesian inference or Gaussian Processes (a structured reading list will be provided at the start of the internship).
Comfort with command line tools and version control (Git).
Interest in LLM APIs and prompt engineering; prior experience with the Anthropic or OpenAI API is a plus.
Lab experience with pipetting and plate based assays is helpful but not mandatory; full training on the nanolitre dispensing platform will be provided.
Soft skills. Systematic record keeping, clear written communication, and the ability to work independently as well as collaboratively during weekly check ins with the supervisor.