[Figure: Composition-space grid comparing a brute-force scatter with a few AI-targeted experiments]

Artificial Intelligence · Machine Learning · Materials Science

The $78,000 Way to Miss the Answer: Why We Build Self-Driving Labs Instead of Bigger Screening Campaigns

Ashutosh Singhal · May 18, 2026 · 13 min read

The first self-driving lab I ever helped design didn't fail because of the AI. It failed because of a liquid handler.

We had spent weeks building the part everyone gets excited about: a Bayesian optimization loop wrapped around a graph neural network that could predict the bandgap of a halide perovskite in milliseconds instead of waiting days for a furnace. In simulation it was beautiful. The acquisition function picked compositions the way a chess engine picks moves. And then we walked into the actual lab to connect it to the actual instruments, and the whole thing stopped dead in front of a twelve-year-old Hamilton liquid handler whose method files refused to talk to anything that wasn't Hamilton's own software.

That was the day I understood what a self-driving lab actually is. It is not a clever model. It is a closed loop — design an experiment, run it on real hardware, read the result, decide the next one, repeat — and the loop is only as strong as its weakest, most boring link. Usually that link is a piece of equipment from 2014 that was never meant to be automated by anyone but its vendor.
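The loop itself is small enough to sketch. What follows is a minimal illustration, not our production code: `simulated_instrument` is a hypothetical stand-in for real hardware, and `propose` is a toy decision rule that perturbs the best point seen so far, standing in for a real optimizer. The part worth noticing is the `except` branch, where a dead instrument becomes a recorded result and the loop keeps going.

```python
import random


def simulated_instrument(x):
    """Hypothetical stand-in for hardware: a noiseless response peaking at x = 0.7."""
    return -(x - 0.7) ** 2


def propose(history, rng):
    """Toy decision rule (not a real optimizer): perturb the best point seen so far."""
    measured = [(x, y) for x, y in history if y is not None]
    if not measured:
        return rng.random()
    best_x, _ = max(measured, key=lambda pair: pair[1])
    return min(1.0, max(0.0, best_x + rng.uniform(-0.1, 0.1)))


def run_closed_loop(execute, budget, seed=0):
    """Design -> run -> read -> decide, repeated until the budget is spent."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        x = propose(history, rng)      # design the next experiment
        try:
            y = execute(x)             # run it and read the result
        except OSError:
            y = None                   # the instrument did not answer; keep the record
        history.append((x, y))         # the next decision sees the full history
    return history


history = run_closed_loop(simulated_instrument, budget=30)
best_x, best_y = max((p for p in history if p[1] is not None), key=lambda p: p[1])
print(f"best composition parameter: {best_x:.3f}, response: {best_y:.5f}")
```

Swap `simulated_instrument` for a function that talks to a real device and nothing else in the loop changes, which is exactly why the device is where the trouble concentrates.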

Everyone wants to talk about the optimizer. The thing that decides whether your lab becomes autonomous is whether your fifteen-year-old spectrometer will return a number to a Python process at 3 a.m.

I want to walk through why high-throughput screening — the workhorse of industrial R&D for thirty years — is now mathematically obsolete, what an AI-directed lab does instead, and the unglamorous engineering that decides whether one of these systems ever actually works in your building. This is the problem we now build self-driving labs around, and most of what makes them hard is nothing like what the conference talks suggest.

Why Is High-Throughput Screening Suddenly Obsolete?

Start with a number that should bother any head of R&D. The count of drug-like small molecules that obey Lipinski's rules — the basic chemistry of "this could be a pill" — is estimated at 10⁶⁰. A large high-throughput screening campaign, the kind that costs millions in robotics and reagents, tests around 10⁶ compounds.

Do the division. 10⁶ out of 10⁶⁰ is one part in 10⁵⁴: you are sampling roughly 10⁻⁵² percent of the space. Push into biologics and multi-element alloys and the space stretches toward 10¹⁰⁰, which is more than the number of atoms in the observable universe.

The deeper problem isn't even the coverage. It's the assumption underneath it. Screening presumes the answer already sits in a pre-synthesized library on a shelf. For a genuinely novel material — a lead-free perovskite, a solid-state electrolyte, a new MOF — the optimal composition has almost certainly never been made by anyone. You are not searching a haystack for a needle. You are searching a haystack the size of the Pacific for a needle that you still have to forge yourself.

This is the engine under Eroom's Law — Moore's Law spelled backwards — the well-documented fact that pharma R&D productivity has fallen for decades even as spending climbed. Drug development now runs north of $2 billion per asset (Deloitte, 2024), roughly 90% of candidates fail in clinical trials, and pharma's internal rate of return cratered to a twelve-year low of 1.2% in 2022 before clawing back to 5.9% in 2024, largely on the strength of a few GLP-1 outliers. Brute-force search is a tax on every one of those numbers.

What Brute Force Actually Costs in a Real Budget

[Figure: Comparison of traditional screening (520 experiments, $78,000) with a self-driving lab (80 to 120 experiments, $12,000 to $18,000)]

Here's how the obsolescence shows up in a real budget, because abstractions about 10⁶⁰ don't move anyone.

Picture a mid-size materials lab chasing a lead-free halide perovskite with a specific bandgap and stability profile for next-generation solar cells. Five candidate cations, eight anion combinations, continuous stoichiometry ratios: call it 10⁸ viable compositions. The traditional method is a postdoc synthesizing about ten compositions a week, guided by literature and an adviser's intuition. At about $150 per synthesis once you count precursors, substrate prep, and characterization (a line item I've watched accumulate on more than one purchase order), that's 520 compositions and $78,000 over a year.

Five hundred and twenty out of a hundred million. That's 0.00052% of the space, and the best candidate it surfaces may sit nowhere near the real optimum.

On the pharma side the same arithmetic plays out against the clock instead of the bench. I've watched the same loop logic (predict first, synthesize only the candidates worth synthesizing) compress preclinical work that traditionally runs four to five years toward the roughly twelve-month cycle Exscientia ran to get an AI-designed molecule into Phase I, with preclinical R&D costs down 25 to 50%. Same principle, far larger denominator.

Now run the same problem as a self-driving lab. You pre-train a graph neural network surrogate on 50,000 DFT-calculated perovskite structures from the Materials Project, a public database of computed material properties, so the model can predict bandgap and formation energy in milliseconds. Then a Bayesian optimizer using an Expected Improvement acquisition function scores every candidate and runs the highest scorer next. A candidate scores well for one of two reasons: either its predicted performance is high, or the model's uncertainty there is large enough to be worth resolving. The optimizer deliberately skips the hundreds of runs that would have produced incremental or useless data.
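To make the acquisition step concrete, here is a minimal sketch of Expected Improvement in its standard closed form for a Gaussian posterior, written with only the standard library. The three candidates and their numbers are hypothetical; in a real loop `mu` and `sigma` would come from the surrogate model, and a library implementation would replace this function.

```python
import math


def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    """Expected Improvement for maximization under a Gaussian posterior.

    mu, sigma   : surrogate mean and standard deviation at a candidate
    best_so_far : best measured value to date
    xi          : optional exploration margin
    """
    gain = mu - best_so_far - xi
    if sigma <= 0.0:
        return max(gain, 0.0)          # no uncertainty: improvement is deterministic
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf


best_so_far = 1.0  # hypothetical best score measured so far

# Hypothetical candidates: (predicted mean, predicted standard deviation)
candidates = {
    "high prediction": (1.2, 0.05),   # model is confident it beats the incumbent
    "high uncertainty": (0.9, 0.50),  # predicted worse, but the model barely knows
    "incremental": (0.9, 0.05),       # predicted worse, and the model is sure of it
}

scores = {
    name: expected_improvement(mu, sigma, best_so_far)
    for name, (mu, sigma) in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name:>16}: EI = {scores[name]:.4f}")
```

The first two candidates both earn a real score, one for its prediction and one for its uncertainty. The third, a confident prediction of a mediocre result, scores near zero, and that is the kind of experiment the loop never runs.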

That system reaches the top 0.1% of the composition space in 80 to 120 targeted experiments, for $12,000 to $18,000 in reagents. Same lab, same instruments, same postdoc, with roughly a fifth of the experiments and a fifth of the cost.
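The arithmetic behind both budgets fits in a few lines. This sketch uses only figures from the text and assumes the same $150 per experiment applies to both approaches, which is what the quoted $12,000 to $18,000 range implies.

```python
COST_PER_SYNTHESIS_USD = 150   # per-experiment cost quoted in the text
SPACE_SIZE = 10**8             # "call it 10^8 viable compositions"


def campaign(n_experiments):
    """Cost and coverage of a campaign of n_experiments syntheses."""
    return {
        "experiments": n_experiments,
        "cost_usd": n_experiments * COST_PER_SYNTHESIS_USD,
        "coverage_pct": 100 * n_experiments / SPACE_SIZE,
    }


brute_force = campaign(520)
targeted_low, targeted_high = campaign(80), campaign(120)

print(f"brute force: ${brute_force['cost_usd']:,} "
      f"for {brute_force['coverage_pct']:.5f}% of the space")
print(f"targeted:    ${targeted_low['cost_usd']:,} to ${targeted_high['cost_usd']:,}")
print(f"experiments saved: {520 - 120} to {520 - 80}")
print(f"reduction factor:  {520 / 120:.1f}x to {520 / 80:.1f}x")
```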

The win isn't running experiments faster. It's never running the four hundred experiments that were always going to tell you nothing.

The four- to sixfold reduction in this example is, if anything, conservative. The 10–50x reduction in experiments-to-target usually quoted for these systems is not a marketing figure I invented; it's the consistent gap between Bayesian optimization and random screening across the materials literature. And the cost side has its own number: Cost-Informed Bayesian Optimization, published on ChemRxiv in 2024, cuts optimization cost by up to 90% by treating the price of each experiment as part of what the acquisition function optimizes, not an afterthought.
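One simple way to put price inside the acquisition function is to rank candidates by expected gain per dollar. The sketch below shows that heuristic only; it is an assumption for illustration, not necessarily the formulation used in the ChemRxiv paper, and the candidate names, gains, and costs are hypothetical.

```python
def gain_per_dollar(expected_gain, cost_usd):
    """Cost-weighted acquisition score: expected gain divided by experiment cost."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return expected_gain / cost_usd


# Hypothetical candidates: expected gain (e.g. an EI value) and reagent cost
candidates = {
    "exotic precursor": {"expected_gain": 0.20, "cost_usd": 900.0},
    "standard precursor": {"expected_gain": 0.15, "cost_usd": 150.0},
}

pick_by_gain = max(candidates, key=lambda n: candidates[n]["expected_gain"])
pick_by_value = max(candidates, key=lambda n: gain_per_dollar(**candidates[n]))

print("ignoring cost, run:   ", pick_by_gain)
print("accounting for cost:  ", pick_by_value)
```

A cost-blind optimizer chases the slightly better prediction at six times the price. A cost-aware one takes the cheaper experiment that delivers most of the same information.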

Why I Stopped Trusting the Model and Started Worrying About the Plumbing

[Figure: Closed-loop self-driving lab diagram, with the SiLA 2 integration layer flagged as 80% of the work]

Back to that Hamilton handler.

When our simulation-perfect loop died on contact with real hardware, my first instinct was the wrong one. I assumed it was a one-off — bad luck with one old instrument. So we wrote a quick adapter, got it limping, and moved on. Then the next lab had a Tecan running FluentControl, the one after that had Agilent instruments speaking a third dialect, and a LIMS and an ELN that stored the same experiment in two incompatible formats. Every site, the model was fine on day one and the integration ate the weeks that followed.

That's when it landed: in a self-driving lab, the AI is maybe 20% of the work. The other 80% is making heterogeneous, often ancient instruments behave like a single programmable system. Labs quietly survive this today by turning people into human middleware — a scientist re-keying numbers from a spectrometer into a spreadsheet into the LIMS — which is exactly the manual, error-prone step automation was supposed to remove.

The honest fix is SiLA 2, the lab-automation standard meant to give instruments a common language. But "standard" oversells it. Each instrument combination is its own integration project. So we started hand-writing SiLA 2 driver stacks to wrap specific legacy instruments — the twelve-year-old handler included — and treating that driver layer as a first-class deliverable, not glue code to be apologized for. It is the part that decides whether the lab is autonomous or just a demo.
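The shape of that driver layer is an adapter: one uniform interface facing the optimizer, one vendor-specific translation behind it. The sketch below illustrates the pattern only. It does not use the SiLA 2 API, and every name in it (`Instrument`, `FakeVendorHandler`, `LegacyHandlerAdapter`, the method-file name, the `OK;...` reply format) is hypothetical.

```python
from abc import ABC, abstractmethod


class Instrument(ABC):
    """Uniform surface the optimizer talks to (hypothetical interface)."""

    @abstractmethod
    def run(self, command: str, **params) -> dict:
        """Execute a command and return a structured result."""


class FakeVendorHandler:
    """Stand-in for vendor software that only understands its own method files."""

    def execute_method_file(self, method_name, volume_ul, well):
        return f"OK;{method_name};{volume_ul};{well}"


class LegacyHandlerAdapter(Instrument):
    """Translates uniform commands into vendor calls and parses the replies."""

    METHODS = {"dispense": "dispense_v2.method"}  # hypothetical method-file name

    def __init__(self, vendor):
        self._vendor = vendor

    def run(self, command, **params):
        if command not in self.METHODS:
            raise ValueError(f"unsupported command: {command}")
        raw = self._vendor.execute_method_file(
            self.METHODS[command], params["volume_ul"], params["well"]
        )
        status, _method, volume, well = raw.split(";")
        return {
            "ok": status == "OK",
            "command": command,
            "volume_ul": float(volume),
            "well": well,
        }


adapter = LegacyHandlerAdapter(FakeVendorHandler())
result = adapter.run("dispense", volume_ul=25.0, well="A1")
print(result)
```

Every real instrument needs its own version of the middle class, which is why each combination ends up as its own integration project.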

There was a second technical humbling waiting on the model side, too. The first time a stoichiometry search blew past a couple dozen tunable parameters, our textbook Gaussian-process optimizer slowed to a crawl. Two separate limits bite here. Exact GP inference scales as O(n³) in the number of observations n, so every added experiment makes the next fit slower. And standard GP-based Bayesian optimization loses its grip as the number of tunable dimensions grows; somewhere above ~50 dimensions it simply stops being usable. You have to move to scalable variants (sparse approximations such as SVGP, or deep kernel learning) that most off-the-shelf tooling doesn't expose. And the GNN surrogate that looked so smart in the perovskite example? Cold, it's worthless. It needs 500 to 1,000 DFT-calculated structures before its predictions beat a coin flip. The cold-start problem isn't something you solve by picking a better model; it's a transfer-learning problem you solve with domain expertise about which pre-trained chemistry to fine-tune from.
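A rough operation count shows why the cubic term hurts, reading n as the number of observations the GP has been fitted to. The sketch assumes the usual leading-order costs, n³ for exact inference and n·m² for an inducing-point sparse approximation with m inducing points, and ignores constants; the choice of m = 100 is arbitrary.

```python
def exact_gp_ops(n):
    """Leading-order cost of exact GP inference on n observations."""
    return n ** 3


def sparse_gp_ops(n, m):
    """Leading-order cost of an inducing-point approximation with m inducing points."""
    return n * m ** 2


M_INDUCING = 100  # arbitrary illustrative choice

for n in (100, 1_000, 10_000):
    exact = exact_gp_ops(n)
    sparse = sparse_gp_ops(n, M_INDUCING)
    print(f"n = {n:>6}: exact ~{exact:.0e} ops, sparse ~{sparse:.0e} ops, "
          f"ratio {exact // sparse}x")
```

Doubling the data multiplies the exact cost by eight. The sparse cost only doubles, and at a fixed m the gap widens with the square of n.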

None of that is in the brochure. All of it is what the project actually is.

Who Else Builds Self-Driving Labs — and What Do They Quietly Cost You?

The self-driving lab field consolidated fast, and the players are real and well-funded. It's worth knowing what each actually gives you before you sign anything.

Radical AI raised $55M in a seed-plus round and a $60M Series A, backed by RTX Ventures and NVIDIA's NVentures, and opened a Brooklyn Navy Yard facility in January 2026 with Governor Hochul on hand — they screen billions of compositions and run upwards of 100 experiments a day, north of 25 alloys daily. Impressive, and tuned for metallurgy. The catch is structural: your data lives on their stack, and the optimization logic is their black box, not yours to modify. Emerald Cloud Lab runs 200+ automated instruments at Carnegie Mellon — you ship samples and get results back — which is genuinely useful and also means your proprietary chemistry leaves your premises and you're limited to their assay catalog. Atinary, Kebotix, the newly-expanded Lila Sciences "AI science factory" — same shape of trade-off. You adapt your workflow to their platform.

Every platform vendor in this space is selling you the same thing under different names: their optimizer, their instruments, their cloud — and your intellectual property as a tenant on it.

Then there are the Big Four consultancies that will sell a lab-strategy engagement for $500K to $5M and hand you a vendor-selection deck after the better part of a year. They implement platforms; they don't build optimization engines. And there is the in-house route, which gives you total control and requires you to staff a Bayesian-optimization and GNN team that takes a year to become productive.

What almost nobody offers is the middle: keep your existing instruments and your data, and add an AI optimization layer that's yours. That gap — between do-it-yourself and platform lock-in — is the entire reason a vendor contract clause reading "your chemical data resides on our infrastructure" makes a careful R&D head pause. It's also where we chose to build.

What Happens When You Forget the Failures

There's a number from Berkeley's A-Lab that I bring up in almost every pitch because it cuts the right way. The A-Lab synthesized 41 novel materials in 17 days — a landmark autonomous result. Its synthesis success rate was 71%.

From an enterprise seat, that 71% reads as its complement. Twenty-nine percent of its targets never came out as intended. On a production line the failures look more mundane (clogged pipettes, sensor drift, degraded reagents, a furnace that didn't hit temperature), but the share is the point. For an academic proof of concept, brilliant. For a regulated pharma line, that 29% is not waste to be hidden. It is data you are legally required to keep.

This is where I've watched the most expensive mistakes happen, and it has nothing to do with chemistry. FDA's data-integrity standard, ALCOA+, requires that records be Attributable, Legible, Contemporaneous, Original, and Accurate, and the "plus" adds Complete, Consistent, Enduring, and Available. Complete means the failed experiments are captured too, not just the runs that worked. Most self-driving-lab software silently drops failed runs on the floor. In a research lab nobody notices. In a GxP environment, that omission is precisely the finding that produces a CDER data-integrity warning letter citing missing records.

That's not hypothetical pressure. CDER warning letters jumped 50% in fiscal 2025, with data-integrity issues among the dominant citations; the FDA sent integrity letters to firms like Tyche Industries and Jagsonpal Pharmaceuticals in early 2025. And in January 2026 the FDA and EMA jointly published ten Guiding Principles for Good AI Practice in drug development, centered on data governance, lifecycle management, and human oversight. So when we build a loop for a regulated lab, the audit trail that captures every clogged-pipette failure with a timestamp and provenance isn't a feature we add at the end. It's a constraint we design the loop around from the first line.
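Designing the loop around the audit trail mostly means one thing in code: the failure path writes a record just as the success path does. Here is a minimal sketch under stated assumptions. The `AuditTrail` class, its field names, and the hash-chaining scheme are illustrative choices, not a validated GxP system and not our production design.

```python
import hashlib
import json
from datetime import datetime, timezone


class AuditTrail:
    """Append-only log; each entry is chained to the previous one by hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self._records = []

    @staticmethod
    def _digest(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def record(self, run_id, operator, status, detail):
        body = {
            "run_id": run_id,
            "operator": operator,                                 # attributable
            "status": status,                                     # "ok" or "failed"
            "detail": detail,
            "timestamp": datetime.now(timezone.utc).isoformat(),  # contemporaneous
            "prev_hash": self._records[-1]["hash"] if self._records else self.GENESIS,
        }
        entry = dict(body, hash=self._digest(body))
        self._records.append(entry)
        return entry

    def verify(self):
        """Return True if no entry has been altered or removed mid-chain."""
        prev = self.GENESIS
        for entry in self._records:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

    def count(self, status):
        return sum(1 for entry in self._records if entry["status"] == status)


def run_with_audit(trail, run_id, operator, experiment):
    """Run one experiment; the failure path is recorded, never dropped."""
    try:
        value = experiment()
    except Exception as err:
        trail.record(run_id, operator, "failed", repr(err))
        return None
    trail.record(run_id, operator, "ok", f"result={value}")
    return value


def clogged_dispense():
    raise RuntimeError("pipette clogged")


trail = AuditTrail()
run_with_audit(trail, "run-001", "robot-a", lambda: 1.62)
run_with_audit(trail, "run-002", "robot-a", clogged_dispense)
print(trail.count("ok"), "succeeded;", trail.count("failed"), "failed;",
      "chain intact:", trail.verify())
```

The hash chain is there so that a record quietly edited or deleted after the fact breaks verification, which is the property an auditor will actually test.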

"Won't the Platforms Just Do This?" — and Other Things People Ask Me

The question I hear most is whether the well-funded platforms will simply absorb this need. They'll absorb the labs that are happy to be tenants. They won't serve the mid-size lab that has twenty years of proprietary data, a building full of instruments it already paid for, and a board that won't accept its IP living on someone else's cloud. Those two needs are structurally opposed, and funding doesn't reconcile them.

The second question is whether the AI is mature enough to trust. The field's own benchmarks say the modeling has arrived: in the 2026 Matbench Discovery results, 45 models were ranked and the best, PET-OAM-XL, hit an F1 of 0.924 and a discovery acceleration factor above 6x. NC State's flow-driven self-driving lab technique, published in Nature Chemical Engineering in mid-2025, collected 10x more data than prior approaches. The science is ready. What isn't commoditized is wiring it into your instruments under your compliance regime.

The third question is the quiet one: is this just for big pharma? No. The economics bite hardest for the mid-size materials and life-sciences labs whose R&D budget can't absorb a $78,000 year to search 0.0005% of a space, and who can't write a $5M consulting check to be told which platform to buy.

The Part No One Puts on a Slide

I've come to think the romance of the autonomous lab — robots pipetting through the night, AI dreaming up molecules — does the field a disservice. It sells the romance and hides the work. The reason most of these projects stall isn't that the optimizer wasn't smart enough. It's that nobody wanted to spend weeks writing a SiLA 2 driver for a liquid handler, or build an audit trail that records the experiments that failed, or fine-tune a surrogate past its cold start with the right domain chemistry.

That unglamorous work is the work. It's why we build the optimization engines, the instrument integrations, and the closed-loop architecture as one system, for the specific lab in front of us — your instruments, your materials, your data staying yours.

Edison tested thousands of filaments by hand because in his era theory lagged behind experiment. In 2026 that excuse is gone. We can predict before we synthesize, choose the one experiment worth running out of four hundred, and let the lab run itself toward an answer. The labs that keep brute-forcing the search aren't being thorough. They're paying a postdoc's whole year to look at 0.0005% of the picture and calling it diligence.