A verification report shows a green PROVEN checkmark beside a silicon chip die frozen in a red deadlock.
Artificial IntelligenceSemiconductorsMachine Learning

The Formal Verifier Said "Proven." The Chip Was Still Broken.

Ashutosh SinghalAshutosh SinghalMay 21, 202614 min read

The first time one of our own tools lied to me, it did it politely. The formal engine returned a green checkmark. Every property we'd asked it to prove came back proven. The assertion library looked complete. On paper, the design was verified.

It took us two days to realize the certificate was worthless — and that everything I'd assumed about semiconductor AI verification was protecting us from the wrong failure.

The properties had been generated by a large language model we'd wrapped around the formal engine — exactly the kind of slick agentic chip-design demo everyone was building. And the engine had proven them in milliseconds, which should have been the tell. They were proven because the conditions that would test them never fired. A property that says "every request eventually gets a grant" is trivially true if the request can never happen. The solver wasn't wrong. It answered the question we accidentally asked, and the question was empty. That is the moment I understood what these AI verification pipelines actually have to defend against — and why almost nobody selling one was defending against the right thing.

I want to tell you what we found, because the bug classes that kill modern chips are not the ones the industry advertises it can catch. We build custom verification pipelines for fabless chip teams — fine-tuned open-weight models wrapped around the formal engine a team already owns, running entirely on their own hardware — and the whole approach came out of getting this wrong first. You can see the shape of it at veriprajna.com/solutions/semiconductor-ai-verification. But the approach only makes sense once you've watched a "passing" design fail.

A number that should frighten every fabless CFO

One statistic frames the whole problem. The 2024 Wilson Research Group / Siemens EDA Functional Verification study — the longest-running benchmark in the industry — put first-silicon success at 14%. The lowest it has been in twenty years of tracking. In 2020 that number was around 32%. It has more than halved in four years.

That is not a story about engineers getting lazy. It's a story about complexity outrunning the tools, and about a specification that mutates faster than the testbench chasing it. The same study found that roughly 70% of respins are caused by errors tied to specification changes — not pure logic bugs.

First-silicon success didn't fall because chip teams got worse. It fell because the spec now moves faster than verification can close it.

I sat with that 70% number for a long time, because it quietly indicts a lot of "AI for chip design" pitches. If most respins come from spec drift, then a tool that only catches logic bugs is addressing a slice of the problem and calling it the whole pie.

And the cost of being wrong is brutal in a way software people underestimate. A 5nm mask set runs $10–20M. At 3nm you're approaching $40M, with seventy-plus mask layers including more than twenty EUV layers. A full 2nm tape-out now costs on the order of $725M all-in. When something slips through, a respin is a 3-to-6-month schedule slide on top of the mask bill. On an eighteen-month product window, a six-month slip can erase half of a chip's lifetime revenue. The CFO doesn't see verification as a cost center. They see it as insurance against a single eight-figure mistake.

How does an LLM actually hallucinate in hardware?

Five-row table of LLM hardware hallucination classes; only the syntactic class is caught in simulation.

Your team is already using LLMs on Verilog. I'm not going to pretend otherwise, and neither should anyone selling to you. The research field is real now — papers on LLMs writing Verilog went from one in 2020 to sixty-four in 2025. On the harder problems in the VerilogEval benchmark, GPT-4-class models land around 43% functional correctness. Useful. Not trustworthy.

The interesting question isn't whether the models help. It's how they fail, because the failure modes don't look like the failures software engineers are trained to fear. After enough late nights staring at why "correct-looking" RTL went wrong, I started sorting the failures into five classes.

The first is the harmless one. Syntactic hallucination — code that simply won't compile. Verilator or the synthesis front-end catches it in seconds. Everyone already knows how to handle this, and it's the class people point to when they want to claim the problem is solved.

The second is where it starts to hurt. Semantic hallucination, and the textbook case is blocking versus non-blocking assignment. A model trained on Python and C reads Verilog as if statements run top to bottom. So it writes a blocking assignment — a plain = — inside a clocked always_ff block where the language requires the non-blocking <=. The designer wanted a two-cycle pipeline. With blocking assignments, the second stage reads the new value of the first, not the old one, and the silicon ships a one-cycle bypass instead. Some simulators schedule the events in an order that hides the race entirely. It simulates fine. It synthesizes to the wrong machine.

The third class is the one that taught me to keep the AMBA spec open on a second monitor. Protocol hallucination. The code compiles, passes ninety percent of the directed tests, and then asserts a write-data-valid signal before the slave is ready to accept the address — a violation of the AXI handshake. Or it holds a valid signal high while flipping the data underneath it. Or it trips a sub-clause buried on page 84 of the bus specification that no human has memorized. The chip works perfectly on your internal test harness and hangs the instant it's wired to a third-party memory controller. You cannot brute-force this with more simulation cycles. You catch it with pre-verified assertion libraries written against each protocol.

The fourth class is the one that fooled us, and it's the dangerous one.

Why is "proven" the most dangerous word in a verification report?

Vacuity hallucination. The model generates a SystemVerilog assertion. The formal engine proves it. You ship. And the property was trivially true because its antecedent — the "if" part — never became true in the first place.

Picture an arbiter property: if a request is raised, a grant must follow within some number of cycles. Now suppose the same model, trying to be helpful, also wrote an assumption that forces the request signal low. The formal engine dutifully proves the grant property in milliseconds, because there's no request to ever violate it. The real arbiter underneath might be completely broken. The certificate says "proven." It is worthless.

A formal proof that runs in milliseconds isn't fast. It's suspicious. The engine may have proven nothing at all.

This is not a fringe risk. Siemens has been publishing warnings about vacuous proofs since 2017, and the field still ships flows without automatic vacuity checking. When I tell people that a formal verification flow without vacuity detection is worse than no formal verification, they think I'm being dramatic. I'm not. No verification leaves you appropriately scared. A vacuous "proven" leaves you confident and wrong, with a signed-off block heading to tape-out. That's the failure that nearly shipped in our own early prototype, and it permanently changed what we build.

The fifth class is the one simulation structurally cannot see. Clock-domain-crossing blind spots. An LLM reads signal names; it does not perceive clock domains. So it connects a signal from a 2 GHz CPU domain straight into a 400 MHz peripheral domain flop, skipping the double-flop synchronizer that prevents metastability. RTL simulation does not model metastability — so the regression passes, every time, and the silicon deadlocks in the field. This is why CDC bugs eat mask sets. It's also why Accellera opened a CDC/RDC interoperability standard in 2024: the fragmentation across the commercial CDC tools had gotten bad enough to break sign-off.

Classes two through five share one terrifying property: they pass simulation. They only surface in silicon. That's the subset of bugs that still blows up tape-outs, and it's precisely the subset a "catches logic bugs" pitch leaves on the table.

The vendor wall a chip team is actually standing in front of

When I went looking for genuinely independent help on a tape-out, I came up with fewer options than I'd had a few years earlier — and that's not nostalgia, it's the market structure. The three EDA vendors a design-verification lead can actually choose from — Synopsys, Cadence, Siemens — have gone from under 75% of the market a decade ago to over 85% today. Fold in Synopsys's $35B Ansys acquisition and the top four control roughly 90%. The field a fabless startup is shopping in has narrowed, not widened, exactly as the problem got harder.

And the incumbents are not standing still on AI. Synopsys shipped AgentEngineer in March 2026, an agentic verification workflow claiming two-to-five-times productivity, and it sits on top of VC Formal — the most credible vendor agentic stack I've evaluated. Cadence announced its ChipStack AI Super Agent in February 2026 and runs Cerebrus AI Studio for reinforcement-learning-driven implementation, with JasperGold remaining the gold-standard formal engine that everyone else is measured against. I say that without hedging: JasperGold and VC Formal are genuinely excellent. Anyone who opens a pitch by trashing them has told you they've never closed a real tape-out.

The catch is pricing and posture. JasperGold's historical baseline ran around $225K plus $45K per seat — fine for a large fabless company, out of reach for the early-stage RISC-V and AI-accelerator startups doing the most interesting work. And the incumbents' newest AI features are cloud-first, which collides head-on with the one requirement these customers will not bend on. More on that in a moment.

Then there's the startup wave, and it is loud. At the last DAC and DVCon, a DV lead I work with had been pitched eight different "agentic AI for chip design" companies in a single quarter. ChipAgents has raised $74M as of early 2026 and claims ten-times design-and-verification productivity. Normal Computing raised $50M led by Samsung's Catalyst Fund, building auto-formalization — LLMs that translate engineer intent directly into formal properties and prove them — and says half of the world's top-ten semiconductor design firms are using it. There are more: MooresLabAI generating full testbenches, Bronco on regression analysis, Silimate on power-performance-area prediction.

Some of these are real. None of them solves the customer's actual problem, which is that they've now bought three of these point tools and have no idea how to make them work together inside the sign-off flow they already trust.

The reinforcement-learning placement story nobody wants to say out loud

There's a parallel temptation in chip design beyond verification: using reinforcement learning to do the physical floorplanning — deciding where the big blocks sit on the die. It's seductive, and the most-cited result is contested in a way most pitches conveniently skip.

The 2020 Google Nature paper claimed RL beats simulated annealing for macro placement, and it's been used in real silicon — TPU generations shipped with it. But in 2023, Nature added an editorial note after methodological concerns were raised. Igor Markov, now at Synopsys, published a line-by-line critique whose headline comparison is hard to unsee: the RL approach took 32.31 hours, a tuned simulated-annealing baseline took 12.5 hours, and a commercial Cadence tool finished the same job in 0.05 hours. DeepMind fired back with a paper titled "That Chip Has Sailed." Years on, no independent external replication has confirmed the original claims, and both the critique and the rebuttal are still live.

I bring this up because of what happens when a consultant pitches RL placement to a real formal team and pretends the controversy doesn't exist. They smell it inside ten minutes, and you've lost the room. The honest position is that there are genuine niches — chiplet and 3D-IC thermal-aware floorplanning, analog layout, RISC-V IP optimization on open tooling — where a hybrid approach earns its keep. A frontal assault on the incumbents' placement engines does not. Knowing the difference, and saying it out loud, is the whole job.

What changed after our tool fooled us

On-prem pipeline: fine-tuned LLM, your formal engine, a vacuity and coverage gate, then PROVEN.

So here's where the vacuous-proof disaster led us. We stopped trying to build a better AI that writes Verilog, because that's a crowded race with eight funded runners, and started building the thing none of them sells: the vendor-neutral integration layer that makes a team's existing tools trustworthy.

Concretely, that means a few decisions, each forced by a failure we'd seen. We fine-tune an open-weight model — Qwen Coder, Llama, whichever fits — on the customer's own RTL corpus, their specs, their past bug history, because a model that hasn't seen your house style hallucinates against it. We wrap it around whatever formal engine the team already owns: JasperGold, VC Formal, Questa Formal, or open-source SymbiYosys. We never ask them to rip out the gold-standard tool they already trust. And every property the model proposes runs through vacuity and coverage checks before anyone is allowed to read the word "proven" — because we learned the expensive way that an unguarded proof is theater.

The pre-verified assertion libraries matter just as much as the model. For a RISC-V core, the formal harness ships with the AXI4, AHB, and TileLink compliance checks, the pipeline-hazard assertions, the load-store scoreboarding, the debug-unit correctness properties — the things that catch protocol and semantic hallucination before silicon does. This is not speculative. The formal consultancy Axiomise has found 65+ bugs in Ibex, the open RISC-V core inside Google's OpenTitan, including branch-instruction bugs in the debug unit that only formal could catch. Formal works on RISC-V. The scarce thing isn't the method; it's a team that knows how to wield it.

The real opening here was never a smarter Verilog model. It was an honest harness wrapped around the expensive tools a team already trusts.

Why won't you just run it in the cloud?

People ask me this constantly, and the answer is the same reason every cloud-first EDA-AI startup is dead on arrival with the customers I care about: RTL is the crown jewels, and it does not leave the building.

A fabless company's register-transfer-level code is the company. Defense and aerospace customers need it air-gapped, sometimes in a secure facility. Commercial fabless teams want on-prem, virtual-private-cloud at the absolute minimum. So we run everything on the customer's own hardware — vLLM or similar inference on their existing H100 or H200 cluster, local retrieval over their own corpus, no RTL crossing the network boundary, ever. The incumbents' newest agentic features are cloud-first; that's a feature for them and a non-starter for an IP-sensitive chip team. The biggest opening I see isn't an algorithms problem at all. It's a deployment-and-integration problem that the people with the best algorithms have decided not to solve.

The other question comes from the automotive teams, and it's the one that decides everything for them. Ship a chip into a car at ASIL C or D — the safety integrity levels where ISO 26262 stops merely suggesting formal verification and starts mandating it — and the tool itself has to carry a TCL2 or TCL3 qualification package. The incumbents' core engines have that third-party certification; a brand-new AI tool does not. I've watched a genuinely better tool get benched in an automotive program for exactly this reason: no qualification package, no place in the sign-off flow, end of conversation. So we don't try to be the qualified tool. We help teams run AI assistance alongside the qualified incumbent engines without breaking the qualification chain — because in automotive, a technically superior tool that can't survive an audit is worth nothing.

The certificate is not the chip

I keep coming back to those two days we spent trusting a green checkmark that meant nothing. The lesson wasn't "LLMs are bad at Verilog." They're getting better every quarter. The lesson was that in hardware, the gap between looks verified and is verified is measured in eight-figure mask sets and six-month slips — and the tools rushing into this space are mostly racing to make the "looks verified" part faster.

On a $10M mask set, a "proven" you can't trace to a non-vacuous proof, run on hardware you control, against assertions written by someone who's read page 84 of the spec — that's not verification. It's a hope with good production values. If you want to see how we tell the difference, it's all laid out at veriprajna.com/solutions/semiconductor-ai-verification.

The chip doesn't care what the report says. It only does what the silicon was actually built to do. Verification is the discipline of making those two things the same — and the moment an AI tells you they're the same in milliseconds, that's exactly when you should reach for the spec.

Related Research