[Image: A silicon chip contrasted with mathematical proof notation.]
Artificial Intelligence · Semiconductor · Hardware

The $10 Million Bug That AI Wrote — And Why I Built a Company to Kill It

Ashutosh Singhal · March 2, 2026 · 16 min read

I remember the exact moment I decided to start Veriprajna.

It wasn't a flash of inspiration. It was a phone call. The kind where nobody speaks for the first five seconds because the person on the other end is trying to figure out how to say what they need to say. A design team — people I respected, people who were genuinely good at their jobs — had just gotten first silicon back from the foundry on a custom RISC-V accelerator. The chip was dead. Not "needs a workaround" dead. Not "we can patch it in firmware" dead. Bricked. Under a specific, rare alignment of thermal throttling and high-bandwidth memory traffic, the arbitration module entered an undefined state and the whole thing deadlocked.

The root cause was a race condition. A single, subtle mismatch between blocking and non-blocking assignments in the RTL — the kind of thing that simulates perfectly, lints without a whisper, passes every regression test you throw at it, and then betrays you in silicon where there are no second chances.

The mask set for that 5nm chip cost roughly $10 million. Gone. But the real damage wasn't the masks. It was the six months needed to diagnose, fix, re-verify, and re-fabricate. In the AI accelerator market, where product generations last about 18 months, a six-month slip can erode 50% of a product's total lifetime gross profit. For a company targeting $100 million in revenue, that race condition didn't cost $10 million. It cost $50 million.

And here's the part that kept me up at night: the code that caused it was generated by an LLM.

The Gold Rush That Nobody's Questioning

Right now, the semiconductor industry is in the middle of a gold rush. Everyone — and I mean everyone — is racing to plug Large Language Models into the Electronic Design Automation (EDA) workflow. The pitch is irresistible: take design cycles that used to last years and compress them to months. Democratize chip design. Let AI handle the tedious register-transfer level coding so engineers can focus on architecture.

I get the appeal. I've felt it myself. The first time I watched an LLM generate a syntactically correct Verilog module from a natural language prompt, I thought: this changes everything.

But then I started looking more carefully at what "syntactically correct" actually means when you're writing hardware instead of software. And I realized the industry was building on a foundation of sand.

The tools flooding the market right now are what I call "Wrappers." They take a general-purpose LLM — GPT-4, Claude, Llama — wrap it in a chat interface with some Verilog-flavored system prompts, and sell it as a "Chip Design Copilot." Some of them are genuinely impressive demos. The code looks clean. It compiles. It even simulates.

But here's what these tools fundamentally are: stochastic token predictors with a hardware skin. They don't understand circuit topology. They don't understand timing closure. They don't understand metastability. They predict the next likely token based on statistical patterns in their training data.

When an LLM hallucinates in software, you get a runtime error you can patch over the air. When an LLM hallucinates in hardware, you get a $10 million paperweight.

That asymmetry is the entire reason Veriprajna exists.

Why Does AI Write Bad Hardware Code?

[Image: Side-by-side comparison of how an LLM's sequential bias misinterprets concurrent hardware execution, using the blocking vs. non-blocking assignment example from the article.]

This is the question I get asked most often, usually by software engineers who've seen LLMs write perfectly functional Python and can't understand why Verilog is different. The answer goes deeper than "hardware is harder." It's a fundamental mismatch between how LLMs think and how hardware works.

LLMs are trained overwhelmingly on software — Python, Java, C++, JavaScript. These are imperative, sequential languages. Line A executes, then Line B. The state of the system is defined by the order of operations. LLMs have internalized this paradigm so deeply it's basically their native tongue.

Verilog and VHDL are declarative and concurrent. In a hardware module, every always block, every assign statement, every module instantiation executes simultaneously and continuously. The order of lines in the source code often has zero bearing on the order of execution in the silicon.

I watched this play out in real time during one of our early experiments. We asked three different frontier LLMs to implement a simple three-stage pipeline register. All three used blocking assignments (=) instead of non-blocking assignments (<=). The code looked perfectly reasonable if you squinted at it like C code. But in hardware, blocking assignments in a clocked block mean stage2 gets updated with stage1's value immediately, and then stage3 gets the new value of stage2 — effectively collapsing the three-stage pipeline into a single cycle of latency.

One of my engineers, Priya, stared at the output for a minute and said, "It's writing C with semicolons in the wrong places." She was exactly right. The LLM had sequential bias — it was treating Verilog like a programming language when it's actually a description language. The distinction is subtle syntactically and catastrophic physically.
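The collapse is easy to reproduce outside any simulator. Here is a minimal Python sketch of the two assignment semantics, an illustrative model of a single clock edge rather than real event-driven simulation:

```python
def clock_edge_blocking(state, din):
    """Blocking (=) semantics: each assignment takes effect immediately,
    so later assignments see values written earlier in the same edge."""
    state = dict(state)
    state["stage1"] = din               # stage1 updates first...
    state["stage2"] = state["stage1"]   # ...so stage2 sees the NEW stage1
    state["stage3"] = state["stage2"]   # ...and stage3 sees the NEW stage2
    return state

def clock_edge_nonblocking(state, din):
    """Non-blocking (<=) semantics: all right-hand sides are sampled
    before any register updates, so each stage sees the OLD values."""
    old = dict(state)
    return {"stage1": din, "stage2": old["stage1"], "stage3": old["stage2"]}

init = {"stage1": 0, "stage2": 0, "stage3": 0}
state_b = clock_edge_blocking(init, 7)
state_nb = clock_edge_nonblocking(init, 7)
print(state_b)   # {'stage1': 7, 'stage2': 7, 'stage3': 7} -- pipeline collapsed
print(state_nb)  # {'stage1': 7, 'stage2': 0, 'stage3': 0} -- one stage per cycle
```

One clock edge, and the blocking version has flushed the new input through all three stages at once. That is the bug that "looks like C" to a sequentially-biased model.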

And it gets worse. Hardware design relies on strict interface protocols — AXI, AHB, PCIe, TileLink — with complex temporal rules. "Valid must not wait for Ready." "Grant must be asserted within 5 cycles." LLMs can generate code that respects these rules 90% of the time, which sounds great until you realize that the 10% it gets wrong are precisely the corner cases that kill chips. An AXI master that retracts WVALID before WREADY has accepted the transfer, a genuine sub-clause violation, won't throw a syntax error. It'll compile, simulate, and then hang when connected to a compliant memory controller in silicon.
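To make the temporal-rule idea concrete, here is a toy Python trace checker for a bounded-response rule like "Grant must be asserted within 5 cycles." The (req, gnt) trace format is an illustrative assumption, not any real tool's interface:

```python
def check_grant_latency(trace, max_wait=5):
    """Toy checker for a bounded-response rule: once REQUEST is seen without
    GRANT, GRANT must arrive within max_wait cycles.
    `trace` is a list of (req, gnt) samples, one per clock cycle."""
    waiting_since = None
    for cycle, (req, gnt) in enumerate(trace):
        if waiting_since is None and req and not gnt:
            waiting_since = cycle            # start the countdown
        if waiting_since is not None:
            if gnt:
                waiting_since = None         # request served in time
            elif cycle - waiting_since >= max_wait:
                return f"violation: no grant within {max_wait} cycles of cycle {waiting_since}"
    return "ok"

# Legal: the grant arrives two cycles after the request.
print(check_grant_latency([(1, 0), (0, 0), (0, 1)]))   # ok
# Illegal: the request is never granted.
print(check_grant_latency([(1, 0)] + [(0, 0)] * 6))    # violation: ...
```

A real SVA property says the same thing declaratively; the point is that the rule lives in time, across cycles, where a token predictor has no native footing.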

The training data problem compounds everything. The volume of high-quality, production-grade Verilog available for training is orders of magnitude smaller than Python or JavaScript. Much of the open-source Verilog on GitHub is student projects, abandoned prototypes, toy implementations that would never survive a tape-out review. When you train on mediocre data, you get mediocre outputs — outputs that look professional but contain the DNA of amateur mistakes.

The Rule That Haunts Every Chip Designer

[Image: The Rule of Ten: exponential cost escalation of bug fixes across five design stages, with the dollar amounts from the article.]

There's a heuristic in semiconductor design called the "Rule of Ten," and once you understand it, you understand why I'm so obsessed with catching bugs early.

The cost to fix a defect increases by 10x at each subsequent stage of the design lifecycle. A bug caught during RTL design costs about $100 to fix — someone edits a file and reruns a check. The same bug caught during block verification costs $1,000. At system verification, $10,000. If it escapes to post-silicon validation — when you're debugging actual chips in a lab — you're looking at $10 million or more for a respin. And if it reaches customers in the field? That's $100 million territory. Recalls, lawsuits, brand destruction. Ask Intel about the Pentium FDIV bug.

The Wrapper AI tools operate almost exclusively at the RTL design stage. They help engineers write code faster. But because they lack any verification capability beyond "does it compile," they inject bugs that sail through block and system verification, only to detonate in silicon.

Here's the cruel irony: by increasing the velocity of code generation without increasing the rigor of verification, these tools accelerate the injection of high-cost defects into the pipeline. You're not just moving fast and breaking things. You're moving fast and baking bugs into $10 million mask sets.

Industry data bears this out. Only 32% of designs achieve first-silicon success. The remaining 68% require at least one respin, and the primary cause is logic and functional flaws — exactly the kind of errors LLMs generate when they hallucinate protocols or misunderstand concurrency.

I explained this to an investor once, early in our fundraising. He listened patiently, then said: "Can't you just use GPT-4 with better prompts?"

I pulled up the pipeline register example. I showed him the blocking assignment bug. I showed him that it passed linting, passed simulation, passed every automated check the Wrapper tools offered. Then I showed him what it would do in silicon.

He didn't ask about better prompts again.

What If You Could Prove Code Correct Instead of Just Testing It?

This is where the story turns. Because the answer to the LLM hallucination problem isn't better prompts, bigger models, or more training data. It's a fundamentally different approach to verification.

Traditional verification relies on simulation — you write testbenches, run millions of cycles, and check if the design does what you expect. This is like testing a car's brakes by driving around the block a thousand times. If the brakes don't fail, you assume they're safe. But what if they only fail when it's raining, you're going exactly 62 mph, and the radio is tuned to a specific frequency? Simulation can only verify the scenarios it explicitly tests. Everything else is a prayer.

Formal Verification doesn't run the design at all. It converts the entire design into a mathematical formula and uses Satisfiability Modulo Theories (SMT) solvers — tools like Microsoft's Z3 — to exhaustively prove that a property holds under every possible input combination and internal state. Every single one. Not a sample. Not a statistical approximation. A mathematical proof.

Simulation asks: "Does this work in the cases I tested?" Formal verification asks: "Is there any possible case where this fails?" The difference is the difference between hope and proof.

When the solver returns "UNSAT" — unsatisfiable — it means no counterexample exists. The property is mathematically guaranteed. When it returns "SAT," it hands you a specific sequence of inputs that breaks your design, down to the exact clock cycle.
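A real SMT solver does this symbolically, but the idea shows up even in a brute-force toy. The sketch below checks a "never decreases" property of a 3-bit counter over every state and every input, the same exhaustive sweep a solver performs without the clever math (the counter and property are illustrative, not from any real flow):

```python
def saturating_step(count, inc):
    """Intended design: a 3-bit counter that saturates at 7."""
    return min(count + 1, 7) if inc else count

def buggy_step(count, inc):
    """Buggy variant: wraps back to 0 instead of saturating."""
    return (count + 1) % 8 if inc else count

def check(step, prop):
    """Brute-force analogue of a solver: test the property over EVERY
    state and input combination. Not a sample. All of them."""
    for count in range(8):                     # every possible state
        for inc in (0, 1):                     # every possible input
            if not prop(count, step(count, inc)):
                return ("SAT", (count, inc))   # a concrete counterexample
    return ("UNSAT", None)                     # no counterexample exists

monotone = lambda old, new: new >= old         # property: never decreases
print(check(saturating_step, monotone))  # ('UNSAT', None)
print(check(buggy_step, monotone))       # ('SAT', (7, 1)) -- wraps at 7
```

For eight states this is trivial; the entire discipline of formal verification is making the same exhaustive guarantee tractable when the state space has more configurations than atoms in the universe.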

Formal verification has existed for decades. The reason it hasn't taken over the industry is that writing the formal properties — SystemVerilog Assertions, or SVA — is notoriously difficult. It requires a specialized skill set that most design teams don't have. The assertions are the "contract" for the hardware: "If request goes high, grant must follow within N cycles." "Data read from address X must match the last data written to address X." "The pipeline must never deadlock." Writing these correctly is an art form, and there aren't enough practitioners to go around.

Which is exactly where AI becomes useful — not for writing the hardware code, but for writing the proof.

The "Formal Sandwich" — How We Actually Built This

[Image: Architecture of the Formal Sandwich loop: the specification flows into dual artifact generation (RTL plus assertions), then formal solver verification, with a counterexample feedback loop back to the LLM.]

I spent months arguing with my team about the right architecture. The debate was fierce and, in retrospect, clarifying. One camp wanted to fine-tune an LLM until it generated correct Verilog by default. The other camp — the one I eventually sided with — argued that correctness-by-training was a fantasy. You can't train away hallucination. You can only catch it.

We landed on what we call the "Formal Sandwich" — a neuro-symbolic architecture where the LLM is the creative engine and a formal verification solver is the unyielding critic. Neither works alone. Together, they do something neither can do independently.

Here's how it works in practice. A designer provides a specification — "Design an APB-to-AXI bridge" or even a timing diagram screenshot. Our Spec Analyzer agent decomposes this into functional requirements. Then comes the key innovation: instead of generating just code, the LLM generates two artifacts simultaneously.

Artifact A is the RTL implementation — the Verilog code itself. Artifact B is the formal specification — a set of SVA properties derived from the same requirements. If the spec says "Grant must follow Request," the LLM generates the state machine and the assertion that proves the state machine does what it claims.

Then we unleash the solver. It takes Artifact A and tries to break it using Artifact B. First, a vacuity check — making sure the assertions aren't trivially true (a "lazy" generation where the trigger condition never fires). Then bounded model checking, exploring deep state spaces — 50, 100 cycles deep — hunting for deadlocks, race conditions, protocol violations.

If the solver finds a bug, it doesn't just flag it. It produces a counter-example trace — a precise waveform showing exactly how the bug manifests. And here's where the loop closes: we feed that trace back into the LLM as a prompt. "Your design failed. Here's the trace: Cycle 1, Reset deasserts. Cycle 2, Request goes high. Cycle 10, Grant is still low. The grant never arrived. Fix the state machine."

The LLM analyzes the trace, identifies the missing state transition, rewrites the code. The solver checks again. This loop repeats automatically until the design is proven correct.
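Stripped of the EDA machinery, the loop's control flow looks like this. Everything here is a toy stand-in: toy_llm and toy_solver are hypothetical placeholders for the model call and the formal solver, and the "design" is just a next-state function for the counter from earlier:

```python
def formal_sandwich(generate, solve, spec, max_iters=5):
    """Skeleton of the generate/check/repair loop. `generate` stands in
    for the LLM, `solve` for the formal solver; both are toy stand-ins."""
    design = generate(spec, counter_example=None)
    for _ in range(max_iters):
        verdict, trace = solve(design)
        if verdict == "UNSAT":
            return design                      # proven correct: ship it
        design = generate(spec, counter_example=trace)  # repair prompt
    raise RuntimeError("did not converge to a proven design")

def toy_solver(step):
    """Check 'the counter never decreases' over every state."""
    for c in range(8):
        if not step(c) >= c:
            return "SAT", c                    # counterexample state
    return "UNSAT", None

def toy_llm(spec, counter_example):
    if counter_example is None:
        return lambda c: (c + 1) % 8           # first draft: wraps (buggy)
    return lambda c: min(c + 1, 7)             # repaired: saturates

fixed = formal_sandwich(toy_llm, toy_solver, "3-bit counter that never wraps")
print(fixed(7))  # 7 -- the repaired design saturates instead of wrapping
```

The first draft fails, the solver hands back the failing state, and the second draft is proven before it leaves the loop. The real system does this with RTL, SVA, and a solver instead of lambdas, but the control flow is the same.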

I wrote about this architecture in much more depth in the interactive version of our research, but the core insight is simple: we use AI to write the proof, and math to check the AI. Neither trusts the other. Both make the other better.

The Bugs That Made Me a Believer

I became a true believer in this approach not through theory but through specific bugs we caught that nothing else would have found.

The open-source RISC-V community has produced genuinely excellent processor cores — Ibex (used in Google's OpenTitan security chip), the PULP platform from ETH Zurich. These are heavily scrutinized designs with real engineering talent behind them. And they still contain bugs that only formal verification can find.

Axiomise, a formal verification consultancy, found a bug in the Ibex core where a debug request arriving at a specific cycle during a branch instruction could cause the core to deadlock or execute the wrong instruction. Think about that — a security-critical core, reviewed by dozens of engineers, and a formal tool found a bug that simulation missed entirely.

In the PULP platform, a bug was found where the AXI interconnect could starve a bus master indefinitely under a specific "busy" pattern of AWVALID and AWREADY interactions. A classic liveness failure — the system doesn't crash, it just stops making progress. You'd never write a directed test for that specific interaction pattern. There are too many possible patterns to enumerate.

When we point Veriprajna at a RISC-V Load-Store Unit, it automatically generates assertions for interface compliance ("if valid is asserted, it must remain high until ready"), data integrity ("data read from address X matches the last write to address X"), and forward progress ("the unit must eventually return a response"). These aren't afterthoughts bolted onto the code. They're generated alongside the code, from the same specification, and enforced before a single line of RTL leaves our system.
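Here is a toy of the data-integrity property in that list: a deliberately buggy memory whose reads bypass a one-deep write buffer, caught by an exhaustive sweep over short operation sequences. The memory model and trace format are illustrative assumptions, not our actual engine:

```python
from itertools import product

class BuggyMem:
    """Toy memory with a one-deep write buffer. Reads bypass the buffer,
    so a read issued right after a write can return stale data."""
    def __init__(self):
        self.array, self.pending = {}, None
    def write(self, addr, data):
        if self.pending is not None:            # drain the previous write
            self.array[self.pending[0]] = self.pending[1]
        self.pending = (addr, data)
    def read(self, addr):
        return self.array.get(addr, 0)          # BUG: ignores the buffer

def check_data_integrity(depth=3):
    """Check 'a read from X returns the last write to X' over every
    operation sequence up to `depth` ops: a bounded sweep in the
    spirit of bounded model checking."""
    ops = [("w", a, d) for a in (0, 1) for d in (0, 1)] + \
          [("r", a) for a in (0, 1)]
    for seq in product(ops, repeat=depth):
        mem, shadow = BuggyMem(), {}            # design + golden reference
        for op in seq:
            if op[0] == "w":
                mem.write(op[1], op[2])
                shadow[op[1]] = op[2]
            elif op[1] in shadow and mem.read(op[1]) != shadow[op[1]]:
                return ("SAT", seq)             # concrete failing sequence
    return ("UNSAT", None)

verdict, trace = check_data_integrity()
print(verdict, trace)  # SAT, with the exact write/write/read sequence
```

No directed test would think to probe this exact write-then-read interleaving; the exhaustive sweep finds it because it doesn't have to think of it.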

For the full technical breakdown of our methodology and the formal verification engine, see our detailed research paper.

"But Formal Verification Doesn't Scale"

People always push back on this point, and I understand why. Formal verification has a reputation for being computationally explosive — the state space of a modern SoC is astronomically large, and naive formal approaches choke on anything bigger than a toy design.

We've spent significant effort on this. Our system uses automated abstraction techniques to make formal tractable at scale. Black-boxing lets us verify the glue logic while treating large sub-blocks like RAMs or complex ALUs as abstract entities with defined interfaces. Cut-points break valid/ready handshake paths so we can verify flow control independently of data processing. Symmetry reduction lets us prove a property for one channel of a multi-port router and mathematically induce it for all N channels.
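The black-boxing idea can be shown in miniature: prove a property of the glue logic for every possible behavior of an abstracted 1-bit sub-block, so the proof holds no matter what the real sub-block does. This is a toy sketch, not our production abstraction machinery:

```python
from itertools import product

def glue(inp, en, box):
    """Glue logic under verification: bypasses the sub-block when en == 0."""
    return box(inp) if en else inp

def check_bypass_for_all_boxes():
    """Black-boxing in miniature: verify the bypass property against EVERY
    possible 1-bit sub-block (all four functions {0,1} -> {0,1}), so the
    result holds for any concrete implementation of the box."""
    all_boxes = [lambda x, t=t: t[x] for t in product((0, 1), repeat=2)]
    for box in all_boxes:
        for inp in (0, 1):
            if glue(inp, en=0, box=box) != inp:  # property: bypass is exact
                return ("SAT", inp)
    return ("UNSAT", None)

print(check_bypass_for_all_boxes())  # ('UNSAT', None)
```

Because the proof quantifies over the box's behavior rather than one implementation of it, the large sub-block never has to enter the solver's state space at all. That is the entire trick, scaled down to one bit.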

Is it solved completely? No. Analog physics will always present challenges that formal methods can't touch. But the logic bugs — the race conditions, the deadlocks, the protocol violations — become mathematically impossible in the generated code. And those are the bugs that cause respins.

The other objection I hear is about speed. "Doesn't running a formal solver slow down the design process?" Yes, it adds computational cost. But I'll trade compute time for schedule certainty every single day. A formal solver running for an extra hour is infinitely cheaper than a six-month respin.

The Uncomfortable Truth About "AI-Designed Chips"

There's a narrative gaining traction in the industry — that AI will soon design chips end-to-end, that we're moving from Computer-Aided Design to Computer-Automated Design. I believe that narrative is directionally correct but dangerously incomplete.

We're building toward agentic workflows where autonomous AI agents collaborate — an Architect agent for high-level partitioning, an RTL Coder for implementation, a Verification Engineer for writing testbenches and assertions, a Manager for orchestrating the flow against power, performance, and area constraints. We use Retrieval-Augmented Generation (RAG) not just for code but for knowledge — pulling specific protocol rules, process design kit constraints, and internal coding standards so the LLM generates compliant code without hallucinating.

But none of this works — none of it — without a formal verification backbone. The more autonomous the AI becomes, the more critical it is that every output is mathematically verified before it moves downstream. An AI agent that generates code faster is only valuable if that code is correct. An AI agent that generates code faster and proves it correct? That's the future.

The semiconductor industry's problem isn't that AI is too slow at writing hardware code. It's that AI is too fast at writing hardware code that's subtly wrong.

We are not a copilot. We are not a chatbot. We are a formal verification foundry that happens to use generative AI as its front end. The distinction matters because it determines what you're optimizing for. Copilots optimize for speed. We optimize for correctness. In a world where a single escaped bug costs $10 million in masks and $50 million in lost revenue, I know which optimization I'd choose.

The Choice Is Already Made

The semiconductor industry can no longer afford the "generate and pray" approach. The Rule of Ten is not a suggestion — it's physics and economics conspiring to punish anyone who ships unverified silicon. At 5nm and below, with mask sets approaching $20 million, the margin for error has collapsed to zero.

Every week, I talk to design teams who are excited about what LLMs can do for their productivity and terrified about what LLMs might do to their tape-out schedule. They're right to feel both things simultaneously. The technology is genuinely transformative. It is also genuinely dangerous without a safety net.

Veriprajna is that safety net. We give you the speed of AI with the certainty of math. Not "probably correct." Not "passed regression." Proven correct, across every possible input, every possible state, every possible corner case that a simulation would never think to test.

The choice facing every chip designer today isn't whether to use AI. That ship has sailed. The choice is whether to use AI that can prove its own work, or AI that just hopes for the best.

I know which one I'd bet $10 million on.
