
418 Kilovolts Looked Fine. Then 60 Million People Lost Power.
On the afternoon of April 28, 2025, the screens in a Spanish control room said everything was fine.
The 400 kilovolt transmission buses were reading around 418 kV — high, but inside the limits an operator is trained to tolerate. Then, in five seconds, 15 gigawatts of generation tripped offline and 60 million people across Spain and Portugal lost power. Some of them stayed dark for ten hours.
I have spent my career around interconnection cluster studies and capacity-auction post-mortems, and when the ENTSO-E final report came out in March 2026, I read it the way you read an autopsy of someone you knew. The headlines had already decided the story: too much solar, not enough spinning metal, renewables can't hold a grid together. The report says, in plain language, that this framing is wrong. The grid didn't fail because of what it was generating. It failed because of what nobody was watching. That single distinction is the reason my team builds power grid AI the way we do.
The blind spot was two voltage levels down

The investigation found something that kept me up at night.
Starting late morning, low-frequency oscillations appeared across the Spanish grid at 0.21 Hz and 0.63 Hz. Operators did the textbook thing, disconnecting shunt reactors to manage transient undervoltages, which quietly drained the grid's capacity to absorb reactive power. Around midday they energized parallel 400 kV circuits and switched HVDC links to fixed-power mode. Impedance dropped, voltages climbed. The 400 kV displays still read 418 kV. High, but within limits.
Two layers down, at the 220 kV collector substations where wind and solar farms actually connect, voltage was hitting 242 kV. The transformer tap-changers — the mechanical devices that nudge voltage back into range — couldn't move fast enough. And nobody saw it, because real-time monitoring effectively stopped at the transmission level. Then one large generation facility, during an overvoltage event, injected reactive power into the grid instead of absorbing it, exactly the opposite of what its grid code required. That was the positive feedback loop. Protective relays began tripping in a cascade, and the lights went out.
The renewables weren't unstable. The observability was.
I keep coming back to that 418 kV number. An operator looking at a calm transmission reading had no instrument telling them the collector level was already past the point of no return. The sensors to see 220 kV exist. The analytics to interpret them in real time, fast enough to act, did not. That is not a generation problem. It is an information problem wearing a generation problem's clothes.
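In per-unit terms the gap is stark. Here is a minimal sketch using only the two voltages quoted above; the 1.05 pu alarm threshold and the function names are assumptions for illustration, not values from the ENTSO-E report or from any grid code.

```python
def per_unit(measured_kv: float, nominal_kv: float) -> float:
    """Express a measured voltage as a fraction of the bus's nominal voltage."""
    return measured_kv / nominal_kv


def flag_overvoltage(readings, limit_pu=1.05):
    """Return the labels of buses whose per-unit voltage exceeds the limit.

    `readings` maps a bus label to (measured_kv, nominal_kv).
    `limit_pu` is a hypothetical alarm threshold chosen for illustration.
    """
    return [
        name
        for name, (measured_kv, nominal_kv) in readings.items()
        if per_unit(measured_kv, nominal_kv) > limit_pu
    ]


# The two readings quoted in the text.
readings = {
    "400 kV transmission bus": (418.0, 400.0),
    "220 kV collector substation": (242.0, 220.0),
}

for name, (measured_kv, nominal_kv) in readings.items():
    print(f"{name}: {per_unit(measured_kv, nominal_kv):.3f} pu")

print("flagged:", flag_overvoltage(readings))
```

On these numbers the transmission bus sits at about 1.045 pu while the collector substation is at 1.10 pu, so only the second is flagged. The arithmetic is trivial; the hard part is having the 220 kV measurement in front of the operator in real time.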
I was sure the answer was an autonomous controller. I was wrong.
When my team first took this on, I had a confident, tidy idea: build a physics-informed neural network — a model that bakes the actual equations of power flow into its training so it can't recommend something physically impossible — and let it manage voltage and reactive power faster than any human or legacy controller could.
The research momentum behind these models is real. Physics-informed neural network publications went from fewer than ten in 2019 to around 820 in 2025. We built a prototype. On most scenarios it was genuinely impressive — it generalized better than the standard models and respected the physics.
Then we ran it through the edge cases. And in one of them, under exactly the kind of oscillation stress the Iberian grid saw, it produced a recommendation that was physically infeasible. Not catastrophically wrong in an obvious way — subtly wrong, the kind of wrong that looks plausible on a dashboard.
I sat with that for a while. Then I did the thing that ended the whole approach: I imagined that recommendation arriving in a real control room, at 12:32 on a day like April 28, with a cascade closing in and an operator deciding in seconds whether to trust it. There is no commercially deployed physics-informed neural network controlling a real grid anywhere in the world as of early 2026, and after that test I understood exactly why. An idealized equation set omits the multiscale messiness — boundary complexity, nonlinear coupling — that shows up precisely when the grid is stressed and you need the model most. Two unsolved problems compound it: balancing the loss-function weights that trade physics fidelity against data fit, and quantifying when the model is uncertain — neither of which you want to discover live.
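The weighting problem can be shown with a toy example. The sketch below is not the prototype described above. It assumes a lossless two-bus line with P = V1 * V2 * sin(delta) / X in per-unit, a single predicted quantity (receiving-end voltage), and made-up operating values. It only shows how the loss weight decides whether a physically inconsistent prediction wins.

```python
import math


def power_flow_residual(p_pu, v1_pu, v2_pu, delta_rad, x_pu):
    """Mismatch against the lossless two-bus equation P = V1*V2*sin(delta)/X."""
    return p_pu - v1_pu * v2_pu * math.sin(delta_rad) / x_pu


def physics_informed_loss(pred_v2, observed_v2, operating_point, weight):
    """Squared data error plus a weighted squared physics residual."""
    data_term = (pred_v2 - observed_v2) ** 2
    residual = power_flow_residual(v2_pu=pred_v2, **operating_point)
    return data_term + weight * residual ** 2


# Hypothetical operating point: 1.0 pu transfer over X = 0.2 pu,
# which the equation satisfies when V2 = 1.0 pu.
operating_point = {
    "p_pu": 1.0,
    "v1_pu": 1.0,
    "delta_rad": math.asin(0.2),
    "x_pu": 0.2,
}
observed_v2 = 1.05  # a biased measurement that disagrees with the physics

for weight in (0.1, 10.0):
    fits_data = physics_informed_loss(1.05, observed_v2, operating_point, weight)
    fits_physics = physics_informed_loss(1.00, observed_v2, operating_point, weight)
    winner = "data-fitting" if fits_data < fits_physics else "physics-consistent"
    print(f"weight={weight}: lower loss for the {winner} prediction")
```

With the small weight the loss rewards matching the bad measurement; with the large one it pulls back toward the feasible value. Because the physics enters as a penalty rather than a hard constraint, nothing in the loss itself guarantees feasibility.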
An investor had told me earlier that grid operators would never adopt AI. He was wrong about the reason, but right about the instinct. The reason isn't that operators are Luddites. It's that they are personally accountable for 60-million-person outcomes, and a black box that is occasionally, confidently, invisibly wrong is worse than no box at all.
We stopped trying to build something that replaces the operator's judgment and started building something that extends the operator's sight.
That pivot — from autonomous controller to physics-informed advisory layer — is the whole architecture now. The model surfaces what the operator can't currently see, explains why it's flagging it, and leaves the setpoint in human hands. Argonne National Laboratory's GridMind project, a multi-agent system for control-room operators, landed on the same design principle from the research side: the agents don't replace operators, they give explainable recommendations. When independent teams converge on the same constraint, it's usually because the constraint is real.
Why doesn't the existing grid software already do this?
It's a fair question, because the grid is not a greenfield. Some of the most capable software companies on earth already live here.
GE Vernova's GridOS platform is installed across more than 80% of US utilities; it prevented an estimated 112 million customer-minutes of interruption for Alabama Power in 2025 alone. Siemens' Gridscale X runs grid digital twins and, through an NVIDIA partnership, claims roughly 10,000x acceleration on simulation. These are serious tools built by serious people.
But they were architected for a grid that flowed one direction, from big rotating generators out to passive consumers. The AI capabilities are add-ons layered onto SCADA systems that predate the problem. And the data they ingest is overwhelmingly transmission-level — the 400 kV view that read 418 kV and looked fine. The blind spot isn't a bug in these platforms. It's a consequence of where they were told to look.
Then there's the consulting tier. McKinsey was contracted to redesign ERCOT's interconnection process and filed a "Batch Study" framework for the Texas PUC's February 2026 docket. That's real work. But the big firms advise on process. They deliver strategy decks and vendor evaluations on engagements that run into the millions; they don't build the physics-informed models. The gap between "here is a better process" and "here is a working system that screens your queue" is exactly where we operate.
The same blind spot is about to cost America $163 billion
If you think this is a European story, look at the PJM capacity auction.
In December 2025, for the 2027/2028 delivery year, PJM — the largest grid operator in North America — procured 145,777 MW and fell 6,625 MW short of its reserve-margin target. It was the first time the entire region failed to meet its reliability target. The price hit the auction cap for the third straight time; the single auction cost $16.4 billion. The cumulative capacity cost from 2028 to 2033 is projected at $163 billion. In ComEd's territory alone that's an estimated $70 a month on a residential bill.
Now the part that connects directly to the observability story: of the 5,250 MW of demand growth driving that shortfall, about 5,100 MW is data centers. That load isn't spread evenly — it's concentrated in specific transmission zones in Dominion Virginia, AEP Ohio, and ComEd Illinois. You are building Iberian-blackout conditions — localized stress around critical substations — and then hoping no single one trips during peak. Hope is not an observability strategy.
And the supposed relief valve, new generation, is jammed. The US interconnection queue stood at roughly 2,600 GW in 2025. The median project waits five years to reach commercial operation, and historically only about one in five projects that entered the queue between 2000 and 2018 ever got built. ERCOT's large-load queue alone hit 233 GW — a 269% jump year over year, 77% of it data centers — against just 23 GW of new generation synchronized in 2025. FERC's own analysis found that 68% of the cluster studies meant to fix this were completed late. GridLab estimates that interconnection bottlenecks cost PJM consumers $3.5 billion in a single capacity auction, because cheaper generation that should have cleared simply couldn't connect in time.
You cannot hire your way through 2,600 GW of queue with human engineering hours. The studies will always come back stamped "completed late." This is the second place AI belongs — not controlling the grid, but screening and clustering the queue at a speed humans structurally can't match.
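As a rough picture of what screening and clustering means in code, here is a deliberately simple sketch. The request list, substation names, and headroom figures are all hypothetical, and a real screen would rest on power-flow studies rather than a single headroom number per substation.

```python
from collections import defaultdict


def cluster_by_substation(requests):
    """Sum requested MW per point of interconnection.

    `requests` is an iterable of (substation, requested_mw) pairs.
    """
    totals = defaultdict(float)
    for substation, requested_mw in requests:
        totals[substation] += requested_mw
    return dict(totals)


def screen(clusters, headroom_mw):
    """Split clusters into those within assumed headroom and those needing full study."""
    within, needs_study = {}, {}
    for substation, total_mw in clusters.items():
        # A substation with no headroom entry is treated as having none.
        if total_mw <= headroom_mw.get(substation, 0.0):
            within[substation] = total_mw
        else:
            needs_study[substation] = total_mw
    return within, needs_study


# Hypothetical queue entries and headroom, for illustration only.
requests = [
    ("sub-north", 150.0),
    ("sub-north", 300.0),
    ("sub-east", 80.0),
    ("sub-west", 500.0),
]
headroom_mw = {"sub-north": 400.0, "sub-east": 200.0, "sub-west": 650.0}

within, needs_study = screen(cluster_by_substation(requests), headroom_mw)
print("within headroom:", within)
print("needs full study:", needs_study)
```

The point of a first pass like this is triage: the requests that clearly fit are separated from the clusters that deserve an engineer's hours.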
If dynamic line rating is so cheap, why isn't it everywhere?
There's a third gap, and it's the one that makes me a little crazy, because the technology is already proven.
Dynamic line rating, which puts non-contact sensors on transmission lines so their capacity is rated by actual conditions instead of a conservative worst-case assumption, is not experimental. LineVision is the dominant vendor. On a 345 kV corridor, AES saw a 61% capacity increase. National Grid's Syracuse deployment added 20-30%. The cost is typically quoted at something like 5-7% of a traditional reconductoring project; in the AES case the reported saving was smaller, but still roughly 76% against the conventional upgrade.
So why hasn't every constrained utility blanketed its network with it? Because the hardware vendor sells you sensors, not a plan. Nobody answers the actual planning question: given my specific network, where do I deploy grid-enhancing technology to unlock the most capacity per dollar, and when? That's an optimization problem over a utility's own topology and load patterns — and it's an analytics layer that sits on top of the sensors, not inside them. FERC's Order 1920 now requires utilities to consider grid-enhancing technologies before building new lines, which means this question is about to land on a lot of planning desks at once.
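In its crudest form, that planning question is a budgeted ranking. The sketch below uses a greedy capacity-per-dollar heuristic, which is simple but not guaranteed optimal. The candidate lines, MW gains, and costs are hypothetical, and a real study would account for how an upgrade on one line shifts flows on others.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    line: str        # hypothetical line identifier
    added_mw: float  # estimated capacity unlocked by dynamic line rating
    cost_usd: float  # estimated deployment cost


def rank_by_capacity_per_dollar(candidates, budget_usd):
    """Greedy pick: best MW-per-dollar first, skipping anything that no longer fits."""
    chosen = []
    remaining = budget_usd
    ordered = sorted(candidates, key=lambda c: c.added_mw / c.cost_usd, reverse=True)
    for candidate in ordered:
        if candidate.cost_usd <= remaining:
            chosen.append(candidate)
            remaining -= candidate.cost_usd
    return chosen


# Made-up candidates, for illustration only.
candidates = [
    Candidate("line-A", added_mw=120.0, cost_usd=400_000),
    Candidate("line-B", added_mw=60.0, cost_usd=150_000),
    Candidate("line-C", added_mw=200.0, cost_usd=900_000),
    Candidate("line-D", added_mw=30.0, cost_usd=250_000),
]

plan = rank_by_capacity_per_dollar(candidates, budget_usd=1_000_000)
for candidate in plan:
    print(candidate.line, candidate.added_mw, "MW")
print("total MW:", sum(c.added_mw for c in plan))
```

With these made-up numbers and a $1 million budget, the heuristic picks line-B, line-A, and line-D for 210 MW in total, skipping the largest single upgrade because it no longer fits once the better-ratio options are funded.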
Dynamic line rating is the rare grid intervention that's both the cheapest option and the one most often left on the table.
Where AI actually belongs on the grid

Pulling the threads together, my position is narrower than the hype and, I think, more useful.
AI does not belong in autonomous control of a safety-critical grid: not today, and not on the strength of a model that gets edge cases subtly wrong. It belongs in three places where the current software stack has a structural hole. First, collector-level observability: the sub-transmission analytics that would have caught 242 kV at the 220 kV level while the 400 kV bus looked calm. Second, queue intelligence: screening and clustering 2,600 GW of interconnection requests faster than chronically late cluster studies ever will. Third, a grid-enhancing-technology optimization layer that tells a planner where dynamic line rating and power-flow control actually pay off. Underneath all three, physics-informed models earn their keep as advisory and simulation tools for planning studies, dramatically faster than legacy tools, with the operator still holding the pen.
People ask me whether utilities really want yet another vendor. They don't, and we aren't asking them to replace the ones they have. Utilities have GE, Siemens, and ABB SCADA systems they're not ripping out. What they need is a vendor-neutral AI overlay that works with the installed base, rather than another monolith demanding replacement. That's a deliberate design choice, and it's on the solution page in detail.
They also ask about the regulatory pile-up, and it is real. The EU AI Act classifies AI in grid management as high-risk, with compliance due August 2, 2026 and penalties up to €15 million or 3% of global turnover. NERC's CIP-003-9 takes effect in April 2026. FERC Order 1920's grid-enhancing-technology requirements are landing now. Spain's response to the blackout — Red Eléctrica authorizing 24 renewable facilities for dynamic voltage control under P.O. 7.4, requiring them to demonstrate ±30% reactive-power capability — is exactly the kind of mandate that needs new analytics to actually verify and operate. Compliance and capability aren't separate projects here. The thing that keeps you legal is the same thing that keeps you lit.
The Iberian report closed a chapter by naming, precisely, the moment the grid went blind. I don't read an argument about renewables in it. I read an instrument pointed at the wrong place. Building the instrument that points at the right one — that's the work.


