Grid Intelligence & Resilience

The Grid Cannot Grow Fast Enough.
It Needs to Get Smarter.

PJM fell 6,625 MW short of its reliability target for the first time in history. ERCOT's interconnection queue hit 233 GW with only 23 GW of new generation online. The Iberian blackout wiped out 15 GW in 5 seconds because no one was watching the right voltage level.

These are not isolated incidents. They are symptoms of grids designed for one-directional power flow that now manage bidirectional, intermittent, data-center-driven load patterns with tools built for the last century. We build the AI systems that close the gap between what the grid needs and what its current software can deliver.

$163B: Projected PJM capacity costs, 2028-2033 (NRDC/CUB Analysis, 2025)

2,600 GW: US interconnection queue backlog (Lawrence Berkeley Lab, 2025)

15 GW in 5s: Lost in the 2025 Iberian blackout (ENTSO-E Final Report, March 2026)

What Actually Went Wrong on April 28, 2025

The Iberian blackout is the most instructive grid failure in a decade. Not because of what the headlines said (renewables are unstable), but because of what the ENTSO-E investigation actually found: a specific, preventable failure chain that current monitoring architectures cannot detect.

The Cascade, Step by Step

09:00-12:00

Low-frequency oscillations at 0.21 Hz and 0.63 Hz appear across the Spanish grid. TSOs disconnect shunt reactors to manage transient undervoltages during damping. This depletes reactive power absorption capacity.

12:00-12:31

TSOs energize parallel 400 kV circuits and switch HVDC links to fixed-power mode. Transmission impedance drops, voltages rise. The 400 kV monitoring shows 418 kV. Within nominal limits.

12:31-12:32

The observability gap. While transmission readings look normal, collector-level substations at 220 kV are hitting 242 kV. Transformer tap-changers cannot adjust fast enough. No one sees this because monitoring stops at the transmission level.

12:33:10

One major generation facility injects reactive power into a grid already in overvoltage, instead of absorbing it as P.O. 7.4 requires. The result is a positive feedback loop. Cascading protective trips begin. 15 GW disconnects in 5 seconds. 60 million people lose power.

The lesson is not that renewables are unreliable. The ENTSO-E report explicitly rejects that framing. The lesson is that the monitoring architecture has a blind spot at the collector level, and legacy PI/PID controllers cannot handle the nonlinear dynamics of a low-inertia grid under oscillation stress.

This same pattern applies in the US. PJM's 6,625 MW shortfall is driven by data center load (5,100 MW of the 5,250 MW forecast increase) concentrated in specific transmission zones. Localized stress points in Dominion Virginia, AEP Ohio, and ComEd Illinois create the same conditions for cascading failure if a critical substation trips during peak demand. The question is not whether it will happen, but whether the monitoring is in place to catch it before it cascades.

Who Else Works on This

Grid AI is not a greenfield. Before engaging a consultancy, understand what the incumbents, startups, and national labs are already doing, and where the gaps remain.

Provider | What They Offer | Strengths | Where the Gaps Are
GE Vernova (GridOS) | Full-stack grid management. ADMS, DERMS, digital twins. GridOS for Distribution launched Feb 2026. | Installed in 80%+ of US utilities. Prevented 112M customer-minutes of interruption for Alabama Power in 2025. | Legacy architecture. AI capabilities are add-ons to existing SCADA, not physics-native. Vendor lock-in makes customization expensive.
Siemens (Gridscale X) | Grid digital twins, dynamic security assessment, DLR module. NVIDIA PhysicsNeMo partnership for 10,000x simulation acceleration. | Decades of PSS/E grid modeling. Strong EU presence. Trieste digital twin deployment. | Monolithic platform. Expensive for mid-size utilities. DLR module is narrower than dedicated analytics.
LineVision | DLR sensors and analytics. Non-contact overhead line monitoring. | Dominant DLR vendor. AES: 61% capacity increase on 345 kV. National Grid Syracuse: 20-30% increase. 5-7% the cost of traditional upgrades. | Hardware-focused. Limited analytics for corridor prioritization and planning integration. Does not address queue or stability challenges.
Utilidata + NVIDIA | Karman: AI chip embedded in smart meters. Edge computing for distribution grid. | $60.3M Series C. Portland General Electric and Duquesne Light deployments. Deloitte partnership. 100x processing power vs. traditional meters. | Distribution-focused. Does not address transmission-level stability, interconnection queues, or cross-border resilience.
Argonne GridMind | Agentic AI copilot for control room operators. Multi-agent LLM system for scheduling and outage simulation. | DOE backing (Genesis Mission). Strong research credibility. Explainable recommendations. | Research-stage. Not a commercial product. No utility deployment timeline. Physics constraints are not embedded in the LLM architecture.
EPRI RADAR | Global framework for grid defense, analytics, and resilience. Duke Energy and RTE as founding members. | Industry-wide initiative. Standard-setting influence. Training programs for utility personnel. | Framework, not software. Does not build tools; publishes guidelines. Moves at committee speed.
Big 4 / Large SIs | Deloitte, Accenture, McKinsey, etc. Strategy consulting, platform implementation, vendor selection. | Organizational change management. Procurement relationships. McKinsey contracted for ERCOT queue redesign. | They advise on process; they do not build physics-informed models. Engagements run $2M-$20M+ and deliver strategy decks and vendor evaluations, not working AI systems.

Honest gaps nobody solves well: Legacy data quality at individual utilities (decades of inconsistent SCADA archives). Organizational readiness for AI in risk-averse control rooms. Long NERC CIP-013 vendor qualification timelines (6-12 months, regardless of vendor). These are constraints that affect every vendor and consultancy equally, including us.

What We Build for Grid Operators

Each engagement is custom. These are the capability areas where we have depth, not a product catalog. We work with your existing SCADA/EMS vendor, not against them.

1. Interconnection Queue Intelligence

For ISOs/RTOs drowning in queue volume. We build NLP screening that extracts application parameters and assigns completion probability scores using historical queue data. GNN-based topological clustering groups projects by electrical proximity for FERC Order 2023 cluster studies, not by arrival time. Automated power flow pre-screening runs thousands of injection scenarios against the network model.

The shift from first-come-first-served to first-ready-first-served requires tools that understand grid topology, not just spreadsheets.
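To make "electrical proximity, not arrival time" concrete, here is a minimal sketch. It is not the GNN-based clustering described above: it swaps in a much simpler single-linkage grouping (union-find over a distance threshold) to show the idea. The project names, bus names, electrical distances, and threshold are all hypothetical.

```python
def cluster_by_electrical_distance(projects, distance, threshold):
    """Group projects whose points of interconnection are electrically close.

    projects:  {project_name: poi_bus}
    distance:  {frozenset({bus_a, bus_b}): electrical distance (hypothetical units)}
    threshold: maximum distance for two projects to share a study cluster
    """
    parent = {name: name for name in projects}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    names = sorted(projects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pair = frozenset((projects[a], projects[b]))
            d = distance.get(pair, float("inf"))
            if projects[a] == projects[b] or d <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical queue: two projects near each other, one far away.
QUEUE = {"solar_A": "bus1", "wind_B": "bus2", "bess_C": "bus7"}
DISTANCE = {
    frozenset(("bus1", "bus2")): 0.02,
    frozenset(("bus1", "bus7")): 0.40,
    frozenset(("bus2", "bus7")): 0.35,
}
STUDY_CLUSTERS = cluster_by_electrical_distance(QUEUE, DISTANCE, threshold=0.05)
```

In this toy queue, solar_A and wind_B land in one study cluster regardless of when each application arrived, while bess_C is studied separately.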

2. Grid Resilience Analytics

Physics-informed simulation models that run N-1/N-2 contingency analysis orders of magnitude faster than PSS/E. We embed swing equations and Kirchhoff's laws into the model training, so results respect grid physics rather than just learning statistical patterns. 10,000 contingency scenarios in hours, not months.

These are planning-stage advisory tools, not real-time controllers. PINNs are not production-ready for autonomous grid control, and we are honest about that.
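A quick way to see why contingency studies reach thousands of scenarios: the count grows combinatorially with the number of elements that can be outaged. The sketch below simply counts single and double outages. The figure of 140 elements is a hypothetical study area chosen for illustration, not a number from any real system.

```python
from math import comb


def contingency_count(n_elements, depth=2):
    """Number of distinct outage combinations up to N-`depth`."""
    return sum(comb(n_elements, k) for k in range(1, depth + 1))


# Hypothetical study area with 140 lines and transformers.
N1_ONLY = contingency_count(140, depth=1)   # single outages
N1_AND_N2 = contingency_count(140, depth=2)  # single plus double outages
```

For this hypothetical area, N-1 alone is 140 cases, while N-1 plus N-2 is 9,870, which is roughly the 10,000-scenario scale mentioned above.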

3. DLR Optimization Analytics

LineVision provides the sensors. GE Vernova provides the SCADA. The missing layer is analytics that tell you where to deploy DLR for maximum capacity unlock, how seasonal weather patterns affect rating windows, and how to integrate dynamic ratings into planning workflows designed around static ratings. We build that analytics layer.

FERC Order 1920 requires GETs evaluation before traditional construction. We provide the quantitative analysis to satisfy that requirement with corridor-specific data.

4. Collector-Level Observability

The Iberian blackout happened because monitoring stopped at the transmission level. We build edge analytics for sub-transmission voltage and reactive power monitoring at the 220 kV collector level: the exact blind spot that ENTSO-E identified. Anomaly detection runs alongside existing SCADA, not instead of it.

Read-only integration in Phase 1. We consume SCADA telemetry and state estimator outputs without writing back to the control system. Zero disruption to existing protection schemes.

5. Grid AI Compliance & Governance

Three regulatory timelines are converging: EU AI Act high-risk conformity assessment (August 2026 deadline, EUR 15M penalty), NERC CIP-003-9 security management (April 2026), and FERC Order 1920 GETs evaluation requirements. We build the documentation, testing protocols, and audit frameworks that satisfy all three.

Most grid operators running AI for demand forecasting or DER management have not audited whether those systems qualify as high-risk under the EU AI Act. We start there.

Why Not a Bigger Firm?

McKinsey is redesigning ERCOT's queue process. They deliver process recommendations. We deliver working queue screening models trained on your historical data. Deloitte partnered with Utilidata on grid edge. Their role is systems integration and change management. Our role is building the physics-informed models that the systems integration wraps around. The Big 4 are complementary to what we do, not competitive. They handle organizational readiness and vendor procurement. We build the AI that the organization runs.

How We Work

Grid operators plan in regulatory cycles. Our engagement phases align with how ISOs and utilities actually budget, approve, and deploy technology.

Phase 1 (0-6 months): Assessment & Quick Wins

  • Data audit: Map existing SCADA, IoT, and weather data sources. Identify gaps in collection frequency, archive quality, and format consistency. Most utilities discover their historical data is less complete than assumed.
  • DLR corridor prioritization: If DLR sensors are deployed, analyze which corridors yield maximum capacity unlock. If not, identify the top 5 congested corridors where DLR would defer planned upgrades.
  • Regulatory baseline: Audit existing AI systems against EU AI Act high-risk criteria and NERC CIP-003-9 requirements. Produce gap analysis and compliance roadmap.
  • Queue diagnostic (ISOs): Profile the interconnection queue. Identify phantom load patterns, cluster candidates, and fast-track opportunities.

Phase 2 (6-18 months): Build & Integrate

  • Queue intelligence platform (ISOs): Deploy NLP screening, topological clustering, and automated pre-screening. Calibrate against historical queue outcomes. Integrate with existing planning tools.
  • Contingency simulation: Build PINN-based advisory models for N-1/N-2 analysis. Validate against PSS/E baseline. Deploy as a planning accelerator alongside, not replacing, existing tools.
  • Collector-level monitoring (post-blackout): Deploy anomaly detection at sub-transmission substations. Read-only SCADA integration via IEC 61850 and ICCP/TASE.2.
  • NERC CIP-013 package: Prepare vendor risk management documentation for utility security team evaluation. Account for the 6-12 month qualification timeline.

Phase 3 (18-36 months): Scale & Optimize

  • Cross-corridor DLR analytics: Expand from pilot corridors to system-wide dynamic rating integration. Address seams issues where adjacent utilities rate shared corridors differently.
  • Advisory control recommendations: Graduate from monitoring to human-in-the-loop advisory signals for reactive power management and congestion relief. Operators retain full authority.
  • Continuous compliance: Post-market monitoring for EU AI Act conformity. Ongoing NERC CIP documentation as standards evolve (CIP-015 internal network security monitoring is coming).

Caveat: Phase 3 timelines depend on regulatory approval processes (FERC, NERC, state PUCs) that we cannot control. We plan for 2-3 year regulatory cycles, not 6-month startup sprints.

Grid AI Readiness Assessment

Answer six questions about your current grid infrastructure and data maturity. The assessment identifies your starting point and recommends specific next steps, whether or not you work with us.


Questions Grid Operators Ask

How does AI reduce interconnection queue backlog for ISOs and utilities?

The US interconnection queue has swelled to 2,600 GW with a median five-year wait for commercial operation. The bottleneck is human engineering hours, not policy. FERC Order 2023 mandates cluster studies, but ISOs lack the staff to process clusters within 150-day timelines.

AI addresses this at three points. First, NLP-based application screening extracts key parameters (MW, location, technology type, developer financial backing) from interconnection applications and assigns a completion probability score based on historical patterns. In ERCOT, where 77% of the 233 GW queue is data center load, this separates credible demand from speculative phantom applications. Second, GNN-based topological clustering groups projects by electrical proximity and grid impact zone rather than arrival time, producing study clusters that match how the grid actually behaves. Third, automated power flow pre-screening runs thousands of injection scenarios against the existing network model to identify which projects can proceed without major upgrades.

The result is a shift from first-come-first-served to first-ready-first-served. For context, GridLab found that if just 10% of queued renewables in PJM had connected in time for the 2026/2027 auction, consumers would have saved $3.5 billion in a single capacity auction.
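The third step, power flow pre-screening, can be illustrated with a toy DC power flow. This is a sketch under simplifying assumptions, not a production screening tool: the 3-bus network, reactances, and 100 MW line limits are hypothetical, and a DC approximation ignores voltage and reactive power entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    a: int            # from-bus index
    b: int            # to-bus index
    x: float          # series reactance (hypothetical per-unit value)
    limit_mw: float   # thermal limit in MW (hypothetical)


def dc_flows(lines, inj_mw):
    """DC power flow on a 3-bus network; bus 0 is the slack bus.

    inj_mw is the net injection in MW at buses 1 and 2.
    Returns {(from_bus, to_bus): flow in MW}.
    """
    # Reduced susceptance matrix for the two non-slack buses.
    B = [[0.0, 0.0], [0.0, 0.0]]
    for ln in lines:
        b = 1.0 / ln.x
        for i, j in ((ln.a, ln.b), (ln.b, ln.a)):
            if i == 0:
                continue
            B[i - 1][i - 1] += b
            if j != 0:
                B[i - 1][j - 1] -= b
    # Solve the 2x2 system B * theta = injections with Cramer's rule.
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    p1, p2 = inj_mw
    theta = [
        0.0,
        (p1 * B[1][1] - B[0][1] * p2) / det,
        (B[0][0] * p2 - B[1][0] * p1) / det,
    ]
    return {(ln.a, ln.b): (theta[ln.a] - theta[ln.b]) / ln.x for ln in lines}


def overloaded_lines(lines, inj_mw):
    """Lines whose DC flow exceeds the thermal limit for this injection."""
    flows = dc_flows(lines, inj_mw)
    return [(ln.a, ln.b) for ln in lines if abs(flows[(ln.a, ln.b)]) > ln.limit_mw]


NETWORK = [Line(0, 1, 0.1, 100.0), Line(0, 2, 0.1, 100.0), Line(1, 2, 0.1, 100.0)]
```

With equal reactances, two thirds of an injection at bus 2 flows on the direct line to the slack bus. A hypothetical 120 MW project at bus 2 therefore clears the screen, while a 180 MW project overloads line (0, 2) and would be routed to a detailed study.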

What caused the 2025 Iberian blackout and how does AI prevent similar cascading failures?

The April 28, 2025 Iberian blackout resulted from a specific failure chain documented in ENTSO-E's March 2026 final report. With 78% renewable penetration that morning, low-frequency oscillations at 0.21 Hz and 0.63 Hz appeared. TSOs responded by meshing parallel 400 kV circuits, which raised transmission voltages. The critical gap: 400 kV readings looked nominal, but collector-level substations at 220 kV were experiencing overvoltage because transformer tap-changers could not adjust fast enough. One major generation facility injected reactive power during the overvoltage instead of absorbing it, creating a positive feedback loop. Within 5 seconds, 15 GW disconnected and 60 million people lost power.

The root cause was an observability gap: TSOs monitored transmission but not collector-level conditions. AI-based collector-level monitoring detects voltage excursions at the 220 kV level in real time, correlates them with transmission-level state, and flags the divergence before protective relays cascade. This is not autonomous control. It is high-speed anomaly detection integrated into existing SCADA systems, providing operators with seconds to minutes of warning that current monitoring architectures miss entirely.
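The divergence check described above can be sketched in a few lines by converting both readings to per-unit and comparing them. The readings of 418 kV and 242 kV come from the timeline on this page. The two thresholds are hypothetical placeholders, not limits from any grid code, and a real detector would work on time series rather than single samples.

```python
def per_unit(measured_kv, nominal_kv):
    """Express a voltage as a fraction of its nominal level."""
    return measured_kv / nominal_kv


def divergence_flags(v_tx_kv, v_coll_kv,
                     tx_nominal_kv=400.0, coll_nominal_kv=220.0,
                     coll_limit_pu=1.08,   # hypothetical alarm threshold
                     max_gap_pu=0.04):     # hypothetical divergence threshold
    """Flag collector overvoltage that the transmission reading does not show."""
    tx_pu = per_unit(v_tx_kv, tx_nominal_kv)
    coll_pu = per_unit(v_coll_kv, coll_nominal_kv)
    flags = []
    if coll_pu > coll_limit_pu:
        flags.append("collector_overvoltage")
    if coll_pu - tx_pu > max_gap_pu:
        flags.append("tx_collector_divergence")
    return flags


# Readings from the 12:31-12:32 window: 418 kV on the 400 kV level,
# 242 kV on the 220 kV level.
IBERIAN_FLAGS = divergence_flags(418.0, 242.0)
```

In per-unit terms, 418 kV is about 1.045 while 242 kV is about 1.10, so the transmission level looks unremarkable while the collector level does not. That gap is the signal an operator watching only the 400 kV level never sees.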

How does dynamic line rating implementation work and what capacity increase is realistic?

Dynamic Line Rating replaces conservative static ratings (based on worst-case weather assumptions) with real-time thermal capacity calculations using actual conductor temperature, wind speed, solar radiation, and ambient conditions. Proven deployments show consistent results: National Grid in Syracuse achieved 20-30% average capacity increase across four 115 kV lines. AES in Indiana/Ohio saw 61% capacity increase on 345 kV lines and 25% on 69 kV lines. Duquesne Light reported up to 25% increases.
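The thermal calculation behind a dynamic rating is a conductor heat balance: current heating plus solar gain must equal convective plus radiative cooling. The sketch below uses the steady-state form of that balance as found in IEEE Std 738 and takes the heat terms as inputs instead of computing them from weather data. All numeric values are hypothetical and do not correspond to the deployments cited above.

```python
import math


def steady_state_ampacity(q_conv, q_rad, q_solar, r_ac):
    """Allowable current (A) from a steady-state conductor heat balance.

    q_conv:  convective cooling, W/m
    q_rad:   radiative cooling, W/m
    q_solar: solar heat gain, W/m
    r_ac:    AC resistance at the maximum conductor temperature, ohm/m
    """
    net_cooling = q_conv + q_rad - q_solar
    if net_cooling <= 0:
        return 0.0
    return math.sqrt(net_cooling / r_ac)


# Hypothetical conductor: a static rating assumes weak wind cooling,
# while the dynamic rating uses the stronger cooling actually measured.
STATIC_A = steady_state_ampacity(q_conv=40.0, q_rad=25.0, q_solar=15.0, r_ac=8e-5)
DYNAMIC_A = steady_state_ampacity(q_conv=80.0, q_rad=25.0, q_solar=15.0, r_ac=8e-5)
GAIN = DYNAMIC_A / STATIC_A - 1.0
```

Because ampacity scales with the square root of net cooling, doubling convective cooling in this example raises the rating by roughly a third, not by a factor of two.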

The economics are compelling: DLR costs 5-7% of traditional transmission upgrades and deploys in weeks rather than years. The AES case study showed $0.39M for DLR versus $1.63M for reconductoring, a 76% cost reduction. FERC Order 1920 now requires transmission planners to evaluate GETs including DLR before approving traditional construction.

The challenge is not the sensor technology (LineVision, Ampacimon, and others have mature hardware). The challenge is the analytics layer: identifying which corridors yield the highest capacity unlock for queued generation, predicting seasonal rating windows for planning studies, handling seams where adjacent utilities rate the same corridor differently, and integrating DLR data into existing transmission planning workflows that were designed around static ratings.

Can physics-informed neural networks actually replace PSS/E for grid stability analysis?

Not yet for production-grade control, and anyone claiming otherwise is overstating the technology. PINNs embed physical laws (swing equations, Kirchhoff's laws) into neural network training, which produces models that respect grid physics rather than just learning statistical patterns from data. Academic benchmarks show PINN-based solvers running 80-90x faster than conventional numerical methods on small test systems (IEEE 9-bus, 39-bus).

The problem is scaling. PJM has 90,000+ buses. The loss function balancing problem (data fidelity vs. physics residual vs. boundary conditions) remains an active research challenge with no commercial solution as of April 2026. Publications grew from fewer than 10 in 2019 to 820 in 2025, but commercial deployments are zero.
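The loss balancing problem is easier to see with the three terms written out. The sketch below evaluates a composite loss for a single-machine swing equation, M·δ'' + D·δ' = P_mech − P_max·sin(δ), on a sampled rotor-angle trajectory. It uses finite differences as a stand-in for the automatic differentiation a real PINN would use, and every parameter value and weight is hypothetical.

```python
import math


def swing_residuals(delta, dt, M, D, p_mech, p_max):
    """Swing-equation residual at interior samples of a rotor-angle trajectory."""
    residuals = []
    for k in range(1, len(delta) - 1):
        d1 = (delta[k + 1] - delta[k - 1]) / (2 * dt)               # d(delta)/dt
        d2 = (delta[k + 1] - 2 * delta[k] + delta[k - 1]) / dt**2   # second derivative
        residuals.append(M * d2 + D * d1 - (p_mech - p_max * math.sin(delta[k])))
    return residuals


def composite_loss(pred, measured, delta0, dt, M, D, p_mech, p_max,
                   w_data=1.0, w_phys=1.0, w_bc=1.0):
    """Weighted sum of data fidelity, physics residual, and initial-condition terms.

    pred:     predicted rotor angles (rad) on a uniform time grid
    measured: {sample_index: measured angle} for the points where data exists
    delta0:   known initial rotor angle
    """
    data = sum((pred[k] - v) ** 2 for k, v in measured.items()) / len(measured)
    res = swing_residuals(pred, dt, M, D, p_mech, p_max)
    phys = sum(r * r for r in res) / len(res)
    bc = (pred[0] - delta0) ** 2
    return w_data * data + w_phys * phys + w_bc * bc


# Hypothetical machine sitting at its equilibrium angle.
PARAMS = dict(dt=0.01, M=0.1, D=0.05, p_mech=0.5, p_max=1.0)
EQ = math.asin(PARAMS["p_mech"] / PARAMS["p_max"])
FLAT = [EQ] * 5
WOBBLE = [EQ, EQ + 0.1, EQ, EQ - 0.1, EQ]
LOSS_FLAT = composite_loss(FLAT, {0: EQ, 4: EQ}, EQ, **PARAMS)
LOSS_WOBBLE = composite_loss(WOBBLE, {0: EQ, 4: EQ}, EQ, **PARAMS)
```

Both trajectories match the two measured points exactly, so the data term alone cannot tell them apart. Only the physics term penalizes the wobble. How heavily to weight that term against the others is the balancing problem, and it gets harder as the system grows.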

Where PINNs deliver value today is in planning-stage advisory simulation, not real-time control. Running 10,000 N-1/N-2 contingency scenarios in hours instead of months gives planning engineers substantially better coverage of the failure space. The models flag which contingencies deserve detailed PSS/E analysis rather than replacing PSS/E entirely. We build PINN-based advisory tools that accelerate planning studies and contingency screening. We do not build autonomous grid controllers, and we are skeptical of anyone who claims they do.

What does EU AI Act compliance mean for grid operators deploying AI?

The EU AI Act classifies AI systems used as safety components in critical infrastructure management, including electricity supply, as high-risk. The compliance deadline is August 2, 2026. Penalties reach EUR 15 million or 3% of global annual turnover.

For grid operators, this covers AI used in load forecasting and dispatch, automated fault detection and isolation, grid management and real-time optimization, and any system whose failure could cause physical damage to infrastructure. High-risk classification triggers specific requirements: conformity assessment before deployment, risk management system covering the full AI lifecycle, data governance requirements for training and validation datasets, technical documentation sufficient for third-party audit, human oversight mechanisms ensuring operators can intervene, and post-market monitoring for performance degradation.

In practice, grid operators already running AI tools for demand forecasting or DER management need to audit whether those systems qualify as safety components. The definition turns on whether failure or malfunction could result in physical damage. A demand forecast that feeds into dispatch decisions likely qualifies. A customer service chatbot does not. Most grid operators have not started structured compliance work. The challenge is that grid AI systems often evolved from research projects or vendor add-ons without the documentation rigor that conformity assessment requires.

How do you integrate AI with existing GE Vernova or Siemens SCADA systems without ripping them out?

Grid operators have decades of investment in GE Vernova GridOS, Siemens Spectrum Power, or ABB SCADA/EMS systems. Replacing them is not realistic, and it is not necessary. We build AI analytics layers that sit alongside existing SCADA/EMS, consuming the same data feeds through standard protocols (IEC 61850 for substation automation, ICCP/TASE.2 for inter-control-center communication, CIM IEC 61970/61968 for data modeling).

The integration architecture is read-only in Phase 1: our systems consume SCADA telemetry and state estimator outputs without writing back to the control system. This eliminates the certification burden of a system that issues control commands. The analytics run on separate compute infrastructure (cloud or on-premise, depending on the utility's NERC CIP posture) and surface results through operator dashboards that integrate into existing control room workflows.
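The shape of a read-only integration can be sketched as an interface that never receives the write path at all. This illustrates the interface only: the class names and the tag are hypothetical, and in practice read-only guarantees are typically enforced at the network and protocol level, not in application code.

```python
class InMemoryScada:
    """Hypothetical stand-in for a SCADA/EMS source with read and write paths."""

    def __init__(self, points):
        self._points = dict(points)

    def read(self, tag):
        return self._points[tag]

    def write(self, tag, value):
        self._points[tag] = value


class ReadOnlyFeed:
    """Analytics-side view that exposes only the read path of the source."""

    __slots__ = ("_read",)

    def __init__(self, source):
        # Keep a reference to the read method only; no write method is exposed.
        self._read = source.read

    def read(self, tag):
        return self._read(tag)


def voltage_snapshot(feed, tags):
    """Collect the current values for a list of telemetry tags."""
    return {tag: feed.read(tag) for tag in tags}


SCADA = InMemoryScada({"SUB1.V_KV": 221.5})
FEED = ReadOnlyFeed(SCADA)
SNAPSHOT = voltage_snapshot(FEED, ["SUB1.V_KV"])
```

The analytics code depends only on `read`, so moving it between a historian, a state estimator export, or a live telemetry feed does not change what it is allowed to do.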

The NERC CIP-013 supply chain risk management process adds 6-12 months to vendor qualification. We account for this in project timelines and provide the documentation package that utility security teams need for evaluation.

What does a grid AI engagement actually cost and how long does it take?

Costs depend on scope and the utility's data maturity. A DLR analytics optimization engagement for a utility with existing sensor deployments typically runs $200K-$500K over 3-6 months, covering corridor prioritization, seasonal rating analysis, and integration with planning workflows. An interconnection queue intelligence build for an ISO/RTO is larger: $500K-$1.5M over 6-12 months, including NLP screening models, topological clustering, and automated pre-screening tools calibrated against the ISO's historical queue data.

Collector-level observability systems for post-blackout resilience range from $300K-$800K depending on the number of monitored substations and integration complexity with existing SCADA. A full grid AI compliance assessment (EU AI Act, NERC CIP) for existing AI deployments runs $150K-$400K over 2-4 months.

These are custom builds, not license fees. Each engagement produces a system the utility owns and operates. For comparison: a single PJM capacity auction costs ratepayers $16.4 billion. DLR deployment that defers one major transmission project saves $50M-$500M. Queue intelligence that accelerates even a small percentage of viable projects into the market saves billions in capacity procurement costs.

Technical Research

The research behind this solution page. These interactive whitepapers provide the full technical depth on physics-informed grid AI, interconnection queue analysis, and post-blackout resilience engineering.

Interconnection Bottlenecks Cost PJM Consumers $3.5 Billion in a Single Auction

Queue intelligence, DLR optimization, and resilience analytics that pay for themselves in the first planning cycle.

Whether you are an ISO processing a 200+ GW queue, a utility evaluating DLR for FERC Order 1920 compliance, or a European operator building post-blackout resilience, we build the AI systems your grid software does not provide.

Grid AI Assessment

  • Data maturity audit and SCADA integration analysis
  • DLR corridor prioritization and capacity modeling
  • EU AI Act / NERC CIP compliance gap assessment
  • Interconnection queue diagnostic and optimization roadmap

Custom Grid AI Build

  • Queue screening and topological clustering platform
  • PINN-accelerated contingency simulation
  • Collector-level observability and anomaly detection
  • Vendor-neutral integration with GE/Siemens/ABB SCADA