A cantilevered balcony shown simultaneously as a photorealistic rendering (left half) and as a structural force diagram revealing hidden failure (right half), capturing the article's core tension between appearance and physics.
Artificial Intelligence · Structural Engineering · Machine Learning

I Asked GPT-4 If a Balcony Was Safe. It Said Yes. Physics Said It Would Collapse.

Ashutosh Singhal · March 10, 2026 · 14 min read

There's a rendering on my desk — a printout, actually, because I wanted to stare at it without a screen between us — of a cantilevered balcony. Clean lines, parametric railing, the kind of thing you'd see in an architecture magazine spread about "the future of urban living." I fed the image to GPT-4V and asked a simple question: Is this structure safe?

The response was fluent, confident, and detailed. It noted the apparent railing height, commented on the visible support conditions, and concluded that the design "appears structurally sound with adequate support."

Then I handed the same drawing to my structural engineer. She looked at it for maybe fifteen seconds. "There's no back-span reinforcement," she said. "The moment at the fixed end exceeds the section capacity. This falls."

The AI saw pixels. She saw physics. And that gap — between what looks safe and what is safe — is the reason I started Veriprajna.

The Seduction of "Good Enough"

I need to be honest about something. When multimodal LLMs first started processing engineering drawings, I was excited. Genuinely excited. I remember sitting in our small office late one evening, running blueprint after blueprint through early GPT-4V access, watching it describe structural elements with surprising vocabulary. "Steel I-beam," it would say. "Reinforced concrete column." It sounded like it understood.

That excitement lasted about three weeks.

The turning point was a test we ran on connection details — the joints where beams meet columns, where the actual load transfer happens. We gave the model a series of drawings where some connections were properly detailed and others had subtle but critical flaws: missing stiffener plates, undersized welds, discontinuous load paths. The kind of things that separate a building that stands from one that doesn't.

The model's accuracy on identifying these flaws was essentially random. It could name the components. It could describe what it saw. But it couldn't reason about whether the forces would actually flow from point A to point B. It was like asking someone who'd memorized the names of every bone in the human body to perform surgery.

An AI that can name every structural element but can't trace a load path isn't an engineering tool. It's a liability with a confident voice.

Why Do LLMs See Blueprints as Pixel Soup?

Side-by-side comparison showing how a Vision Transformer processes a structural drawing as a grid of pixel patches (losing physics) versus how a graph representation preserves actual structural relationships and physical properties.

Here's what's actually happening under the hood, and it matters even if you're not technical.

When GPT-4V or Gemini "looks" at a structural drawing, it uses something called a Vision Transformer. The model chops the image into a grid of small patches — typically 16×16 pixels each — and processes them as a sequence, similar to how it processes words in a sentence. It learns statistical associations between patches. A patch with a vertical line (column) tends to appear near a patch with a horizontal line (beam). Over millions of training images, these correlations get baked in.
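To make that concrete, here's a minimal NumPy sketch — not any model's actual preprocessing code — of how an image becomes a flat sequence of patch vectors:

```python
import numpy as np

# Toy illustration of Vision Transformer tokenization: the drawing
# becomes a flat sequence of patch vectors.
def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W) image into a sequence of flattened patch vectors."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    return (
        image.reshape(h // patch, patch, w // patch, patch)
             .transpose(0, 2, 1, 3)
             .reshape(-1, patch * patch)
    )

blueprint = np.random.rand(224, 224)            # stand-in for a drawing
tokens = patchify(blueprint)                    # (196, 256) token sequence
shuffled = tokens[np.random.permutation(len(tokens))]
# To the transformer, `tokens` and `shuffled` contain identical patch
# statistics; the spatial relationship between a beam patch and a column
# patch survives only in learned position embeddings, not in the data.
```

Once the image is a bag of patch vectors, "which element supports which" is no longer in the input — it has to be statistically reconstructed, which is exactly where the correlation-versus-causation gap opens up.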

But here's the critical distinction: correlation is not causation. The model learns that columns and beams tend to appear together. It does not learn that the beam is supported by the column. It doesn't know that if you remove the column, the beam falls. It has no internal physics engine. It has pattern statistics.

Research presented at NeurIPS demonstrated something that should alarm anyone thinking about deploying these models for safety-critical work: when you scramble the pixel patches of an image — literally shuffle them like a deck of cards — Vision Transformers often maintain high classification accuracy. They're not reading the spatial structure. They're reading texture and local patterns.

In engineering, spatial structure is everything. A connection detail that's "mostly there" but missing a critical load path isn't 90% safe. It's 100% unsafe.

What Happens When You Actually Benchmark LLMs on Structural Reasoning?

I kept hoping the benchmarks would prove me wrong. They didn't.

The DSR-Bench study evaluated ten state-of-the-art LLMs across 4,140 problem instances designed to test structural reasoning — the ability to understand and manipulate complex relationships between entities. This is exactly what you need to analyze a building frame: trace relationships through multiple nodes, satisfy strict constraints, reason about spatial configurations.

The best frontier model scored 0.498 out of 1.0 on challenging instances. Essentially a coin flip.

The failure modes were specific and damning. Multi-hop reasoning — tracing a relationship through several intermediate nodes, which is literally what load path analysis requires — was a consistent weakness. And performance degraded when problems were described in natural language compared to formal code, suggesting the models were pattern-matching syntax from their training data rather than actually reasoning.

I remember the team meeting where we reviewed these numbers. One of my engineers, who'd been cautiously optimistic about using LLMs as a first-pass screening tool, went quiet for a long time. Then he said: "So when an engineer describes a non-standard structural problem in plain English, the model is basically guessing half the time." That was the moment the room shifted. Not gradually — all at once.

Separately, the DesignQA benchmark found that multimodal LLMs could answer "What is the maximum allowed deflection?" (extracting a number from documentation) but failed at "Does this specific beam design meet the maximum allowed deflection?" (applying that number to a visual). Extraction versus application. Knowing the rule versus enforcing it.

I wrote about this failure mode in much more depth in the interactive version of our research, including the bizarre material selection biases we found — LLMs recommending titanium and carbon fiber for contexts that clearly called for standard structural steel, simply because exotic materials dominate the "high-tech" corners of their training data.

The Moment We Stopped Trying to Fix LLMs

There was an investor meeting — I won't say which firm — where someone looked at our early research and said, "Why don't you just fine-tune GPT for structural engineering? Seems like the faster path."

I understood the logic. Take the dominant paradigm, specialize it, ship it. But I'd been staring at this problem long enough to know that fine-tuning a probabilistic model to do deterministic work is like fine-tuning a poet to do arithmetic. You can get them to produce numbers. You cannot get them to guarantee the numbers are right.

The laws of physics are not probabilistic. If the sum of forces on a structural element doesn't equal zero, the element accelerates. There's no "usually" about it. There's no confidence interval. The Euler-Bernoulli beam equation doesn't care about your training data distribution.

So we made a decision that felt contrarian at the time and feels obvious now: we abandoned the image entirely.

Not the AI — the image. We stopped trying to make neural networks understand blueprints as pictures. Instead, we started converting buildings into what they actually are: mathematical graphs.

A building is not an image. It's a network of forces. The moment you treat it as pixels, you've already lost the physics.

How Do You Turn a Building Into a Graph?

Annotated diagram showing the transformation pipeline from a simple structural frame into a mathematical graph, with labeled node features and edge properties.

A graph, in the mathematical sense, is just nodes and edges. Nodes are things; edges are connections between things.

In our system, every structural component — beam, column, slab, wall — becomes a node. But unlike a pixel, which carries only color data, each of our nodes carries a rich feature vector: Young's Modulus (how stiff the material is), Moment of Inertia (how the cross-section resists bending), Yield Strength (when the material breaks). The actual physical parameters you need to calculate whether something stands or falls.

Every physical connection between components becomes an edge. An edge between a beam and a column captures the connection stiffness — is it a rigid moment connection or a simple pin? — and the relative orientation. These aren't learned approximations. They're extracted directly from BIM (Building Information Modeling) data, where the connectivity is explicitly defined.
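As an illustrative sketch — using networkx and invented component names and values, not our production schema — the representation looks something like this:

```python
import networkx as nx

# Hypothetical structural graph: components are nodes carrying physical
# feature vectors; connections are edges carrying joint semantics.
G = nx.Graph()
G.add_node("column_C1", E=200e9, I=8.0e-5, fy=355e6)   # Young's modulus (Pa),
G.add_node("beam_B1",   E=200e9, I=2.4e-4, fy=355e6)   # moment of inertia (m^4),
G.add_node("slab_S1",   E=30e9,  I=5.0e-4, fy=None)    # yield strength (Pa)

# Edges capture the connection type: rigid moment connection vs. simple pin.
G.add_edge("beam_B1", "column_C1", connection="moment")
G.add_edge("slab_S1", "beam_B1",   connection="pin")

# Connectivity is now an exact graph query, not a pixel guess:
assert nx.has_path(G, "slab_S1", "column_C1")
```

The point of the sketch is the last line: "does load have a route from the slab to the column?" becomes a deterministic query on explicit structure, with no vision model in the loop.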

This representation has a property that matters enormously: permutation invariance. The physics of a building doesn't change if you reorder the list of beams in the database. Graph Neural Networks respect this. Transformer-based LLMs, which process sequences, are sensitive to input order. It sounds like a technical detail, but it's the difference between an architecture that's aligned with the problem and one that's fighting it.
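You can see permutation invariance in a few lines. Here's a toy message-passing step with sum aggregation — a deliberately stripped-down stand-in for a real GNN layer — showing that relabeling the nodes changes nothing:

```python
import numpy as np

# One message-passing step with sum aggregation: each node's update depends
# on the *set* of its neighbours, so node ordering is irrelevant.
def aggregate(features: np.ndarray, edges: list) -> np.ndarray:
    out = np.zeros_like(features)
    for i, j in edges:
        out[i] += features[j]
        out[j] += features[i]
    return out

feats = np.array([[1.0], [2.0], [3.0]])     # 3 nodes, 1 feature each
edges = [(0, 1), (1, 2)]                    # a simple chain

# Reorder the nodes (and relabel the edges consistently):
perm = [2, 0, 1]
inv = {old: new for new, old in enumerate(perm)}
feats_p = feats[perm]
edges_p = [(inv[i], inv[j]) for i, j in edges]

a = aggregate(feats, edges)
b = aggregate(feats_p, edges_p)
assert np.allclose(a[perm], b)              # same physics, any ordering
```

A transformer fed the same data as a sequence would see two different inputs; the graph layer provably sees one structure.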

We built a pipeline that takes IFC files — the standard format for BIM data — and converts them into computation graphs. Where an LLM would try to "read" the blueprint image and guess at connections, our parser captures connectivity with 100% fidelity because the IFC schema defines it explicitly. No guessing. No "it looks like these elements are connected." They either are or they aren't.

The Part Where We Taught Neural Networks Physics

Here's where it gets interesting, and where I think we're doing something genuinely different.

Standard machine learning works like this: show the model lots of examples, let it learn patterns, hope it generalizes. The problem in structural engineering is that "hope it generalizes" is not an acceptable safety standard.

Physics-Informed Neural Networks — PINNs — take a fundamentally different approach. Instead of asking the AI to discover physics from data, we embed the governing equations directly into the network's loss function. The loss function is the thing the network is trying to minimize during training — it's the definition of "wrong" that drives learning.

In a standard neural network, "wrong" means "your prediction doesn't match the training data." In a PINN, we add a second definition of "wrong": "your prediction violates the laws of physics."

Take the Euler-Bernoulli beam equation, which governs how a beam deflects under load. When our network predicts a deflection shape for a structural element, we use automatic differentiation to compute the physical residual — essentially asking, "Does this predicted deflection satisfy the differential equation of static equilibrium?" If it doesn't, the physics loss term spikes, and the network is forced to correct itself.

The network literally cannot learn a solution that violates Newton's laws. Not "probably won't." Cannot.
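Here's what checking that residual looks like in miniature. This is a hedged sketch with illustrative numbers: a known closed-form beam deflection plays the role of the network's prediction, and finite differences stand in for the automatic differentiation we use in practice.

```python
import numpy as np

# Physics-residual check for the Euler-Bernoulli equation: EI * w'''' = q.
# A PINN penalizes this residual in its loss; finite differences here
# stand in for automatic differentiation.
E_I = 1.0e4     # flexural rigidity EI (N·m^2) — illustrative value
q   = 1.0e3     # uniform load (N/m)
L   = 2.0       # span (m)

x = np.linspace(0.0, L, 201)
h = x[1] - x[0]

# Closed-form deflection of a simply supported beam under uniform load,
# playing the role of the network's predicted deflection shape.
w = q * x * (L**3 - 2 * L * x**2 + x**3) / (24 * E_I)

# Five-point stencil for the fourth derivative at interior points.
w4 = (w[:-4] - 4 * w[1:-3] + 6 * w[2:-2] - 4 * w[3:-1] + w[4:]) / h**4

residual = E_I * w4 - q            # zero wherever the physics is satisfied
physics_loss = float(np.mean(residual**2))
assert physics_loss < 1e-6         # an exact solution incurs ~zero physics loss
```

A prediction that satisfied the data but violated equilibrium would drive `physics_loss` up, and training would push the network back toward solutions the differential equation permits.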

I remember the first time we got this working on a non-trivial structure. We'd been struggling for weeks with convergence issues — the physics loss and the data loss were fighting each other, and the network was oscillating. My lead ML engineer had been sleeping in the office (I told him not to; he ignored me). Then one morning he called me over to his screen. The predicted deflection curves had snapped into alignment with the FEM (Finite Element Method) solution. Not approximately. The R² value was 0.9999.

We'd built something that had the speed of AI and the precision of traditional engineering solvers. Recent research on Graph-Structured Physics-Informed DeepONets — the architecture class we build on — has demonstrated 7–8x speedups over traditional FEM while maintaining that level of accuracy. For the full technical breakdown of our architecture and benchmarks, including the math behind our message-passing framework, I've published a detailed research paper.

Can You Actually See Where a Building Will Fail?

Side-by-side comparison of load path streamlines through a cantilevered structure — one showing continuous safe flow to the foundation, the other showing abrupt termination at a missing connection, illustrating how graph-based analysis reveals failure modes.

This is the question engineers care about most, and it's where graph-based analysis becomes viscerally powerful.

In our system, we don't just check whether a structure passes or fails as a whole. We trace the Principal Load Path — the route that forces take from the point of application (say, people standing on a balcony) down through the structure to the foundation.

We do this using a metric called the U* Index, which maps internal strain energy transfer and relative rigidity between points. Using Runge-Kutta integration on the U* gradient, we draw "streamlines" of force through the structure — like a weather map, but for loads instead of wind.

When a structure is safe, the streamlines flow continuously from the loaded element down to the foundation. When it's not — when there's a missing connection, an undersized member, a discontinuous load path — the streamlines terminate abruptly or diverge wildly.
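The streamline tracing itself is classical numerics. Here's a sketch of fourth-order Runge-Kutta integration on a toy vector field — the field below is a made-up stand-in, not the actual U* gradient computation:

```python
import numpy as np

# Tracing a load-path "streamline" by RK4-integrating a force-flow field.
# `field` is a toy stand-in that flows toward a support at the origin.
def field(p: np.ndarray) -> np.ndarray:
    return -p / (np.linalg.norm(p) + 1e-9)

def trace_streamline(start, step=0.05, n_steps=200):
    """Classic fourth-order Runge-Kutta integration of dp/ds = field(p)."""
    p = np.asarray(start, dtype=float)
    path = [p.copy()]
    for _ in range(n_steps):
        k1 = field(p)
        k2 = field(p + 0.5 * step * k1)
        k3 = field(p + 0.5 * step * k2)
        k4 = field(p + step * k3)
        p = p + (step / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        path.append(p.copy())
    return np.array(path)

# A healthy load path: the streamline runs from the loaded point
# down to the support instead of terminating mid-structure.
path = trace_streamline([1.0, 1.0])
assert np.linalg.norm(path[-1]) < np.linalg.norm(path[0])
```

In the real system the field comes from the U* gradient over the structure; the diagnostic value is the same — you follow the flow and see where it goes, or where it stops.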

Back to that balcony rendering on my desk. When we ran it through our graph pipeline, the load path streamline from the cantilever slab simply... stopped. There was no back-span connection to carry the moment into the supporting structure. The U* contour showed a massive strain energy concentration at the fixed end with nowhere to go. The visualization made the failure mode obvious in a way that no amount of pixel analysis ever could.

A load path streamline that terminates is a sentence the structure is writing about its own death. You just have to know how to read the graph.

We can also simulate progressive collapse — what happens when you remove a column and ask "does the rest of the structure hold?" — by systematically deleting nodes from the graph and re-evaluating connectivity. Using measures like Betweenness Centrality, we identify critical clusters of components whose failure would split the graph into disconnected pieces. This "graph attack" simulation runs in seconds. The equivalent nonlinear FEM collapse analysis takes hours. We can screen thousands of failure scenarios before an engineer finishes their coffee.
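The mechanics of that screen fit in a few lines. This is a toy sketch with an invented five-element frame, not our production analysis, but it shows both halves: centrality to rank criticality, node deletion to test survival.

```python
import networkx as nx

# Toy "graph attack" collapse screen on a hypothetical five-element frame.
G = nx.Graph()
G.add_edges_from([
    ("foundation", "column_A"), ("foundation", "column_B"),
    ("column_A", "beam_1"), ("column_B", "beam_1"),
    ("beam_1", "slab"),
])

# Betweenness centrality flags components that many load paths rely on:
# every route from the slab to the foundation crosses beam_1.
bc = nx.betweenness_centrality(G)
assert bc["beam_1"] > bc["column_A"]

def survives_removal(graph, component, load="slab", support="foundation"):
    """Delete a component and check the load still has a route to ground."""
    damaged = graph.copy()
    damaged.remove_node(component)
    return nx.has_path(damaged, load, support)

assert survives_removal(G, "column_A")       # redundant: the twin column holds
assert not survives_removal(G, "beam_1")     # a single point of failure
```

Each scenario is one node deletion and one connectivity query — which is why thousands of them run in the time one nonlinear FEM collapse analysis takes to set up.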

Why Not Just Use Both? The Verifier Layer

People always push back on this point. "Ashutosh, generative AI is incredible for early-stage design. You can't just ignore it." And they're right — I don't want to ignore it. Architects using tools like Midjourney or parametric generators to explore creative concepts is genuinely exciting. The problem isn't the generation. It's the verification.

What we've built is a Verifier Layer. The generative model proposes a design. Veriprajna converts it to a graph, checks topological connectivity, traces the load path, runs the physics-informed prediction. If the physics check fails, we return a hard constraint — not a suggestion, a constraint: "Increase beam depth by 200mm" or "Add back-span connection." The generative model regenerates within those bounds.

Creativity constrained by physics. Imagination verified by math. That's the workflow.

And because our models are constrained by physics equations rather than trained on the entire internet, they're remarkably data-efficient. A PINN trained on steel frames generalizes to new steel frames because Hooke's Law doesn't change between projects. This also means the models are small enough to deploy on-premise. No client needs to send blueprints of sensitive infrastructure to a public API.

The Glass Box vs. The Black Box

There's one more thing that keeps me up at night about LLM-based engineering tools, and it's not accuracy — it's explainability.

When a Graph Neural Network makes a prediction about a structural element, we can visualize exactly which neighboring nodes influenced that prediction through attention weights. "The column was flagged because the combined load transferred from Beam A and Beam B exceeded its capacity." That's a traceable, auditable reasoning chain. An engineer can look at it and say, "Yes, that's correct" or "No, you've miscounted the tributary area." They can argue with the model.

Try arguing with GPT-4's reasoning about a structural assessment. Ask it why it concluded the balcony was safe. You'll get a fluent paragraph that sounds reasonable but maps to nothing you can verify. The reasoning is distributed across billions of parameters in ways that no human can inspect.

In software, a black box is a design choice. In structural engineering, a black box is an abdication of responsibility.

The Foundation Question

I've been in enough conference rooms and investor meetings to know that the current AI hype in construction is almost entirely about generative models. The pitch decks are gorgeous. The demos are impressive. The underlying assumption — that you can pixel-predict your way to structural safety — is wrong.

The construction industry is unique among all industries in one critical way: our bugs kill people. A software bug is a patch. A structural bug is a collapse investigation, a lawsuit, a memorial. The margin for "probably right" is zero.

We built Veriprajna on graph theory, geometric deep learning, and differential equations because those are the only foundations that offer deterministic answers to safety questions. Not "it looks safe." Not "based on similar structures in our training data, this is likely adequate." But: the physics residual is zero, the load path is continuous, the stress is within capacity.

GPT-4 told me that balcony was safe because it had seen thousands of photos of balconies, and in those photos, the pixels of the floor usually stayed above the pixels of the ground. Physics told me it would collapse because the bending moment at the fixed end exceeded the moment capacity of the section.

I know which one I'm building on.
