
VAR Doesn't Ruin Football. Bad Engineering Does.
I was in a bar in Bangalore when Luis Díaz's goal got chalked off.
It was November 2023, Liverpool vs. Tottenham. The ball hit the net, Anfield erupted, and then — silence. The VAR check. The freeze frame. The line drawn from some pixel on Díaz's shoulder to some pixel on the last defender's boot. Offside. Except it wasn't. The Premier League later admitted the goal should have stood. A "significant human error," they called it.
The guy next to me — a software engineer, not even a football fan — looked at the screen and said something that stuck with me: "Why are they drawing lines on a blurry photo like it's 2005?"
He was right. And not just about that call. The entire VAR offside system is built on a physics mistake so fundamental that I'm genuinely surprised more engineers haven't screamed about it. I run Veriprajna, where we build deep sensor fusion systems — the kind of architecture where you fuse data from multiple sensor types into a single model of reality. When I started pulling apart how VAR actually works under the hood, I didn't find a system that needed tweaking. I found a system that cannot work, not because of bad software, but because of bad physics.
The offside problem isn't a software bug. It's a measurement crisis disguised as a technology success story.
The Pixel Fallacy: Why Cameras Lie About Where Players Are
Here's what most people don't realize about a video frame: it's not a photograph of a frozen instant. It's a smear.
A broadcast camera in the Premier League runs at 50 frames per second. That means it captures one image every 20 milliseconds. During each capture, the shutter is open for roughly 10 milliseconds to let in enough light. In that 10 milliseconds, a sprinting player's foot — moving at 20 meters per second during a kicking motion — travels about 20 centimeters. The "image" of that foot on the sensor isn't a crisp point. It's a blur spanning dozens of pixels.
Now here's where it gets absurd. The VAR operator takes this blurry frame, zooms in, places a single-pixel crosshair on what they believe is the "leading edge" of the attacker's toe, and draws a line. They're picking one point inside a probability distribution and calling it truth.
A broadcast frame doesn't capture where a player is. It captures a probability cloud of where they might have been during a 10-millisecond window.
But the temporal problem is even worse than the spatial one. A professional kick — the moment boot meets ball — happens in about 8 to 12 milliseconds. At 50 frames per second, the camera might catch one frame before contact and the next frame after the ball has already left the foot. The actual instant of the kick almost never appears on screen. The operator picks the "closest" frame, but "closest" can mean 10 milliseconds off. In those 10 milliseconds, players moving at a combined relative speed of 14 meters per second have shifted position by 14 centimeters.
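Both error sources are simple kinematics, and worth checking explicitly. A quick sketch using the figures above:

```python
# Back-of-envelope kinematics for the two error sources above:
# motion blur during the exposure, and drift across the frame-selection window.
foot_speed = 20.0   # m/s, sprinting foot mid-kick
exposure = 0.010    # s, ~10 ms shutter on a 50 fps broadcast camera
rel_speed = 14.0    # m/s, combined attacker/defender relative speed
frame_err = 0.010   # s, the "closest" frame can still be 10 ms off

blur_cm = foot_speed * exposure * 100
drift_cm = rel_speed * frame_err * 100
print(f"blur smear: {blur_cm:.0f} cm, frame-selection drift: {drift_cm:.0f} cm")
# -> blur smear: 20 cm, frame-selection drift: 14 cm
```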
So the system draws a millimeter-precise line on an image that is physically outdated by a distance ten times larger than the margin it's claiming to measure. This isn't measurement. It's theater.
When I Ran the Numbers Myself

I didn't start this project to fix football. I started it because the math offended me.
My team at Veriprajna works on sensor fusion — combining data from cameras, accelerometers, gyroscopes, and other instruments into a unified model of physical reality. We do this for industrial applications where precision matters. When I first looked at the VAR pipeline as an engineering system, I expected to find something sophisticated underneath the controversy. Maybe the public just didn't understand the tolerances. Maybe the error margins were acceptable.
Instead, I found a system with a total zone of uncertainty of 30 to 40 centimeters trying to make calls at the centimeter level.
I sat down one evening and laid out the error budget on a whiteboard. Temporal quantization from frame selection: ±10 milliseconds, which at 14 m/s relative velocity gives ±14 cm of positional uncertainty. Motion blur during the shutter opening: another ±10 cm. Rolling shutter distortion on CMOS sensors (where the image is read line by line, top to bottom, so a fast-moving leg appears geometrically warped): unquantified but real. Add the pixel-level ambiguity of placing a keypoint on a blurred limb, and you're looking at a combined error that dwarfs any offside margin under about 40 centimeters.
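Here is that whiteboard as a few lines of arithmetic. I'm combining the sources in quadrature, treating them as independent, and the keypoint term is my illustrative assumption since I left it unquantified on the board:

```python
import math

# Whiteboard error budget for the broadcast-based pipeline.
temporal_cm = 0.010 * 14.0 * 100   # ±10 ms frame selection at 14 m/s
blur_cm     = 10.0                 # ± motion blur during the open shutter
keypoint_cm = 5.0                  # assumed ± ambiguity placing a point on a blurred limb

# Root-sum-square of independent error sources.
sigma = math.sqrt(temporal_cm**2 + blur_cm**2 + keypoint_cm**2)
print(f"combined: ±{sigma:.0f} cm -> a zone of uncertainty near {2*sigma:.0f} cm")
# -> combined: ±18 cm -> a zone of uncertainty near 36 cm
```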
I remember staring at that whiteboard and thinking: every "tight" offside call in the last five years has been a coin flip dressed up as science.
That was the moment I decided we had to write the full technical analysis. Not to complain about VAR, but to show what a real measurement system would look like.
Why Can't You Just Use "Better AI" on the Same Cameras?
This is the question I get most often, usually from investors and sometimes from other AI companies. "Can't you just train a better model on the broadcast feed?"
No. And the reason reveals a deeper problem in how the sports tech industry works right now.
The market is flooded with what I call wrapper solutions — companies that take a standard broadcast feed, run it through an off-the-shelf object detection model like YOLO or Mask R-CNN, and output bounding boxes or pose estimates. These are fine for fan engagement features, highlight reels, basic analytics. They are fundamentally unsuited for officiating.
A wrapper inherits the limitations of its input. If your input is a 50fps broadcast feed with motion blur, rolling shutter artifacts, and lens distortion, no neural network — no matter how many parameters — can recover temporal information that was never captured. You can't hallucinate physics. The data simply isn't there.
This is the distinction I keep trying to make when people ask what "Deep AI" means to us. It doesn't mean a deeper neural network. It means going deeper in the stack — controlling the sensor layer, the data acquisition pipeline, the time synchronization infrastructure. We don't process video. We engineer the conditions under which data is captured so that the inputs are actually capable of supporting the precision we need.
You cannot fix a measurement problem with a better algorithm. You fix it with a better instrument.
What Would a Real System Look Like?

So my team and I designed one. Not a tweak to VAR. A replacement for the entire measurement architecture.
The core insight is deceptively simple: decouple the measurement of time from the measurement of space. Let the ball tell you when the kick happened. Let the cameras tell you where the players were. And use mathematics to fuse those two streams into a single, precise reconstruction of reality.
The Ball Knows When It's Kicked
We propose embedding a 500Hz Inertial Measurement Unit — an accelerometer and gyroscope sampling 500 times per second — in the center of the match ball. When a boot strikes the ball, the accelerometer registers a massive spike in G-force with a characteristic waveform: sharp rise time under 2 milliseconds, rapid decay as the ball leaves the foot. This is distinct from a bounce (lower magnitude, longer contact) or a header (softer curve due to skull compliance).
By analyzing the spectral signature of the impact, the system identifies the exact onset of ball deformation — the physical instant of "first contact" as the laws of the game define it. The timestamp precision: ±1 millisecond. Compare that to the ±10 milliseconds of manual frame selection.
One thing we argued about internally for weeks: the sensor has to handle ±200g of acceleration. A professional strike generates forces that would instantly saturate a consumer-grade accelerometer at ±16g, clipping the data and destroying the waveform. The sensor also has to sit at the ball's exact center of mass, suspended on tensioned filaments inside the bladder, so the ball flies true. Any deviation and you've built a loaded die. The engineering constraints are severe, but they're solvable — FIFA's own connected ball technology at the 2022 World Cup proved the concept is viable.
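As a sketch of how onset detection could work: a minimal threshold-crossing detector over a synthetic 500 Hz trace. The 50g threshold and the spike shape are illustrative assumptions, not our production classifier:

```python
import numpy as np

def detect_kick_onset(accel_g, fs_hz=500, threshold_g=50.0):
    """Return the timestamp (s) of the first sample exceeding the impact
    threshold, or None. At 500 Hz sampling, the nearest-sample onset is
    within ±1 ms of the true contact instant."""
    idx = np.argmax(np.abs(accel_g) > threshold_g)  # first crossing, if any
    if np.abs(accel_g[idx]) <= threshold_g:
        return None
    return idx / fs_hz

# Synthetic trace: quiet flight, then a sharp ~150 g spike at t = 0.100 s
# with rapid decay, as a boot strike would produce.
t = np.arange(0, 0.2, 1 / 500)
accel = np.random.default_rng(0).normal(0, 1, t.size)  # sensor noise, in g
accel[50:53] += [150.0, 90.0, 20.0]                    # impact waveform

print(f"kick onset at {detect_kick_onset(accel) * 1000:.0f} ms")
# -> kick onset at 100 ms
```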
The Cameras See Where Everyone Is
For the spatial layer, we replace broadcast cameras with 12 to 16 fixed-position, calibrated machine vision cameras running at 200 frames per second with global shutters.
The frame rate increase matters enormously. At 200fps, the inter-frame interval drops from 20 milliseconds to 5 milliseconds. The "blind spot" — the maximum distance a player can move between frames — shrinks from 28 centimeters to 7 centimeters. But the bigger win is motion blur. At 200fps, the shutter speed must be 1/1000th of a second or faster. The blur smear drops from 10–20 centimeters to a centimeter or two. Players become crisp, measurable objects instead of probability clouds.
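The blind-spot arithmetic, for both frame rates, at the same 14 m/s relative speed used earlier:

```python
# Inter-frame "blind spot": distance covered between consecutive frames.
for fps in (50, 200):
    gap_ms = 1000 / fps
    blind_cm = (gap_ms / 1000) * 14.0 * 100   # 14 m/s relative speed
    print(f"{fps} fps: {gap_ms:.0f} ms gap -> {blind_cm:.0f} cm blind spot")
# -> 50 fps: 20 ms gap -> 28 cm blind spot
# -> 200 fps: 5 ms gap -> 7 cm blind spot
```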
Global shutters matter too. Broadcast cameras use rolling shutters that read the image line by line. A fast-moving leg gets geometrically distorted — elongated or compressed depending on its direction relative to the readout. Global shutter sensors expose every pixel simultaneously. The geometry is preserved exactly as it existed at the moment of exposure.
And because these are fixed, calibrated cameras with overlapping fields of view, we can triangulate every player's 3D position using multi-view stereo geometry. When a limb is occluded in one camera angle — blocked by a defender in a crowded penalty box — it's almost certainly visible from another angle. Our system uses a voting mechanism: visible keypoints from unobstructed cameras contribute to the reconstruction, while occluded views are discarded. If a joint is partially hidden in all views, biomechanical constraints (a shin connects to a knee connects to a hip) allow inference with a calculated confidence interval.
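The geometric core of this, stripped of the voting and occlusion handling, is classic two-view triangulation. A toy sketch using the textbook linear (DLT) method with made-up camera matrices, not our production solver:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two calibrated views.
    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel observations."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)        # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

# Two toy cameras: identity pose, and a 1 m baseline along x.
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.3, 0.1, 5.0])     # a point 5 m in front of the rig
x1, x2 = project(P1, X_true), project(P2, X_true)
print(np.round(triangulate(P1, P2, x1, x2), 3))  # ≈ [0.3 0.1 5.]
```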
How Do You Fuse Two Different Sensors Into One Truth?
This is where the real engineering lives, and honestly, where I think Veriprajna's deepest contribution is.
You have skeletal tracking data at 200Hz and ball impact data at 500Hz. The kick happens at, say, timestamp 1234 milliseconds. The nearest camera frames are at 1230ms and 1235ms. You need to know where the striker's toe was at exactly 1234ms. You can't just pick the closest frame — that's a 1-millisecond error, which at 14 m/s is still 1.4 centimeters. For a system claiming sub-centimeter precision, that's unacceptable.
So we interpolate. But not with a straight line — human motion is curvilinear. A sprinting leg accelerates and decelerates through its stride. We use cubic spline interpolation, which constructs a smooth curve through the known data points while preserving continuity in velocity and acceleration. The result is a mathematically generated "virtual frame" — the reconstructed position of every player's skeleton at the exact millisecond of contact.
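A minimal sketch of the virtual-frame idea, using SciPy's cubic spline over invented toe positions (not real tracking data):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# 200 Hz frames: hypothetical toe x-positions (m) at 5 ms intervals,
# for a leg that is smoothly accelerating through its stride.
frame_times_ms = np.array([1220.0, 1225.0, 1230.0, 1235.0, 1240.0])
toe_x_m = np.array([10.000, 10.068, 10.139, 10.213, 10.290])

# Smooth curve through the samples, continuous in velocity and acceleration.
spline = CubicSpline(frame_times_ms, toe_x_m)

kick_ms = 1234.0   # the IMU's kick timestamp, between two camera frames
print(f"toe x at kick: {float(spline(kick_ms)):.4f} m")
# -> toe x at kick: 10.1980 m
```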
Before interpolation, we run the raw tracking data through an Unscented Kalman Filter. This is a mathematical framework that maintains a state model for each joint on every player's body — position, velocity, acceleration — and continuously reconciles what the physics predicts with what the cameras observe. If the neural network's detection jitters by a few centimeters frame-to-frame (which it always does), the filter smooths it out by trusting the physics. If the player makes a sudden cut, the filter increases trust in the optical measurement. The result is a clean, biomechanically consistent trajectory.
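To keep a sketch of that predict-and-reconcile cycle short, I'll use a plain linear Kalman filter with a constant-velocity model rather than the unscented variant; the cycle is the same. All numbers here are invented:

```python
import numpy as np

def smooth_track(zs, dt=0.005, q=1.0, r=0.03):
    """Constant-velocity Kalman filter over a 1-D keypoint track.
    zs: noisy positions (m) at 200 Hz; q: process noise intensity;
    r: measurement noise std (m). A simplified stand-in for the UKF."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r**2]])
    x, P = np.array([zs[0], 0.0]), np.eye(2)
    out = []
    for z in zs:
        x, P = F @ x, F @ P @ F.T + Q       # predict from the physics model
        y = z - H @ x                        # innovation: camera vs. prediction
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)       # how much to trust the measurement
        x = x + K @ y                        # reconcile
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

# A toe moving at a steady 8 m/s, observed with ±3 cm camera jitter at 200 Hz.
t = np.arange(0, 0.25, 0.005)
truth = 8.0 * t
noisy = truth + np.random.default_rng(1).normal(0, 0.03, t.size)
smoothed = smooth_track(noisy)
```

After the initial samples let the filter lock on to the velocity, the smoothed track sits well inside the raw camera jitter, which is exactly the "trust the physics" behavior described above.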
The critical architectural choice: tight coupling versus loose coupling. In a loosely coupled system, the vision system and the IMU each calculate positions independently, then you average them. Simple, but fragile — if the cameras lose a player behind a wall of defenders for 50 milliseconds, the average becomes meaningless. In our tightly coupled architecture, the raw residuals from both sensor streams feed into a single factor graph optimizer that solves for the most likely state satisfying all constraints simultaneously. Even during partial occlusion, the kinematic momentum established by the Kalman filter carries the estimate forward with high confidence until visual lock is reacquired.
We don't measure pixels. We reconstruct the physics of the moment and read the answer from the model.
For the complete mathematical framework — the Kalman filter state equations, the quaternion orientation estimation, the homography transforms — I've published the full technical deep-dive here.
What Happens to the Error Budget?

Let me put the two systems side by side, because the contrast is stark.
Current VAR at 50Hz with manual frame selection: temporal error of ±10ms, spatial uncertainty of ±14cm from frame selection alone, ±10cm from motion blur. Total zone of uncertainty: roughly 30 to 40 centimeters.
Our architecture — 200Hz optical, 500Hz inertial, tightly coupled fusion: the IMU pins the kick to ±1ms. Cubic spline interpolation over a 5ms camera gap introduces sub-millimeter error for smooth biological motion. The remaining dominant error source is the neural network's keypoint placement accuracy — about ±2 to 3 centimeters. Total zone of uncertainty: roughly 2 to 3 centimeters.
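The fused-side budget, combined in quadrature the same way as the broadcast one; the spline and keypoint terms are the figures from the paragraph above:

```python
import math

# Error budget for the fused pipeline (quadrature sum of independent sources).
imu_cm    = 0.001 * 14.0 * 100   # ±1 ms kick timestamp at 14 m/s relative speed
spline_cm = 0.1                  # sub-millimeter interpolation error, rounded up
keypt_cm  = 2.5                  # neural network keypoint accuracy (±2-3 cm)

total = math.sqrt(imu_cm**2 + spline_cm**2 + keypt_cm**2)
print(f"combined uncertainty: ±{total:.1f} cm")  # -> combined uncertainty: ±2.9 cm
```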
That's an order of magnitude improvement. Decisions that were previously "too close to call" — where the margin fell inside the system's blind spot — become mathematically distinct.
"But This Would Be Incredibly Expensive"
It would cost real money, yes. Sixteen high-speed cameras, edge computing clusters with dual A100 or H100 GPUs in the stadium server room, a fiber-optic PTP backbone for sub-microsecond time synchronization, IMU-embedded match balls. This is not a cloud SaaS product you deploy with an API key.
But let me reframe the cost question. The Premier League generates over £3 billion annually in broadcast revenue. A single wrong offside call can swing a title race, trigger relegation worth hundreds of millions in lost revenue, and erode the trust of a global audience. The infrastructure I'm describing would cost a fraction of what a single major club spends on transfers in a window.
The real resistance isn't cost. It's institutional inertia. Football's governing bodies bought into VAR as a finished product. Admitting it needs fundamental re-engineering — not just better operators or thicker tolerance lines — means admitting the original promise was oversold. Nobody wants to have that conversation.
People also ask me: what happens if the ball sensor fails mid-match? The system degrades gracefully to optical-only mode. At 200fps, the error margin increases to about 7 centimeters — still dramatically better than the current 28-centimeter blind spot. The match continues without interruption.
And what about the "scuffed" pass — a dribble where the foot maintains continuous contact with the ball? The IMU detects continuous vibration instead of a sharp spike, and the system switches logic to track the moment of release, when the vibration ceases. We've thought through these edge cases because they're the ones that would actually break a deployed system.
This Isn't Really About Offside
Once you build a sensor fusion architecture with this level of fidelity, offside is just the first application. The same 3D skeletal data and high-frequency ball tracking enable automated handball detection — modeling the "natural silhouette" as a volumetric boundary in 3D space and detecting arm movements toward the ball trajectory that exceed what the torso rotation implies. The same Kalman velocity derivatives that track player position can calculate the exact G-force of every step and deceleration event, flagging the cumulative knee loads that precede ACL tears before they happen.
The stadium becomes a digitized physics laboratory. And the sport becomes, for the first time, genuinely measurable.
The Uncanny Valley of Officiating Technology
There's a concept from robotics called the uncanny valley — the point where something is almost human-like enough to be convincing but just off enough to be deeply unsettling. VAR lives in the uncanny valley of measurement technology. It's precise enough to make us believe it's capturing truth, but imprecise enough to routinely get it wrong. That gap — between the appearance of certainty and the reality of uncertainty — is what drives fans insane.
The people who say "VAR ruins the game" aren't being emotional. They're responding to a real phenomenon: a system that presents guesses as facts. The pixel-precise lines, the freeze frames, the clinical graphics — they all project an authority the underlying physics cannot support.
The solution isn't to go backward. Nobody wants to return to the days when a linesman's split-second glance decided a World Cup semifinal. The solution is to go deeper. To stop measuring pixels and start measuring physics. To build instruments worthy of the claims we're making.
Football doesn't need less technology. It needs technology that respects the physics of the sport it's trying to govern.
We don't need thicker tolerance lines or more forgiving protocols. We need a system that actually captures what happened — with sensors fast enough, precise enough, and fused tightly enough to reconstruct the truth of a moment that lasts 8 milliseconds and decides everything.
That's what we're building. Not because we think technology should replace human judgment in football. But because when technology does intervene, it should at least be right.


