A stylized visualization showing a human silhouette mid-squat with its hip joint trajectory traced as a clean sinusoidal waveform, bridging the physical and signal-processing domains.
Artificial Intelligence · Health Technology · Startups

Your Fitness App Can't Tell If You're Lying — And That's a Billion-Dollar Problem

Ashutosh Singhal · February 21, 2026 · 16 min read

Last November, I watched a demo that broke something in my brain.

A corporate wellness vendor was pitching their "AI-powered fitness platform" to a room full of insurance executives. The demo was slick — a user on screen doing squats, the app counting reps, awarding points, the whole gamified package. The executives nodded approvingly. Then I asked a question that made the room go quiet: "What happens if the user just bobs up and down three inches instead of actually squatting?"

The vendor smiled. "Well, we trust our users to—"

"You trust them," I said. "But you're asking this insurer to price risk based on that trust."

That was the moment I knew we were building the right thing at Veriprajna. Not another chatbot. Not another wrapper around GPT. Something the industry desperately needed but hadn't figured out how to articulate: AI that doesn't generate answers — it verifies physical reality.

The fitness and digital health industry has a dirty secret. The $60 billion corporate wellness market, the insurance discount programs, the move-to-earn crypto projects — they're all built on data that dissolves the moment you audit it. And nobody wants to talk about it because the dashboards look great.

I'm going to talk about it.

Why Can't Your Fitness App Tell If You Actually Worked Out?

Here's the architecture of almost every fitness app on the market: it's a video player with a recommendation engine bolted on top. You press play, an instructor does pushups, you're supposed to follow along, and when the video ends, the app logs the workout as "complete." It estimates your caloric burn from generic tables. It gives you a badge.

At no point did the app verify that you moved.

The app assumes that consumption equals completion. It asks, "Did you do the work?" and uncritically accepts "Yes" as the answer.

This isn't a niche complaint. This is the foundational architecture of a multi-billion-dollar industry. And it fails for a reason that any behavioral economist could have predicted.

There's a principle called Campbell's Law — the social scientist Donald Campbell observed that the more you use a metric for decision-making, the more people will corrupt that metric. Attach money to step counts, and people strap Fitbits to ceiling fans. Attach insurance discounts to "workout completion," and people let videos play while they eat dinner.

This isn't hypothetical. Remember STEPN, the move-to-earn crypto project? It collapsed partly because of the arms race between the protocol's ability to detect valid movement and users' ability to fake it with GPS spoofing and mechanical shakers. When verification is weak, fraud becomes rational. Honest participants get punished. The incentive provider goes broke.

I kept coming back to a line I eventually started saying in every pitch meeting: You cannot gamify what you cannot verify.

The Night We Realized Pose Estimation Isn't Intelligence

We didn't start with this insight. We stumbled into it.

Early on, my team was excited about pose estimation — libraries like BlazePose and MoveNet that extract skeletal joint coordinates from video. We thought, great, we'll use these to build a fitness verification system. We spent weeks integrating MoveNet, getting clean skeleton data streaming from a phone camera, and then we sat down to actually use the data for verification.

That's when the arguments started.

One of my engineers, convinced we were almost done, pulled up a single frame of skeleton data — a person with bent elbows and a lowered torso. "See? Pushup," he said.

"Is it?" I asked. "Are they going down or coming up? Have they been holding that position for thirty seconds or thirty milliseconds? Are they trembling with fatigue or perfectly controlled?"

A single frame tells you nothing. A skeleton coordinate at one moment in time is semantically void. It's like handing someone a raw voltage reading from an ECG and asking for a cardiac diagnosis. The sensor provides data. Intelligence interprets the signal.

We had built a very good sensor. We had built zero intelligence.

That was a rough week. We'd been so focused on the computer vision piece — getting clean joint coordinates — that we'd confused the prerequisite with the solution. And every competitor in the space was making the same mistake, marketing pose estimation as "AI-powered fitness" when it was really just a fancy ruler.

I wrote about this paradigm shift — from vision to signal processing — in more depth in the interactive version of our research. But the core realization was simple and it changed everything we built afterward.

What If the Human Body Is a Radio Signal?

Here's the reframe that unlocked our entire approach.

When a person does squats, the vertical position of their hip joint traces a wave over time. Not metaphorically — literally. It's a sinusoidal signal. Jumping jacks produce a periodic waveform in shoulder angular velocity. Walking generates complex multi-harmonic signals across the lower body.

The human body, performing repetitive exercise, is a mechanical oscillator.

Once you see it that way, you stop thinking about computer vision and start thinking about signal processing. Suddenly you have access to a completely different mathematical toolkit:

  • Amplitude tells you the depth of the squat
  • Frequency tells you the cadence
  • Phase tells you whether the left and right sides are coordinated
  • Spectral purity tells you whether the movement is controlled or shaky

We're no longer asking an AI to "guess" what exercise is happening. We're measuring the physics of a waveform. The question shifts from "What does this look like?" to "What does this measure?"

We reframed Human Activity Recognition not as an image classification problem, but as a Digital Signal Processing problem. That single decision made verification possible.
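
To make that concrete, here's a minimal sketch of what the signal-processing view looks like in practice. It assumes a hip-height trace sampled at 30 fps; the function name, the "spectral purity" formula, and the synthetic example are illustrative, not our production pipeline.

```python
import numpy as np

def waveform_metrics(hip_y: np.ndarray, fps: float = 30.0) -> dict:
    """Illustrative DSP view of a squat: treat the hip's vertical
    trajectory as a signal and measure it, rather than classify it."""
    signal = hip_y - hip_y.mean()              # remove DC offset (standing height)
    spectrum = np.abs(np.fft.rfft(signal))     # magnitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)

    peak = spectrum[1:].argmax() + 1           # ignore the zero-frequency bin
    cadence_hz = freqs[peak]                   # dominant rep frequency
    amplitude = (signal.max() - signal.min()) / 2.0   # proxy for squat depth

    # "Spectral purity": how much energy sits at the dominant frequency
    # and its immediate neighbours versus everywhere else.
    band = slice(max(peak - 1, 1), peak + 2)
    purity = spectrum[band].sum() / (spectrum[1:].sum() + 1e-9)

    return {"cadence_hz": float(cadence_hz),
            "amplitude": float(amplitude),
            "spectral_purity": float(purity)}

# Example: a clean 0.5 Hz squat rhythm recorded for 20 seconds at 30 fps.
t = np.arange(0, 20, 1 / 30)
print(waveform_metrics(np.sin(2 * np.pi * 0.5 * t)))
```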

But raw signal processing — Fourier Transforms and the like — is brittle when applied to real human movement. People change speeds. Camera angles shift. Arms occlude legs. You need deep learning to handle the noise. The question was: which architecture?

Why We Threw Out LSTMs

If you've taken any machine learning course in the last decade, you learned that sequential data means recurrent neural networks. LSTMs — Long Short-Term Memory networks — were the gold standard. Text, audio, time series — everything went through an LSTM.

We tried. It didn't work. Not in the way we needed it to.

The problems were fundamental, not fixable with hyperparameter tuning. LSTMs process data sequentially — to compute what's happening at frame 100, you must first process frames 1 through 99. On a mobile phone running in real time, that serial bottleneck creates latency that kills the user experience. You can't tell someone "go lower" two seconds after they've already come back up.

Worse, LSTMs forget. Their "memory" degrades over long sequences. A five-minute yoga set or a fifty-rep pushup challenge generates thousands of frames, and by the end, the model has lost the context from the beginning. We saw this as drift — the model's confidence in its own counting would erode over time, like a person losing count in their head.

There was a team meeting where we laid out the numbers. The latency was unacceptable. The memory was unreliable. The computational cost of running LSTMs on thousands of concurrent enterprise streams was prohibitive. Someone said, "Maybe we need to rethink the whole architecture."

Someone else said, "Maybe we need convolutions."

That person was right.

How Does a Temporal Convolutional Network Actually Work?

Diagram showing how dilated causal convolutions exponentially expand the receptive field across layers, allowing the TCN to see both immediate frames and long-term context simultaneously.

Temporal Convolutional Networks — TCNs — take the convolutional architecture that revolutionized image recognition and apply it to the time domain. Instead of sliding a filter across pixels in an image, you slide it across time steps in a signal. But two design choices make TCNs radically different from anything that came before.

First: causal convolutions. The network at time t only looks at data from time t and earlier. It never peeks into the future. This sounds obvious, but it's a mathematical guarantee that matters enormously for real-time verification. We're not retroactively deciding whether a rep was valid after the set is over — we're verifying it as it happens.

Second, and this is the part that still excites me: dilated convolutions. Instead of looking at adjacent time steps, the network introduces spacing between the points it examines. And that spacing grows exponentially with each layer. Layer 1 sees adjacent frames. Layer 2 skips one. Layer 3 skips three. By layer 10, a single filter captures a window of 512 frames.

This means the network can simultaneously attend to what's happening right now — is the knee collapsing inward on this specific frame? — and what's been happening over the last three minutes — is the movement periodicity degrading in a way that suggests fatigue?

A TCN with dilated convolutions sees both the instantaneous physics of a single frame and the long-term temporal context of an entire workout. No other architecture gives you both at once.

And because convolutions are parallel operations, not sequential ones, the whole thing runs fast enough for real-time mobile inference. Training is faster too — no exploding gradients, no vanishing gradients, just stable backpropagation through a fixed-depth network.
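
Here's a minimal PyTorch sketch of the idea, not our production model: a stack of dilated causal blocks whose dilation doubles at every layer. The channel counts and depth are placeholders.

```python
import torch
from torch import nn

class CausalConv1d(nn.Module):
    """1D convolution that only sees the present and the past:
    pad on the left by (kernel_size - 1) * dilation, never on the right."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # left-pad only -> no future leakage
        return self.conv(x)

class TCNBlock(nn.Module):
    """Two causal convolutions plus a residual connection."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, channels, kernel_size, dilation), nn.ReLU(),
            CausalConv1d(channels, channels, kernel_size, dilation), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)

# Dilation doubles per block: 1, 2, 4, 8, ... so the receptive field grows
# exponentially with depth; ten blocks with kernel_size=2 already cover
# on the order of a couple of thousand past frames.
tcn = nn.Sequential(*[TCNBlock(channels=64, kernel_size=2, dilation=2 ** i)
                      for i in range(10)])

pose_features = torch.randn(1, 64, 3000)   # e.g. 64 skeletal features over 3000 frames
out = tcn(pose_features)                   # one output per frame, computed in parallel
print(out.shape)                           # torch.Size([1, 64, 3000])
```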

For the full technical breakdown — including the comparative performance data against LSTMs and the mathematics of our signal analysis — see our research paper.

Counting Reps Without Knowing What a Rep Is

One of our early design decisions was controversial, even within the team.

Most fitness apps that attempt rep counting train a specific model for each exercise. A "pushup counter." A "squat counter." A "bicep curl counter." This means every new exercise requires new training data, new labeling, new deployment. It's brittle and it doesn't scale.

We went a different direction. We built a class-agnostic counting system based on temporal self-similarity. The idea: if a movement is repetitive, the signal will be similar to itself at regular intervals. You don't need to know what the exercise is. You just need to detect that the signal is repeating.

The TCN maps the skeletal pose sequence into a compressed representation, then we compute the similarity between every pair of time steps. Repetitive action shows up as a distinct visual pattern — parallel lines of high similarity. The distance between those lines is the rep duration. The intensity of the lines tells you how consistent the reps are.

This works for squats, kettlebell swings, rowing, jumping jacks, or any rehab movement a physical therapist invents next Tuesday. We detect the physics of repetition itself, not the identity of the exercise.
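
A rough numpy sketch of the self-similarity trick, using a synthetic embedding as a stand-in for the TCN's compressed representation. The period-estimation step is deliberately naive; the point is that nothing in it knows or cares which exercise it's looking at.

```python
import numpy as np

def count_reps_from_embeddings(emb: np.ndarray, fps: float = 30.0):
    """emb: (T, D) array, one embedding per frame (a stand-in here
    for the TCN's compressed representation of the pose sequence)."""
    # Temporal self-similarity matrix: how alike is frame i to frame j?
    norm = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)
    ssm = norm @ norm.T                          # (T, T) cosine similarity

    # For repetitive movement, similarity along off-diagonals spaced by
    # the rep period is high. Average each off-diagonal to find that spacing.
    T = len(emb)
    diag_means = np.array([np.mean(np.diag(ssm, k)) for k in range(1, T // 2)])

    # First strong peak after the similarity dips = rep period in frames.
    trough = int(diag_means.argmin())
    period = trough + 1 + int(diag_means[trough:].argmax())

    return T / period, period / fps              # rep count, seconds per rep

# Example: 10 "reps" of a synthetic periodic embedding (20 s at 30 fps).
t = np.arange(600)
emb = np.stack([np.sin(2 * np.pi * t / 60), np.cos(2 * np.pi * t / 60)], axis=1)
print(count_reps_from_embeddings(emb))           # ~ (10.0, 2.0)
```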

I'll admit there was a moment of doubt. An investor told me, "Just use GPT-4 with video input. It can count pushups." I asked him to try it with someone doing quarter-reps at variable speed while a toddler walked through the frame. He stopped bringing it up.

What Happens When You Measure Form, Not Just Count?

Side-by-side comparison showing what a traditional fitness app logs per workout versus what a Veriprajna verified rep contains as a data packet.

Counting is necessary but nowhere near sufficient. Someone can do fifty "pushups" with one-inch range of motion. The counter goes up. The physics says nothing happened.

We built three metrics that turn a rep count into a quality assessment.

Depth. We track the trajectory of key joints — the hip during a squat, the chest during a pushup — and apply peak detection to the TCN-filtered signal. A rep is only valid if the displacement exceeds a biomechanical threshold. This isn't an opinion. It's a measurement of how far the joint actually traveled.
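
In code, the depth check is roughly this; the threshold and the smoothing assumptions are illustrative, and real thresholds are set per exercise:

```python
import numpy as np
from scipy.signal import find_peaks

def valid_depth_reps(hip_y: np.ndarray, min_displacement: float = 0.25):
    """Count only reps whose hip displacement exceeds a threshold.
    hip_y is assumed already smoothed (e.g. the TCN-filtered trajectory)
    and normalised so 1.0 ~ standing hip height; 0.25 is illustrative."""
    standing = np.percentile(hip_y, 95)            # top of the movement
    bottoms, _ = find_peaks(-hip_y, distance=15)   # squat bottoms, >= 0.5 s apart at 30 fps
    depths = standing - hip_y[bottoms]
    return int(np.sum(depths >= min_displacement)), depths
```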

Control. In signal processing, "jerk" is the third derivative of position — the rate of change of acceleration. High jerk means tremors, instability, or using momentum to cheat the movement. We calculate a normalized version called Log Dimensionless Jerk. A high score means the person is struggling or flinging themselves through the rep. In rehab and corporate wellness, this is a leading indicator of injury risk.
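
One standard way to compute it, sketched with simplified numerical derivatives and the sign convention described above (higher means shakier):

```python
import numpy as np

def log_dimensionless_jerk(position: np.ndarray, fps: float = 30.0) -> float:
    """Smoothness score for one rep's joint trajectory. Jerk is the third
    derivative of position; normalising by duration and peak speed makes
    the integral dimensionless, and we take the log of that."""
    dt = 1.0 / fps
    velocity = np.gradient(position, dt)
    jerk = np.gradient(np.gradient(velocity, dt), dt)

    duration = len(position) * dt
    peak_speed = np.max(np.abs(velocity)) + 1e-9

    dimensionless = (duration ** 3 / peak_speed ** 2) * np.sum(jerk ** 2) * dt
    return float(np.log(dimensionless + 1e-9))
```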

Symmetry. We compare the signal energy and phase between left and right sides. An asymmetry index reveals when someone is favoring one leg during a squat — often a precursor to injury or a sign of incomplete rehabilitation. This metric is impossible to self-report. You can't feel a 12% asymmetry. But the signal can measure it.
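
Sketched in the same spirit, with an energy-ratio index and a cross-correlation phase check; this particular formula is one common choice, not necessarily the exact one in our engine:

```python
import numpy as np

def asymmetry_index(left: np.ndarray, right: np.ndarray) -> float:
    """Compare signal energy between left and right sides (e.g. left vs.
    right knee vertical velocity). 0.0 = perfectly symmetric; ~0.12 would
    be the kind of 12% asymmetry described above."""
    e_left = np.sum((left - left.mean()) ** 2)
    e_right = np.sum((right - right.mean()) ** 2)
    return float(abs(e_left - e_right) / (e_left + e_right + 1e-9))

def phase_lag_frames(left: np.ndarray, right: np.ndarray) -> int:
    """Lag (in frames) at which the two sides best align;
    a non-zero lag means one side is leading the movement."""
    l, r = left - left.mean(), right - right.mean()
    xcorr = np.correlate(l, r, mode="full")
    return int(xcorr.argmax() - (len(l) - 1))
```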

A "Veriprajna Verified Rep" isn't a checkbox. It's a data packet containing timestamp, skeletal keypoint hash, TCN confidence score, and kinematic telemetry — depth, speed, jerk, symmetry. It's auditable. It's immutable. It's the difference between a claim and evidence.

The Privacy Architecture That Made Enterprise Clients Say Yes

I need to address something people always ask me: "You're analyzing people doing exercises on camera. How is this not a privacy nightmare?"

It would be, if we were streaming video to the cloud. We don't.

The phone runs a lightweight pose estimator on its Neural Processing Unit. This extracts skeletal coordinates — just numbers representing joint positions. A few kilobytes of data. The video frames are discarded immediately. No pixel data ever leaves the device. What gets transmitted to our cloud engine (or processed on-device for high-end phones) is anonymous kinematic data. Numbers. Not faces.
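
A back-of-envelope calculation, assuming a 33-keypoint BlazePose-style skeleton with x/y/z per joint, shows why the payload is so small:

```python
# Skeletal keypoints vs. the video frames they replace, assuming 33 joints,
# 3 coordinates each, 4-byte floats, and a 1080p RGB frame. No pixels involved.
keypoint_bytes_per_frame = 33 * 3 * 4     # ~400 bytes of joint coordinates
pixel_bytes_per_frame = 1920 * 1080 * 3   # ~6 MB of raw RGB
print(keypoint_bytes_per_frame, pixel_bytes_per_frame // keypoint_bytes_per_frame)
# -> 396 bytes per frame, roughly 15,000x smaller than the frame it came from
```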

This is GDPR and HIPAA compliance by architecture, not by policy. The biometric data — the video of someone's face and body — is never stored, never transmitted, never at risk. This wasn't an afterthought. We designed the entire system around this constraint because we knew enterprise clients wouldn't touch anything else.

Who Pays for Physics?

The economics of verified movement are staggering once you see them.

Insurance. Insurers currently offer discounts for gym memberships, which verify location, not effort. With verified functional movement data — five squats, five lunges, a balance hold — an insurer can assess stability, range of motion, and symmetry. These correlate strongly with fall risk in seniors and general metabolic health. Dynamic underwriting based on verified functional capacity, not static actuarial tables. The insurer who figures this out first wins the market.

Corporate wellness. A $60 billion industry where companies pay for outcomes they can't measure. Employees shake phones for step targets and claim Health Savings Account contributions. With verified active minutes, the barrier to fraud becomes physical effort. To fake a pushup on our system, you'd essentially need to build a humanoid robot — or just do the pushup.

Tele-rehab. Musculoskeletal disorders are a top cost driver for employers. Home exercise adherence is notoriously below 50%, and when patients do exercise, they often use poor form that delays recovery. A TCN monitoring prescribed joint angles gives clinicians a dashboard of verified compliance and quality trends. Remote Therapeutic Monitoring is now a reimbursable CPT code in the US — this isn't speculative. It's a revenue stream.

Move-to-earn, done right. The Web3 fitness projects failed because GPS is trivially spoofable. We provide the oracle for physical effort. Token minting gated by TCN verification creates an economy where supply is capped by the physical capacity of the user base, not the creativity of the cheaters.

"But Won't LLMs Eventually Do This?"

I hear this constantly. The assumption that because large language models keep getting better, they'll eventually solve everything, including physical verification.

They won't. And the reason is architectural, not a matter of scale.

LLMs are designed to produce the most likely next token. They're probabilistic. They generate plausible output. In creative and administrative domains, that's incredibly useful. But in physical verification, plausibility is the enemy. A medical diagnosis, a rehabilitation protocol, an insurance premium adjustment — these cannot be based on what's probably happening. They must be grounded in what's actually happening.

No amount of scaling changes the fundamental objective function. An LLM with a trillion parameters is still optimizing for likelihood, not truth. Our TCN is optimizing for the physics of a waveform — amplitude, frequency, phase, spectral purity. These are measurements, not predictions.

The other question I get: "Can't you just fine-tune a vision-language model on exercise videos?" You can. It will tell you "this looks like a pushup." It will not tell you that the left shoulder is carrying 15% more load than the right, that the jerk profile indicates early fatigue onset, or that the rep depth has degraded by 8% over the last two minutes. It will give you a label. We give you a signal analysis.

The AI industry is obsessed with generation. We're obsessed with verification. These are not the same discipline, and conflating them is how you end up pricing insurance premiums on hallucinations.

The Line Between Vibes and Physics

I think about this a lot: the entire digital health industry is sitting on one side of a line, and most of it doesn't realize the line exists.

On one side is what I call the Vibes Economy. Self-reported data. Step counts from devices that can be shaken. Workout completions from videos that can be ignored. Dashboards that look encouraging. Data that feels correct. It works until someone audits it, and then it evaporates.

On the other side is what we're building: the Physics Economy. Verified movement. Measured displacement. Quantified control. Auditable assets. Data that survives scrutiny because it was never based on trust in the first place.

The transition between these two economies is not incremental. You don't get 60% of the way to physics by adding a step counter to your video player. You either measure the waveform or you don't. You either verify the rep or you take the user's word for it.

Every enterprise we talk to — every insurer, every corporate wellness buyer, every tele-rehab platform — eventually arrives at the same realization. They've been paying for vibes and calling it data. The moment they see what verified movement data actually looks like, they can't unsee it.

I started Veriprajna because I believed the most important AI problem of this decade isn't generating better text. It's verifying physical reality. Every month that passes, every new LLM wrapper that launches, every fitness app that ships another video player with a badge system — I become more certain.

The future of health AI isn't smarter chatbots. It's honest measurement. And physics doesn't hallucinate.
