[Hero image: a knowledge graph structure overlaid on the concept of recruitment — connecting skills to roles with visible, traceable paths, contrasting transparency against opacity.]
Artificial Intelligence · Hiring · Machine Learning

Amazon Built an AI Recruiter That Taught Itself to Hate Women. I Built One That Can't.

Ashutosh Singhal · February 11, 2026 · 12 min read

In 2014, a team of machine learning engineers in Edinburgh sat down to solve recruiting at Amazon scale. Feed the system 100 resumes, get back the top five, ranked one to five stars — like rating products. Elegant. Efficient. And within three years, they discovered the system had taught itself that being female was a disqualifying characteristic.

The AI penalized resumes containing the word "women's" — as in "Women's Chess Club Captain." It downgraded graduates of two all-women's colleges. Not because anyone told it to. Because when you train a model on ten years of hiring data from a male-dominated industry, "being male" becomes, statistically, one of the strongest predictors of "being hired."

I remember reading the Reuters exposé when it broke. I was already deep into building knowledge graph systems at Veriprajna, and my first reaction wasn't shock — it was recognition. I'd been arguing for months that statistical correlation engines had no business making decisions about human potential. The Amazon story wasn't an anomaly. It was a mathematical inevitability. And it radicalized me into believing that the entire architectural approach to AI recruitment was broken — not at the edges, but at the foundation.

The Problem Isn't the Bias. It's the Architecture.

Here's what most people get wrong about the Amazon debacle: they think the engineers were careless. They weren't. They were some of the best ML engineers on the planet. When they discovered the gender bias, they tried to fix it. They explicitly programmed the model to ignore gender-specific terms. And the model found workarounds.

This is the concept of proxy variables, and it's the thing that keeps me up at night. Deep learning models are relentless pattern-finders. Remove the word "woman" from the input, and the model latches onto sentence structure. Studies show male resumes tend to use verbs like "executed" and "captured," while female resumes lean toward more communal language. The model sees "executed" correlating with "hired" and quietly reconstructs the gender bias through linguistics alone.

Amazon's engineers couldn't surgically remove the bias without destroying the model's predictive capability. So they killed the entire project.

You cannot fix a system that discriminates by accident. You have to build one that can't discriminate by design.

That sentence has been my north star for three years. And it's the reason we built Veriprajna's recruitment engine on knowledge graphs instead of neural networks.

Why Does Every AI Recruiter Eventually Learn to Discriminate?

I need you to understand something about how deep learning works in recruitment, because the failure mode is counterintuitive.

A neural network doesn't understand what "Python" means. It doesn't know Python is a programming language useful for data science. It only knows that the string "Python" appeared frequently in the resumes of people who got hired. If "Lacrosse" also appeared frequently — maybe because of socioeconomic correlations between certain sports and certain schools that feed into certain companies — the model might weigh "Lacrosse" as heavily as "Python."

This is correlation masquerading as intelligence. The model doesn't reason about cause and effect. It finds patterns and optimizes for them. And here's the insidious part: bias amplification means these models don't just replicate historical biases — they exaggerate them. If men were 60% of the workforce in the training data, the model might push toward hiring 80% or 90% men to maximize its accuracy score.

I had a conversation with a potential investor early on who told me, "Just use GPT-4 for resume screening. Everyone else is." I asked him: if you feed the same resume into GPT-4 twice, do you get the same score? He paused. The answer is no. LLMs are stochastic: run the same input twice and you can get two different outputs. In an audit scenario, that's not a quirk. That's a compliance failure.

The Regulatory Walls Are Closing In

This isn't theoretical anymore. Governments have seen the Amazon story and they're legislating.

NYC Local Law 144, effective since July 2023, requires any employer using an automated employment decision tool to undergo an annual independent bias audit. Not a vague "we checked for fairness" audit — a specific, quantitative one. The law mandates calculation of selection rates and impact ratios for every category of race, ethnicity, and sex. If the selection rate for a protected group divided by the rate for the most-selected group drops below 0.8 — the "four-fifths rule" — that's prima facie evidence of disparate impact.
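The four-fifths check itself is simple arithmetic. A minimal sketch, using hypothetical audit numbers (the function name and figures are illustrative, not from any real audit):

```python
def impact_ratio(selection_rates: dict[str, float]) -> dict[str, float]:
    """Impact ratio per the four-fifths rule: each group's selection
    rate divided by the selection rate of the most-selected group."""
    top = max(selection_rates.values())
    return {group: rate / top for group, rate in selection_rates.items()}

# Hypothetical audit: 40 of 100 male applicants selected, 25 of 100 female.
rates = {"male": 40 / 100, "female": 25 / 100}
ratios = impact_ratio(rates)
print(ratios["female"])  # 0.625 -- below the 0.8 threshold, a red flag
```

Any group whose ratio falls under 0.8 triggers the disparate-impact presumption, which is why auditors want this number computed per race, ethnicity, and sex category.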

The EU AI Act goes further. It classifies AI systems used for recruitment as High-Risk — the same category as medical devices and critical infrastructure. Article 13 demands that these systems be "sufficiently transparent to enable users to interpret the system's output." Article 14 requires human oversight — the ability to override AI decisions. But you can't meaningfully override a decision you don't understand.

And under GDPR, Article 15(1)(h) grants data subjects the right to access "meaningful information about the logic involved" in automated decisions. Recital 71 explicitly mentions the right to "obtain an explanation of the decision reached."

Try explaining a neural network's decision. Go ahead. "Neuron 4,502 fired at intensity 0.8" is not a meaningful explanation. Neither is "the model determined you were a 73% match" with no further detail.

The gap between technical complexity and the legal requirement for simple explanation is the central crisis of modern HR Tech.

I wrote about this regulatory landscape in more depth in the interactive version of our whitepaper, which walks through exactly how each regulation applies to different AI architectures.

What If the AI Couldn't See Gender at All?

This is where I need to tell you about the night everything clicked for me.

We'd been experimenting with different approaches to debiasing — adversarial training, counterfactual augmentation, the usual toolkit. And I was sitting in our office at 11 PM, staring at a graph visualization on my screen, when I had one of those obvious-in-retrospect realizations: we were trying to teach the model to ignore bias. What if we built an architecture where bias literally couldn't enter the reasoning engine?

In a knowledge graph, data is stored as nodes (entities) and edges (relationships). A Person node connects to Skill nodes. Skill nodes connect to other Skill nodes through semantic relationships. The graph knows that "PyTorch" is a library for "Deep Learning," which is a subset of "Artificial Intelligence." So if a job requires "AI experience" and a candidate lists "PyTorch," the graph traces the path and finds a match — even without the keyword "AI" appearing anywhere on the resume.

Here's the critical architectural decision: when our matching algorithm runs, it operates on a restricted subgraph. This inference graph contains Skills, Roles, Experience levels, and Certifications. It explicitly excludes nodes for Name, Gender, Ethnicity, Address, and graduation dates.

The bias isn't suppressed. It's structurally severed. There is no path from "Candidate" to "Gender" to "Role" because the Gender node doesn't exist in the graph the algorithm can see.

Compare this to a deep learning model, which ingests the entire raw text. Even if you remove the "Gender" field, the model reads "Women's Chess Club" and infers gender. In our system, the LLM that parses the resume maps "Women's Chess Club" to a neutralized node: (:Activity {type: "Strategy Club", role: "Leadership"}). The gendered modifier is stripped before it enters the reasoning engine.
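The "structurally severed" claim comes down to a projection step. A minimal sketch, with assumed node labels (the real schema may differ):

```python
# Labels the matching engine is permitted to see -- everything else,
# including Name, Gender, Ethnicity, and Address, never enters.
ALLOWED_LABELS = {"Skill", "Role", "Experience", "Certification", "Activity"}

def inference_subgraph(nodes: list[dict]) -> list[dict]:
    """Project only competency nodes into the graph the matcher reasons over."""
    return [n for n in nodes if n["label"] in ALLOWED_LABELS]

parsed = [
    {"label": "Skill", "name": "PyTorch"},
    {"label": "Activity", "type": "Strategy Club", "role": "Leadership"},
    {"label": "Name", "value": "Jane Doe"},      # excluded by construction
    {"label": "Gender", "value": "female"},      # excluded by construction
]
print([n["label"] for n in inference_subgraph(parsed)])
# ['Skill', 'Activity']
```

The point of the sketch: the exclusion is an allowlist on the data structure itself, not a behavioral instruction the model can route around.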

I remember the team argument about this. One of my engineers pushed back hard — he thought we were losing valuable signal by stripping context. "What if the Women's Chess Club is actually more competitive than the regular one?" Fair point. But we weren't optimizing for maximum information extraction. We were optimizing for fairness under legal scrutiny. And I'd rather miss a marginal signal than build a system that learns to penalize half the population.

How Do You Actually Measure Talent Without Bias?

[Figure: a labeled knowledge graph snippet showing how skills connect semantically, with a concrete example of the Docker-to-Kubernetes path and the skill distance scoring concept.]

We don't predict who will succeed. We measure skill distance — the geometric gap between what a candidate has and what a job requires. This moves recruitment from subjective probability to objective measurement.

Traditional applicant tracking systems use Boolean logic: does the resume contain the keyword "Java"? Yes or no. This is brittle and stupid. It misses anyone who uses different terminology for the same competency.

We use graph embeddings — algorithms like Node2Vec that learn a vector representation for every skill in our ontology. Skills that frequently co-occur in the graph (like "Python" and "Pandas") end up close together in vector space. Skills that are unrelated (like "Python" and "Phlebotomy") end up far apart.

To score a candidate, we calculate cosine similarity between the candidate's skill vector set and the job's requirement vector set. This gives us partial credit. A candidate who lacks "Tableau" but has "Power BI" gets a high similarity score because those nodes are semantic neighbors in the "Business Intelligence" cluster. A keyword search would give them a zero.
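Cosine similarity itself is a one-liner over the embedding vectors. A sketch with made-up 3-dimensional vectors (real Node2Vec embeddings have hundreds of dimensions, and these particular values are purely illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: BI tools cluster together, phlebotomy doesn't.
tableau = [0.90, 0.20, 0.10]
power_bi = [0.85, 0.25, 0.15]
phlebotomy = [0.05, 0.10, 0.95]

print(cosine(tableau, power_bi))    # high: semantic neighbors, partial credit
print(cosine(tableau, phlebotomy))  # low: unrelated skills
```

This is the "partial credit" mechanism: the Power BI candidate scores near 1.0 against a Tableau requirement instead of the zero a keyword filter would assign.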

We layer on Jaccard similarity for raw skill overlap and geodesic distance — shortest-path calculations through the graph — for gap analysis. If a job requires Kubernetes and a candidate has Docker, the graph finds the path: Docker → Containerization → Orchestration → Kubernetes. Distance: 3 hops. Interpretation: trainable. If the distance is 6+ hops, it's a hard gap.

The final skill distance score is a purely competency-based metric, completely blind to demographics. We don't guess who's good. We measure how close they are.

For the full technical breakdown of these algorithms — including the math behind cosine similarity and our composite scoring model — see our research paper.

The "Missing SQL" Moment

Let me make this concrete with something that happened during testing.

We ran a candidate profile through both a standard black box recruiter and our system. The black box rejected the candidate. No reason given. (We later determined the candidate attended a small, lesser-known college — a classic pedigree penalty.)

Our system returned this: "Candidate lacks explicit SQL experience. However, graph analysis shows extensive experience with Pandas DataFrames and R dplyr. Graph distance between DataFrames and SQL is short (shared concept: Data Manipulation). Recommendation: Interview. High transferability."

That candidate — the one the black box threw away — had every skill the job needed. They just used different words for it. And they went to a school the black box hadn't seen enough of in its training data to consider "successful."

This is what I mean when I say knowledge graphs expand the talent pool. They find people who have the competencies but not the pedigree or the exact vocabulary. And that naturally improves diversity — not through quotas or adjustments, but through better measurement.

What Happens When the System Flags a Problem?

People ask me: "What if your system still produces biased outcomes?" It's a fair question, and I'd be suspicious of anyone who claimed their system was perfect.

Here's the difference: when a black box produces biased outcomes, you're stuck. You can see the disparate impact in the numbers, but you can't see why. Is it the university names? The zip codes? The writing style? You're debugging a system with millions of parameters and no legible logic.

When our system produces a statistical anomaly — say, an impact ratio below 0.8 for a particular demographic group — we can trace it. We can identify the specific graph nodes causing the disparity. Maybe a job description requires a particular expensive certification that correlates with socioeconomic status. We can see that, flag it, and the hiring team can decide whether that certification is truly necessary or just a legacy requirement nobody questioned.

The glass box doesn't mean the system is always right. It means when it's wrong, you can find out why and fix it.

The LLM Still Has a Job — Just Not the Important One

[Figure: architecture diagram comparing how data flows through a black box neural network vs. Veriprajna's knowledge graph system, showing where bias enters and where it's structurally blocked.]

I should be clear: we use LLMs. We're not Luddites. But we use them the way you'd use a translator — for reading and writing, not for judging.

Our architecture enforces a strict separation of concerns. The LLM handles perception: it reads unstructured resume text and extracts entities. "I orchestrated a team of 5 developers to build a React Native app" becomes structured data — Skill: React Native, Skill: Team Leadership, Context: Mobile Development. The LLM normalizes synonyms: "ReactJS" and "React.js" both map to the same node.

But the LLM never makes a hiring decision. All matching, scoring, and ranking happens through deterministic graph traversal. Same graph plus same query equals same result, every time. We also use the LLM at the output end — it generates human-readable explanations, but only from graph-verified facts. It can't hallucinate a skill match that the graph doesn't support.
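That separation of concerns is easy to demonstrate in miniature. In the sketch below, the alias table and scoring function are hypothetical stand-ins (in a real system the synonym mapping lives in the curated ontology, not in code); the point is that once skills are canonicalized, the score is a pure function of its inputs:

```python
import re

# Hypothetical alias table: surface forms map to one canonical node.
ALIASES = {"reactjs": "React", "react.js": "React", "react": "React"}

def canonical_skill(raw: str) -> str:
    """Normalize 'ReactJS' / 'React.js' / 'react' to the same graph node."""
    key = re.sub(r"\s+", "", raw.strip().lower())
    return ALIASES.get(key, raw.strip())

def match_score(candidate: set[str], required: set[str]) -> float:
    """Deterministic overlap score: same inputs, same number, every run."""
    cand = {canonical_skill(s) for s in candidate}
    req = {canonical_skill(s) for s in required}
    return len(cand & req) / len(req) if req else 0.0

print(match_score({"ReactJS", "Team Leadership"}, {"React.js"}))
# 1.0 -- and identically 1.0 on every run, which a stochastic LLM can't promise
```

Determinism is what makes the audit trail meaningful: rerunning the match during a compliance review reproduces the original decision exactly.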

I think of it as the LLM being the eyes and mouth of the system, while the knowledge graph is the brain. You wouldn't let your mouth make decisions for you. (Well, most of us wouldn't.)

What Are We Really Choosing Between?

The way I see it, the industry is at a fork. One path leads to bigger models, more parameters, more opacity — and an endless game of whack-a-mole with bias that keeps finding new proxy variables to exploit. The other path leads to structured reasoning, semantic measurement, and systems that can explain themselves to a regulator, a recruiter, or a rejected candidate.

I've talked to HR leaders at companies still using black box screening tools. They know the risk. They've read about Amazon. But switching architectures feels expensive and uncertain, so they keep patching. They add "bias mitigation layers" on top of fundamentally biased systems. They hire consultants to run annual audits that tell them what's broken without giving them the tools to fix it.

Data is a mirror. If you train a model on the past, you replicate the past. In a world striving for equity, replicating the past is a failure condition.

I'm not going to end this with a hedge. I've spent years building this, I've seen the alternative fail spectacularly, and I'm confident in the conclusion: the future of recruitment AI is not about predicting who will succeed based on who succeeded before. It's about measuring the actual distance between what someone can do and what a job requires — and making that measurement transparent, deterministic, and structurally incapable of discrimination.

You can keep predicting the past. Or you can start measuring the future.
