

HDM-Bench: How a 3B Model Outperformed GPT-4o at Hallucination Detection

DataFramer Team

Tue Apr 15

DataFramer built the full data foundation behind AIMon Labs’ HDM-2 — training data, evaluation sets, validation pipelines, and the HDM-Bench benchmark — powering an open-source hallucination detection model that beat GPT-4o and GPT-4o-mini across every major benchmark it was tested on.

F1 score on RAGTruth: 85.0
Model parameters: 3B
GPT-4o (est. parameters): ~200B
Inference latency on an L4 GPU: <500 ms

Background

At DataFramer, we believe the bottleneck for the next generation of AI models isn’t compute — it’s data quality. The story of AIMon Labs’ HDM-2 model is a concrete proof point. When their team set out to build an enterprise-grade hallucination detection model, they needed a data partner who could own the entire data lifecycle — from generating training examples to building evaluation sets to designing the benchmark itself. That’s where we came in.

The result — HDM-2-3B, now open-sourced on HuggingFace — outperformed GPT-4o and GPT-4o-mini on hallucination detection tasks, and did so at a fraction of the compute cost. This article tells the story from our vantage point: what we built, why it mattered, and what it made possible.

The Problem: Hallucination Remains Unsolved at Enterprise Scale

Despite years of research, hallucination in large language models remains one of the most persistent and costly failure modes in production AI. Even the latest frontier models from OpenAI, Google, and Anthropic self-report hallucination rates approaching 20% in certain evaluation settings.

The standard industry response has been to use a large general-purpose model — typically GPT-4o — as an AI judge to evaluate the outputs of other models. This works, but it’s expensive, slow (often several seconds per query), inconsistent across prompt variations, and introduces a circular dependency on the very models you’re trying to validate.

The core tension: Enterprises need hallucination detection that runs in real-time, costs pennies per call, and doesn’t rely on the same models it’s trying to evaluate. A specialized, lightweight model trained on high-quality domain-specific data is the natural answer — but only if the data pipeline behind it is robust enough at every stage: training, validation, and evaluation.

AIMon Labs understood this clearly. Building HDM-2 wasn’t just a modeling challenge — it was fundamentally a data challenge. They needed a partner who could generate the right training data, build the validation sets to iterate against, and design a benchmark rigorous enough to prove the model worked. That full-stack data problem is what DataFramer was built to solve.

Our Contribution: The Full Data Pipeline

DataFramer’s involvement with AIMon Labs spanned every stage of the data lifecycle — training, evaluation, validation, and benchmarking. This wasn’t a narrow dataset contribution; it was the data foundation that made HDM-2 possible.

Training Data

The model needed to learn what hallucinations look like across a wide range of enterprise contexts — not just clean academic examples, but the messy, subtle deviations that appear in real production RAG pipelines. DataFramer generated domain-specific synthetic training data covering Finance, Healthcare, Legal, and Insurance scenarios, with phrase-level hallucination annotations that gave the model the fine-grained signal it needed to learn detection at the token level.
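To make "phrase-level annotation" concrete, here is a minimal sketch of what such a training record could look like. The field names and the example itself are illustrative assumptions, not the actual HDM-2 schema:

```python
# Hypothetical phrase-level annotated training record (illustrative
# field names, not the actual HDM-2 data schema).
record = {
    "context": "Q3 revenue was $4.2M, up 8% year over year.",
    "response": "Q3 revenue was $4.5M, up 8% year over year.",
    "hallucinated_spans": [
        # Character offsets into "response"; [15:20] covers "$4.5M".
        {"start": 15, "end": 20, "label": "context_hallucination"},
    ],
}

def extract_spans(record: dict) -> list[str]:
    """Return the hallucinated substrings referenced by character offsets."""
    return [record["response"][s["start"]:s["end"]]
            for s in record["hallucinated_spans"]]

print(extract_spans(record))  # ['$4.5M']
```

Character offsets rather than sentence flags are what allow a detector to localize exactly which claim deviates from the grounding context.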

Evaluation & Validation Sets

Iterating toward a production-grade model requires held-out evaluation sets that are genuinely independent from training data, and validation pipelines that expose specific failure modes rather than just tracking aggregate metrics. DataFramer built these evaluation and validation sets in parallel with the training data — ensuring the AIMon team had a reliable feedback loop at every stage of development.

HDM-Bench: The Public Benchmark

The public-facing output of this collaboration is HDM-Bench — an open-source benchmark dataset hosted under the DataFramer HuggingFace organization and central to the HDM-2 research paper. HDM-Bench is not a standard true/false factual recall dataset. It is a phrase-level, multi-domain hallucination benchmark built from the ground up for the way hallucinations actually appear in enterprise RAG pipelines: not as obvious fabrications, but as subtle deviations from grounding context — a wrong number here, an unsupported claim there, an enterprise-specific assertion that can’t be verified against public knowledge.

What makes HDM-Bench different:

1. Domain Coverage — Samples span Finance, Healthcare, Legal, and Insurance — the highest-stakes domains where hallucination has real business and regulatory consequences.

2. Phrase-Level Annotation — Every hallucinated span is annotated at the character level — not just flagged at the sentence or document level — enabling token-level model training and evaluation.

3. Taxonomy-Aligned Labels — Labels align with HDM-2’s novel response taxonomy: context-based hallucinations, common knowledge violations, and innocuous statements are each tagged distinctly.

4. Two-Pass Human Review — Every example went through a stacked two-reviewer process — first pass annotation, second pass quality check — to maximize label reliability and minimize noise.
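The three-way taxonomy above can be sketched as a simple label scheme. This is a paraphrase of the categories described in this article, not the paper's exact label names:

```python
from enum import Enum

# Illustrative encoding of HDM-2's three-way response taxonomy
# (names paraphrased from this article, not the paper's exact labels).
class SpanLabel(Enum):
    CONTEXT_HALLUCINATION = "contradicts or is unsupported by the grounding context"
    COMMON_KNOWLEDGE_VIOLATION = "contradicts widely verifiable public facts"
    INNOCUOUS = "stylistic or trivially true; not a hallucination"

def is_hallucination(label: SpanLabel) -> bool:
    """Only the first two categories count against the evaluated model."""
    return label is not SpanLabel.INNOCUOUS

print(is_hallucination(SpanLabel.INNOCUOUS))  # False
```

Tagging innocuous statements explicitly, rather than leaving them unlabeled, keeps a detector from being penalized for benign filler text.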

The result is 1,320 carefully curated examples across two distinct data splits — a 1,120-row synthetic split generated by DataFramer, and a 199-row mr split. For a specialized evaluation benchmark designed to stress-test a detection model’s precision and recall in the most difficult edge cases, quality and diversity matter far more than volume.

What HDM-2 Achieved

Trained on DataFramer’s data and evaluated against HDM-Bench, HDM-2 set new performance standards across every dataset it was tested on. Here are the headline results:

Hallucination Detection F1 Scores Across Benchmarks

| Model | RAGTruth F1 | TruthfulQA F1 | HDM-Bench F1 |
|---|---|---|---|
| GPT-4o (as judge) | 63.4 | ~72 | ~68 |
| GPT-4o-mini (as judge) | ~58 | ~67 | ~62 |
| LLaMA-2-13B (fine-tuned) | 78.7 | — | — |
| HDM-2-3B (DataFramer data) | 85.0 | 83.7 | 73.6 |

The RAGTruth result is particularly striking: HDM-2 achieves an F1 of 85.0 against GPT-4o’s 63.4, a gap of more than 20 points, from a model estimated to be roughly 60–70x smaller by parameter count. On TruthfulQA, HDM-2 outperforms GPT-4o-mini on every reported metric: precision, recall, and F1.

The key failure mode that HDM-Bench was specifically designed to expose is low recall. A model that flags nothing avoids false positives entirely but scores 0% recall, which makes it useless in practice. GPT-4o and GPT-4o-mini both show weaker recall, meaning they systematically miss real hallucinations. HDM-2, trained and iteratively validated against DataFramer’s phrase-level annotations, closes this gap significantly.
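The precision/recall trade-off behind this point is just the F1 harmonic mean. The numbers below are illustrative, not figures from the paper; they show how a conservative judge with high precision but low recall still scores a mediocre F1:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A conservative judge that rarely flags anything: high precision, low recall.
conservative = f1(precision=0.90, recall=0.40)   # ≈ 0.554
# A balanced detector: moderately high on both axes.
balanced = f1(precision=0.85, recall=0.85)       # ≈ 0.850

print(round(conservative, 3), round(balanced, 3))
```

Because F1 is a harmonic rather than arithmetic mean, it punishes the weaker of the two axes, which is why recall gaps dominate judge-model scores on benchmarks like this one.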

Why the Full Data Stack Matters

The success of HDM-2 is a case study in what becomes possible when model architecture and data pipeline are co-designed end to end. The HDM-2 team built a novel multi-task architecture with separate context-grounding and common-knowledge verification modules. But that architecture can only be trained well, evaluated honestly, and optimized reliably if the data at every stage — training, validation, evaluation, and benchmark — provides the right granularity of signal.

A training dataset that lacks domain diversity produces a brittle model. A validation set that isn’t independent of training produces false confidence. A benchmark that only labels entire responses as “hallucinated” or “not hallucinated” tells you nothing about where detection logic breaks down. DataFramer solved all three simultaneously — which is why the results look the way they do.

The data flywheel: High-quality training data builds a capable model → rigorous validation exposes failure modes → targeted fixes improve performance → a credible benchmark proves it works → and the cycle repeats. DataFramer is built to power every stage of this flywheel.

What This Means for Enterprise AI Teams

HDM-2 is now open-sourced on HuggingFace under a CC BY-NC-SA license, and HDM-Bench is publicly available for any team to use as an evaluation baseline. For enterprise AI teams building RAG pipelines, the practical implications are significant:

Real-time guardrails become economically viable. At sub-500ms inference on a single L4 GPU, HDM-2 can be deployed inline — flagging hallucinations before responses reach end users, not as a post-hoc audit.
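An inline guardrail of this kind can be sketched in a few lines. Here `detect_hallucinations` is a hypothetical stand-in for an HDM-2 inference call, stubbed with a trivial string check; the actual model interface may differ:

```python
def detect_hallucinations(context: str, response: str) -> list[str]:
    """Hypothetical stand-in for an HDM-2 inference call.
    Returns flagged spans; empty if the response looks grounded.
    Stubbed here with a trivial string check for illustration."""
    return ["$4.5M"] if "$4.5M" in response and "$4.5M" not in context else []

def guarded_answer(context: str, response: str) -> str:
    """Inline guardrail: intercept the response before it reaches the user."""
    flagged = detect_hallucinations(context, response)
    if flagged:
        return f"[response withheld: unsupported claims {flagged}]"
    return response

print(guarded_answer("Q3 revenue was $4.2M.", "Q3 revenue was $4.5M."))
```

With sub-500ms inference the check fits inside an interactive request budget, which is what makes the inline (rather than post-hoc audit) placement viable.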

You no longer need GPT-4o to judge GPT-4o. A specialized 3B model trained on purpose-built data can outperform a ~200B generalist on this specific task, at a fraction of the API cost and latency.

The data pipeline is the competitive moat. The teams that invest in rigorous, domain-specific training data, validation sets, and evaluation benchmarks will build better models faster than those relying on public datasets alone. HDM-2 demonstrates what that looks like in practice.

At DataFramer, we partner with AI teams who understand that the path to a better model runs through better data — at every stage. Training data, evaluation sets, validation pipelines, and benchmarks are not separate concerns; they are a single, interconnected system that determines what your model can and cannot do.

The HDM-2 story is one we’re proud to have built from the ground up. If your team is facing a similar challenge — whether you’re fine-tuning, evaluating, or trying to prove your model works in production — we’d like to talk.

HDM-Bench dataset available on HuggingFace · HDM-2 model on HuggingFace · Research paper: arXiv 2504.07069

Get started

Ready to build better AI with better data?

The benchmark data is not just an evaluation artifact — it shapes what the model learns to care about. If your benchmark is shallow, your model will be too.