

HDM-Bench: How a 3B Model Outperformed GPT-4o at Hallucination Detection

DataFramer Team

Tue Apr 15

DataFramer built the full data foundation behind AIMon Labs’ HDM-2 — training data, evaluation sets, validation pipelines, and the HDM-Bench benchmark — powering an open-source hallucination detection model that beat GPT-4o and GPT-4o-mini across every major benchmark it was tested on.

F1 score on RAGTruth: 85.0
Model parameters: 3B
GPT-4o (est. parameters): ~200B
Inference latency on an L4 GPU: <500 ms

Background

At DataFramer, we believe the bottleneck for the next generation of AI models isn’t compute — it’s data quality. The story of AIMon Labs’ HDM-2 model is a concrete proof point. When their team set out to build an enterprise-grade hallucination detection model, they needed a data partner who could own the entire data lifecycle — from generating training examples to building evaluation sets to designing the benchmark itself. That’s where we came in.

The result — HDM-2-3B, now open-sourced on HuggingFace — outperformed GPT-4o and GPT-4o-mini on hallucination detection tasks, and did so at a fraction of the compute cost. This article tells the story from our vantage point: what we built, why it mattered, and what it made possible.

The Problem: Hallucination Remains Unsolved at Enterprise Scale

Despite years of research, hallucination in large language models remains one of the most persistent and costly failure modes in production AI. Even the latest frontier models from OpenAI, Google, and Anthropic self-report hallucination rates approaching 20% in certain evaluation settings.

The standard industry response has been to use a large general-purpose model — typically GPT-4o — as an AI judge to evaluate the outputs of other models. This works, but it’s expensive, slow (often several seconds per query), inconsistent across prompt variations, and introduces a circular dependency on the very models you’re trying to validate.

The core tension: Enterprises need hallucination detection that runs in real-time, costs pennies per call, and doesn’t rely on the same models it’s trying to evaluate. A specialized, lightweight model trained on high-quality domain-specific data is the natural answer — but only if the data pipeline behind it is robust enough at every stage: training, validation, and evaluation.

AIMon Labs understood this clearly. Building HDM-2 wasn’t just a modeling challenge — it was fundamentally a data challenge. They needed a partner who could generate the right training data, build the validation sets to iterate against, and design a benchmark rigorous enough to prove the model worked. That full-stack data problem is what DataFramer was built to solve.

Our Contribution: The Full Data Pipeline

DataFramer’s involvement with AIMon Labs spanned every stage of the data lifecycle — training, evaluation, validation, and benchmarking. This wasn’t a narrow dataset contribution; it was the data foundation that made HDM-2 possible.

Training Data

The model needed to learn what hallucinations look like across a wide range of enterprise contexts — not just clean academic examples, but the messy, subtle deviations that appear in real production RAG pipelines. DataFramer generated domain-specific synthetic training data covering Finance, Healthcare, Legal, and Insurance scenarios, with phrase-level hallucination annotations that gave the model the fine-grained signal it needed to learn detection at the token level.
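To make "phrase-level annotation" concrete, here is a minimal sketch of what such a training record could look like. The field names and the example itself are illustrative assumptions, not the actual HDM-2 schema:

```python
# Hypothetical phrase-level annotated training record (illustrative
# field names, not the actual HDM-2 data schema).
record = {
    "context": "Q3 revenue was $4.2M, up 8% year over year.",
    "response": "Q3 revenue was $4.5M, up 8% year over year.",
    "hallucinated_spans": [
        # Character offsets into "response"; [15:20] covers "$4.5M".
        {"start": 15, "end": 20, "label": "context_hallucination"},
    ],
}

def extract_spans(record: dict) -> list[str]:
    """Return the hallucinated substrings referenced by character offsets."""
    return [record["response"][s["start"]:s["end"]]
            for s in record["hallucinated_spans"]]

print(extract_spans(record))  # ['$4.5M']
```

Character offsets rather than sentence flags are what allow a detector to localize exactly which claim deviates from the grounding context.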

Evaluation & Validation Sets

Iterating toward a production-grade model requires held-out evaluation sets that are genuinely independent from training data, and validation pipelines that expose specific failure modes rather than just tracking aggregate metrics. DataFramer built these evaluation and validation sets in parallel with the training data — ensuring the AIMon team had a reliable feedback loop at every stage of development.

HDM-Bench: The Public Benchmark

The public-facing output of this collaboration is HDM-Bench — an open-source benchmark dataset hosted under the DataFramer HuggingFace organization and central to the HDM-2 research paper. HDM-Bench is not a standard true/false factual recall dataset. It is a phrase-level, multi-domain hallucination benchmark built from the ground up for the way hallucinations actually appear in enterprise RAG pipelines: not as obvious fabrications, but as subtle deviations from grounding context — a wrong number here, an unsupported claim there, an enterprise-specific assertion that can’t be verified against public knowledge.

What makes HDM-Bench different:

1. Domain Coverage — Samples span Finance, Healthcare, Legal, and Insurance — the highest-stakes domains where hallucination has real business and regulatory consequences.

2. Phrase-Level Annotation — Every hallucinated span is annotated at the character level — not just flagged at the sentence or document level — enabling token-level model training and evaluation.

3. Taxonomy-Aligned Labels — Labels align with HDM-2’s novel response taxonomy: context-based hallucinations, common knowledge violations, and innocuous statements are each tagged distinctly.

4. Two-Pass Human Review — Every example went through a stacked two-reviewer process — first pass annotation, second pass quality check — to maximize label reliability and minimize noise.
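The three-way taxonomy above can be sketched as a simple label scheme. This is a paraphrase of the categories described in this article, not the paper's exact label names:

```python
from enum import Enum

# Illustrative encoding of HDM-2's three-way response taxonomy
# (names paraphrased from this article, not the paper's exact labels).
class SpanLabel(Enum):
    CONTEXT_HALLUCINATION = "contradicts or is unsupported by the grounding context"
    COMMON_KNOWLEDGE_VIOLATION = "contradicts widely verifiable public facts"
    INNOCUOUS = "stylistic or trivially true; not a hallucination"

def is_hallucination(label: SpanLabel) -> bool:
    """Only the first two categories count against the evaluated model."""
    return label is not SpanLabel.INNOCUOUS

print(is_hallucination(SpanLabel.INNOCUOUS))  # False
```

Tagging innocuous statements explicitly, rather than leaving them unlabeled, keeps a detector from being penalized for benign filler text.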

The result is 1,320 carefully curated examples across two distinct data splits — a 1,120-row synthetic split generated by DataFramer, and a 199-row mr split. For a specialized evaluation benchmark designed to stress-test a detection model’s precision and recall in the most difficult edge cases, quality and diversity matter far more than volume.

What HDM-2 Achieved

Trained on DataFramer’s data and evaluated against HDM-Bench, HDM-2 set new performance standards across every dataset it was tested on. Here are the headline results:

Hallucination Detection F1 Scores Across Benchmarks

| Model | RAGTruth F1 | TruthfulQA F1 | HDM-Bench F1 |
|---|---|---|---|
| GPT-4o (as judge) | 63.4 | ~72 | ~68 |
| GPT-4o-mini (as judge) | ~58 | ~67 | ~62 |
| LLaMA-2-13B (fine-tuned) | 78.7 | — | — |
| HDM-2-3B (DataFramer data) | 85.0 | 83.7 | 73.6 |

The RAGTruth result is particularly striking: HDM-2 achieves an F1 of 85.0 against GPT-4o’s 63.4, a gap of more than 20 points, from a model estimated to be roughly 60–70x smaller by parameter count. On TruthfulQA, HDM-2 outperforms GPT-4o-mini on every reported metric: precision, recall, and F1.

The key failure mode that HDM-Bench was specifically designed to expose is low recall. A model that flags nothing avoids false positives entirely but scores 0% recall, which makes it useless in practice. GPT-4o and GPT-4o-mini both show weaker recall, meaning they systematically miss real hallucinations. HDM-2, trained and iteratively validated against DataFramer’s phrase-level annotations, closes this gap significantly.
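The precision/recall trade-off behind this point is just the F1 harmonic mean. The numbers below are illustrative, not figures from the paper; they show how a conservative judge with high precision but low recall still scores a mediocre F1:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A conservative judge that rarely flags anything: high precision, low recall.
conservative = f1(precision=0.90, recall=0.40)   # ≈ 0.554
# A balanced detector: moderately high on both axes.
balanced = f1(precision=0.85, recall=0.85)       # ≈ 0.850

print(round(conservative, 3), round(balanced, 3))
```

Because F1 is a harmonic rather than arithmetic mean, it punishes the weaker of the two axes, which is why recall gaps dominate judge-model scores on benchmarks like this one.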

Why the Full Data Stack Matters

The success of HDM-2 is a case study in what becomes possible when model architecture and data pipeline are co-designed end to end. The HDM-2 team built a novel multi-task architecture with separate context-grounding and common-knowledge verification modules. But that architecture can only be trained well, evaluated honestly, and optimized reliably if the data at every stage — training, validation, evaluation, and benchmark — provides the right granularity of signal.

A training dataset that lacks domain diversity produces a brittle model. A validation set that isn’t independent of training produces false confidence. A benchmark that only labels entire responses as “hallucinated” or “not hallucinated” tells you nothing about where detection logic breaks down. DataFramer solved all three simultaneously — which is why the results look the way they do.

The data flywheel: High-quality training data builds a capable model → rigorous validation exposes failure modes → targeted fixes improve performance → a credible benchmark proves it works → and the cycle repeats. DataFramer is built to power every stage of this flywheel.

What This Means for Enterprise AI Teams

HDM-2 is now open-sourced on HuggingFace under a CC BY-NC-SA license, and HDM-Bench is publicly available for any team to use as an evaluation baseline. For enterprise AI teams building RAG pipelines, the practical implications are significant:

Real-time guardrails become economically viable. At sub-500ms inference on a single L4 GPU, HDM-2 can be deployed inline — flagging hallucinations before responses reach end users, not as a post-hoc audit.
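An inline guardrail of this kind can be sketched in a few lines. Here `detect_hallucinations` is a hypothetical stand-in for an HDM-2 inference call, stubbed with a trivial string check; the actual model interface may differ:

```python
def detect_hallucinations(context: str, response: str) -> list[str]:
    """Hypothetical stand-in for an HDM-2 inference call.
    Returns flagged spans; empty if the response looks grounded.
    Stubbed here with a trivial string check for illustration."""
    return ["$4.5M"] if "$4.5M" in response and "$4.5M" not in context else []

def guarded_answer(context: str, response: str) -> str:
    """Inline guardrail: intercept the response before it reaches the user."""
    flagged = detect_hallucinations(context, response)
    if flagged:
        return f"[response withheld: unsupported claims {flagged}]"
    return response

print(guarded_answer("Q3 revenue was $4.2M.", "Q3 revenue was $4.5M."))
```

With sub-500ms inference the check fits inside an interactive request budget, which is what makes the inline (rather than post-hoc audit) placement viable.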

You no longer need GPT-4o to judge GPT-4o. A specialized 3B model trained on purpose-built data can outperform a ~200B generalist on this specific task, at a fraction of the API cost and latency.

The data pipeline is the competitive moat. The teams that invest in rigorous, domain-specific training data, validation sets, and evaluation benchmarks will build better models faster than those relying on public datasets alone. HDM-2 demonstrates what that looks like in practice.

At DataFramer, we partner with AI teams who understand that the path to a better model runs through better data — at every stage. Training data, evaluation sets, validation pipelines, and benchmarks are not separate concerns; they are a single, interconnected system that determines what your model can and cannot do.

The HDM-2 story is one we’re proud to have built from the ground up. If your team is facing a similar challenge — whether you’re fine-tuning, evaluating, or trying to prove your model works in production — we’d like to talk.

HDM-Bench dataset available on HuggingFace · HDM-2 model on HuggingFace · Research paper: arXiv 2504.07069

Get started

Ready to build better AI with better data?

The benchmark data is not just an evaluation artifact — it shapes what the model learns to care about. If your benchmark is shallow, your model will be too.