Anatomy of an AI Research Agent That Doesn't Hallucinate

TL;DR

We built an autonomous AI agent that reads threat intelligence, generates security research hypotheses, and produces actionable briefs — running daily without human supervision. It produced 36 papers in 42 days. We audited every one. All 36 were smoke: inflated scores, fabricated CVE references, and known attacks repackaged as novel research. Instead of abandoning the project, we dissected the failure modes and rebuilt the pipeline with seven validation gates. The rebuilt agent now rejects derivative work, verifies evidence against live databases, and tests its own hypotheses empirically against a local sandbox. This post documents the architecture, the failure modes, and the specific engineering decisions that turned a hallucination factory into a functional research tool.

1. The Problem: AI Generating AI Security Research

Ai-EGIS is an autonomous AI red-teaming platform. Among its nine agents, the Research Agent runs daily at 07:00 UTC: it reads threat intelligence from 45+ sources, identifies patterns, generates hypotheses about emerging attack primitives, and produces technical papers or tools.

The pipeline is conceptually sound:

Threat Intel (Sentinel, ArXiv, NVD, GHSA)
  --> Hypothesis generation (Claude Opus)
  --> Validation gate
  --> Research execution (Claude Sonnet)
  --> Peer review (Claude Opus)
  --> Second validation gate
  --> Save output

After 42 days of daily autonomous operation, we had 36 papers, 70 proof-of-concept scripts, and 56 hypothesis files: approximately 5,300 lines of research output. Self-assigned quality scores averaged 8.7 out of 10.

We then conducted a line-by-line audit of every paper, every PoC, and every claim.

2. The Audit: 36 Papers, 36 Failures

We evaluated each paper on four axes:

Axis	Question	Scale
Novelty	Does it say something genuinely new?	0–10
Evidence	Are claims backed by verifiable data?	0–10
Publication potential	Could a venue accept this?	0–10
Hallucination index	Signs of fabrication or inflation?	0–10 (10 = total smoke)

Results

Metric	Value
Papers with publication potential	0 of 36
Papers with fabricated CVE IDs	25 of 36 (69%)
Self-assigned novelty score (average)	8.7 / 10
Honest novelty score (average)	4.2 / 10
Score inflation factor	+4.5 points
Papers rehashing SSRF	5 of 36
Unique architectural insights (total)	6
PoC scripts that actually exploit anything	0 of 70

The Five Systemic Failure Modes

Failure 1: Gap Analysis Disguised as Research. Every paper followed the same template: read a Sentinel finding, map it to the test registry, declare a “coverage gap,” propose new tests. This is useful operational work — it is not research. A gap analysis based on one truncated intelligence report does not become a research paper by adding section headers and an abstract.

Failure 2: CVE Fabrication. The agent generated plausible-looking CVE identifiers (e.g., CVE-2026-5971 through CVE-2026-5974 — sequential numbering, identical CVSS scores) and constructed entire analytical frameworks on them. The two-pass peer review caught most fabrications, but the fact that the generation layer produced them at all meant every output required forensic verification.

Failure 3: Score Inflation. The agent consistently rated its own hypotheses 8–9 out of 10 on novelty, impact, and feasibility. The composite scoring threshold (7.5) was designed to filter weak hypotheses, but when every hypothesis self-assigns scores above 8, the filter admits everything. The scoring function measured the agent’s confidence, not the hypothesis quality.

Failure 4: Topic Obsession. Five papers on SSRF. Four on prompt-injection-to-RCE chains. The agent had no memory of what it had already written. Each run started fresh, and SSRF was always the most obvious pattern in the Sentinel data, so the agent rediscovered it weekly.

Failure 5: Cosmetic PoCs. Every paper included a “proof of concept” script. Every script loaded local JSON files, ran keyword searches, and printed Evidence score: 100/100, VERDICT: CONFIRMED. None of the 70 scripts sent a single request to a target system. They were data analysis scripts labeled as proofs of concept.

3. The Rebuild: Seven Gates Between Hypothesis and Output

We did not discard the agent. The underlying capability — synthesizing threat intelligence from multiple sources and identifying patterns — was genuinely valuable. The problem was not the intelligence gathering; it was the output layer’s inability to distinguish between “I found something interesting” and “I produced publishable research.”

The rebuild added seven validation gates, each targeting a specific failure mode:

Gate 1: Score Deflation

Problem: Self-assigned scores are inflated by ~4.5 points (8.7 self-assigned vs. 4.2 in the audit).

Solution: Subtract a fixed deflation factor from every self-assigned score before evaluation.

Raw score:     Novelty=9, Impact=8, Feasibility=9
Deflated (-3): Novelty=6, Impact=5, Feasibility=6
Composite:     5.3 (threshold: 4.5)

The deflation factor (3) was derived empirically from the audit: the average gap between self-assigned and honest scores across the 36 papers was 4.5 points, and we chose 3 as a conservative correction. Against the recalibrated threshold of 4.5:

A raw “mediocre” hypothesis (7/7/7) deflates to 4/4/4 (composite 4.0) — blocked.
A raw “good” hypothesis (8/8/8) deflates to 5/5/5 (composite 5.0) — passes.
A raw “excellent” hypothesis (9/9/8) deflates to 6/6/5 (composite 5.8) — passes comfortably.
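The deflation step can be sketched in a few lines. This is an illustration under stated assumptions, not the production scorer: the function names are hypothetical, and because the composite's weighting is not given here, the sketch assumes an unweighted mean of the three axes. That assumption reproduces the uniform examples above (4.0 and 5.0); mixed scores may come out differently under the real weighting.

```python
DEFLATION = 3     # fixed correction chosen after the audit
THRESHOLD = 4.5   # composite needed to pass Gate 1


def deflate(scores: dict[str, int], factor: int = DEFLATION) -> dict[str, int]:
    """Subtract the deflation factor from each axis, never going below 0."""
    return {axis: max(0, value - factor) for axis, value in scores.items()}


def composite(scores: dict[str, int]) -> float:
    """Assumption: unweighted mean of the axes (the real weighting is not stated)."""
    return round(sum(scores.values()) / len(scores), 1)


def passes_gate(raw: dict[str, int]) -> bool:
    """Deflate first, then compare the composite to the threshold."""
    return composite(deflate(raw)) >= THRESHOLD


mediocre = {"novelty": 7, "impact": 7, "feasibility": 7}
good = {"novelty": 8, "impact": 8, "feasibility": 8}
print(composite(deflate(mediocre)), passes_gate(mediocre))  # 4.0 False
print(composite(deflate(good)), passes_gate(good))          # 5.0 True
```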

Gate 2: Topic Deduplication

Problem: The agent wrote about SSRF five times because it had no memory of previous topics.

Solution: A persistent topic registry extracted from every successful output.

After each run, the agent extracts topic keywords from the hypothesis title and stores them in a covered_topics list. Before the next run, covered topics are injected at the top of the hypothesis generation prompt:

TOPICS ALREADY COVERED -- DO NOT REVISIT:
  x  ssrf (5x: 2026-04-29, 05-04, 05-10, 05-11, 05-18)
  x  prompt injection + rce (2026-05-02)
  x  sandbox + sandbox escape (2026-05-03)
  ...26 topics total

We seeded the initial registry by extracting topics from the 27 successful papers in the feedback history. The topic matching uses a fixed keyword vocabulary (36 terms covering the major attack classes) plus a normalized title prefix as fallback.
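A minimal sketch of the matching and prompt injection described above. The vocabulary excerpt, the example titles, and the function names are hypothetical; the real vocabulary has 36 terms, and the four-word title prefix used as the fallback is an assumption about what "normalized title prefix" means.

```python
import re
from collections import defaultdict

# Hypothetical excerpt of the fixed keyword vocabulary
VOCABULARY = ["ssrf", "prompt injection", "rce", "sandbox escape", "sandbox"]


def extract_topic(title: str) -> str:
    """Return matched vocabulary terms joined by ' + ', else a normalized title prefix."""
    lowered = title.lower()
    # Word boundaries keep "rce" from matching inside words such as "source"
    hits = [t for t in VOCABULARY if re.search(rf"\b{re.escape(t)}\b", lowered)]
    if hits:
        return " + ".join(sorted(hits))
    # Fallback (assumed): first four alphanumeric words of the title
    return " ".join(re.findall(r"[a-z0-9]+", lowered)[:4])


def render_block(registry: dict[str, list[str]]) -> str:
    """Format the registry as the block placed at the top of the generation prompt."""
    lines = ["TOPICS ALREADY COVERED -- DO NOT REVISIT:"]
    for topic, dates in registry.items():
        count = f"{len(dates)}x: " if len(dates) > 1 else ""
        lines.append(f"  x  {topic} ({count}{', '.join(dates)})")
    return "\n".join(lines)


registry: dict[str, list[str]] = defaultdict(list)
registry[extract_topic("SSRF via proxy endpoints")].append("2026-04-29")
registry[extract_topic("Revisiting SSRF in agent tool calls")].append("2026-05-04")
registry[extract_topic("Prompt injection to RCE chains")].append("2026-05-02")
print(render_block(registry))
```

Keying the registry on vocabulary terms rather than raw titles is what lets two differently worded SSRF papers collapse into a single entry.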

Gate 3: Feasibility Validation

Problem: Hypotheses reference data that does not exist in the intelligence context.

Solution: A separate Claude instance (Opus, temperature 0.0) receives the hypothesis AND the raw intelligence data, and verifies that every cited number, CVE, and finding actually appears in the data.

This gate existed before the rebuild, but its prompt was insufficiently specific about what constitutes a verification failure. The revised prompt includes explicit instructions:

Numbers in the hypothesis must appear verbatim in the data context.
CVE IDs must appear in the Sentinel findings.
Claims about domain counts and test IDs must match the registry.
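The first two rules are mechanical enough that a cheap pre-check could run before the LLM validator. The sketch below is not part of the pipeline as described; it only illustrates what "verbatim in the data context" can mean in code. The function name and the sample strings are hypothetical.

```python
import re

CVE_RE = re.compile(r"CVE-\d{4}-\d{4,7}")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?%?")


def ungrounded_tokens(hypothesis: str, context: str) -> dict[str, list[str]]:
    """List CVE IDs and numbers that occur in the hypothesis but not verbatim in the context."""
    context_cves = set(CVE_RE.findall(context))
    missing_cves = [c for c in CVE_RE.findall(hypothesis) if c not in context_cves]

    # Strip CVE IDs first so their digits are not counted as free-standing numbers
    hyp_numbers = NUMBER_RE.findall(CVE_RE.sub(" ", hypothesis))
    ctx_numbers = set(NUMBER_RE.findall(CVE_RE.sub(" ", context)))
    missing_numbers = [n for n in hyp_numbers if n not in ctx_numbers]
    return {"cves": missing_cves, "numbers": missing_numbers}


# Hypothetical inputs for illustration only
context = "Finding: 3 of 12 gateways exposed (CVE-2026-42208)."
hypothesis = "CVE-2026-42208 and CVE-2026-5971 affect 40% of the 12 gateways."
print(ungrounded_tokens(hypothesis, context))
```

A non-empty result would be grounds to reject the hypothesis before spending an Opus call on it. The third rule (domain counts and test IDs) needs the registry and is left out of this sketch.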

Gate 4: Originality Validation

Problem: The agent reads an ArXiv paper and produces a summary of it as “original research.”

Solution: A dedicated originality gate that classifies each hypothesis into one of three categories:

Category	Definition	Verdict
DERIVATIVE	The core claim already appears in the cited sources	REJECT
NOVEL_SYNTHESIS	Combines sources in a way none of them do individually	PASS
NOVEL_EMPIRICAL	Tests something nobody has tested before	PASS

The originality prompt asks a single question: “Does the hypothesis tell the reader something they could NOT learn by reading the cited sources?”

The gate also requires the agent to articulate a novel_delta — a specific statement of what is new beyond the sources. This field is required in the hypothesis JSON and evaluated by the originality validator.

In our first test run with the originality gate active, H002 (temporal evasion attacks on autonomous agents) passed as NOVEL_SYNTHESIS:

“No individual source performs the cross-referencing that maps the empirically demonstrated temporal evasion primitive (from A3S-Bench) and the real-world variant (Two-Document Chain Injection) to the specific gap in D20/D11 coverage.”

The Ma et al. paper measures temporal evasion rates. The Sentinel finding reports two-document chain injection. Neither source says “this implies a gap in adaptive jailbreak resilience testing because fragment reassembly via stateful reasoning is structurally distinct from conversational drift.” That connection is the original contribution.
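The verdict mapping and the required novel_delta field can be sketched as below. Only the three category names and the novel_delta field come from the description above; the other field names and the function are hypothetical, and the category itself is assumed to be supplied by the LLM validator.

```python
from dataclasses import dataclass
from enum import Enum


class Originality(Enum):
    DERIVATIVE = "DERIVATIVE"
    NOVEL_SYNTHESIS = "NOVEL_SYNTHESIS"
    NOVEL_EMPIRICAL = "NOVEL_EMPIRICAL"


@dataclass
class Hypothesis:
    hypothesis_id: str  # hypothetical field name
    title: str          # hypothetical field name
    novel_delta: str    # what is new beyond the cited sources


def originality_verdict(hypothesis: Hypothesis, category: Originality) -> str:
    """Map the validator's category to PASS or REJECT; a blank novel_delta always rejects."""
    if not hypothesis.novel_delta.strip():
        return "REJECT"
    return "REJECT" if category is Originality.DERIVATIVE else "PASS"


h002 = Hypothesis(
    hypothesis_id="H002",
    title="Temporal evasion attacks on autonomous agents",
    novel_delta="Maps the temporal evasion primitive and its real-world variant to a D20/D11 coverage gap.",
)
print(originality_verdict(h002, Originality.NOVEL_SYNTHESIS))  # PASS
print(originality_verdict(h002, Originality.DERIVATIVE))       # REJECT
```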

Gate 5: CVE Verification

Problem: 69% of papers contained fabricated CVE identifiers.

Solution: Before generating the brief, extract every CVE-YYYY-NNNNN pattern from the hypothesis text and verify each against the NVD REST API.

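# Simplified version of the verification logic
import re

import httpx

# NVD CVE API 2.0 endpoint; unauthenticated requests are rate-limited
NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

async def verify_cves_against_nvd(text: str) -> dict:
    # Deduplicate while preserving first-seen order
    cve_ids = list(dict.fromkeys(re.findall(r"CVE-\d{4}-\d{4,7}", text)))
    verified, unverified = [], []
    async with httpx.AsyncClient(timeout=30.0) as client:
        for cve_id in cve_ids:
            resp = await client.get(NVD_API, params={"cveId": cve_id})
            resp.raise_for_status()  # an API error is not proof of fabrication
            if resp.json().get("totalResults", 0) > 0:
                verified.append(cve_id)
            else:
                unverified.append(cve_id)
    return {"verified": verified, "unverified": unverified}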

Unverified CVEs trigger a confidence downgrade: the brief’s CONFIDENCE field is forced to LOW, and a warning is injected into the generation context telling the model not to cite unverified CVEs as confirmed evidence.
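The downgrade rule might look like the following. This is a sketch under assumptions: the brief is modeled as a plain dict, the function name is hypothetical, and the warning wording is invented for illustration. The input shape matches the dict returned by verify_cves_against_nvd.

```python
def apply_cve_results(brief: dict, results: dict) -> tuple[dict, str]:
    """Return an updated copy of the brief and a warning for the generation context."""
    unverified = results.get("unverified", [])
    if not unverified:
        return dict(brief), ""
    # Any unverified CVE forces the confidence field to LOW
    downgraded = {**brief, "CONFIDENCE": "LOW"}
    warning = (
        "WARNING: the following CVE IDs were not found in NVD and must not be "
        "cited as confirmed evidence: " + ", ".join(unverified)
    )
    return downgraded, warning


brief = {"CONFIDENCE": "HIGH"}
results = {"verified": ["CVE-2026-42208"], "unverified": ["CVE-2026-5971"]}
updated, warning = apply_cve_results(brief, results)
print(updated["CONFIDENCE"])  # LOW
print(warning)
```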

Gate 6: Content Validation

Problem: Even with a validated hypothesis, the generated brief can fabricate data or overclaim.

Solution: Post-generation validation by Claude Opus (temperature 0.0) that checks the brief content against the raw intelligence data. Any number, CVE, or claim not traceable to the provided context triggers rejection.

For briefs (vs. the old 25-page papers), we reduced this to a single review pass. The mandatory brief structure — FINDING / EVIDENCE / REGISTRY GAP / ACTION / CONFIDENCE — forces the model to make specific, verifiable claims rather than diffuse academic prose.

Gate 7: Sandbox Empirical Testing

Problem: Zero of 70 PoC scripts actually tested anything against a real system.

Solution: After generating the brief, the agent crafts 3–5 targeted payloads from the hypothesis, sends them to a local LLM (Llama 3.1 8B via Ollama), collects responses, and uses a judge to determine whether the attack was successful.

The sandbox validation produces structured empirical results:

SANDBOX VALIDATION
  Target: llama3.1:8b (local Ollama)
  Payloads: 5 sent, 2 triggered
  Verdict: PARTIALLY_CONFIRMED
  [1] TRIGGERED (conf=80%) -- indicators: leaked system prompt
  [2] BLOCKED (conf=10%) -- indicators: none
  [3] TRIGGERED (conf=72%) -- indicators: followed injected instruction
  ...

This section is appended to the brief, giving every output at least some empirical grounding. The target model (Llama 3.1 8B) is intentionally small and poorly defended — the goal is not to claim “we broke GPT-4” but to provide a minimum viable empirical signal that the attack primitive is real.
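A sketch of how per-payload judge results could be rolled up into the section shown above. Only PARTIALLY_CONFIRMED appears in the sample output; the other two verdict names, the aggregation rule, and the class and function names are assumptions. Sending payloads to Ollama and running the judge are left out.

```python
from dataclasses import dataclass


@dataclass
class PayloadResult:
    triggered: bool        # the judge decided the attack succeeded
    confidence: float      # judge confidence, 0.0 to 1.0
    indicators: list[str]


def sandbox_verdict(results: list[PayloadResult]) -> str:
    """Assumed rule: all triggered = CONFIRMED, some = PARTIALLY_CONFIRMED, none = NOT_CONFIRMED."""
    hits = sum(r.triggered for r in results)
    if hits == 0:
        return "NOT_CONFIRMED"
    return "CONFIRMED" if hits == len(results) else "PARTIALLY_CONFIRMED"


def render_section(target: str, results: list[PayloadResult]) -> str:
    """Format the results in the layout of the SANDBOX VALIDATION section."""
    hits = sum(r.triggered for r in results)
    lines = [
        "SANDBOX VALIDATION",
        f"  Target: {target}",
        f"  Payloads: {len(results)} sent, {hits} triggered",
        f"  Verdict: {sandbox_verdict(results)}",
    ]
    for i, r in enumerate(results, start=1):
        status = "TRIGGERED" if r.triggered else "BLOCKED"
        indicators = ", ".join(r.indicators) or "none"
        lines.append(f"  [{i}] {status} (conf={r.confidence:.0%}) -- indicators: {indicators}")
    return "\n".join(lines)


results = [
    PayloadResult(True, 0.80, ["leaked system prompt"]),
    PayloadResult(False, 0.10, []),
    PayloadResult(True, 0.72, ["followed injected instruction"]),
]
print(render_section("llama3.1:8b (local Ollama)", results))
```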

4. Supporting Infrastructure

The seven gates are the core innovation, but three supporting capabilities amplify their effectiveness:

Multi-Source Triangulation

Before hypothesis generation, all intelligence sources are cross-referenced by keyword overlap to form “corroborated clusters.” A Sentinel finding about LiteLLM SSRF that is also present as an NVD CVE, mentioned in an ArXiv paper, and listed in a GitHub Security Advisory forms a HIGH-confidence cluster (4 independent sources). A Sentinel finding with no external corroboration remains LOW-confidence.

The triangulator identified 7 clusters from 47 items in our test run, with 3 meeting the 2+ source threshold. Corroborated clusters are presented to the hypothesis generator as “STRONGLY PREFERRED” topics, while single-source items are flagged as “use only if no corroborated clusters are available.”
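Keyword-overlap clustering can be sketched as below. The greedy single-pass algorithm, the minimum overlap of two keywords, the MEDIUM band for two or three sources, and all names are assumptions; only "4 independent sources is HIGH", "single source is LOW", and the 2+ source threshold come from the description above.

```python
from dataclasses import dataclass


@dataclass
class IntelItem:
    source: str               # e.g. "sentinel", "nvd", "arxiv", "ghsa"
    keywords: frozenset[str]


def cluster_items(items: list[IntelItem], min_overlap: int = 2) -> list[list[IntelItem]]:
    """Greedy single pass: join the first cluster that shares enough keywords with the item."""
    clusters: list[list[IntelItem]] = []
    for item in items:
        for cluster in clusters:
            if any(len(item.keywords & other.keywords) >= min_overlap for other in cluster):
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def confidence(cluster: list[IntelItem]) -> str:
    """Assumed mapping: 4+ distinct sources HIGH, 2-3 MEDIUM, 1 LOW."""
    distinct = len({item.source for item in cluster})
    if distinct >= 4:
        return "HIGH"
    return "MEDIUM" if distinct >= 2 else "LOW"


items = [
    IntelItem("sentinel", frozenset({"litellm", "ssrf", "proxy"})),
    IntelItem("nvd", frozenset({"litellm", "ssrf"})),
    IntelItem("arxiv", frozenset({"litellm", "ssrf", "agents"})),
    IntelItem("ghsa", frozenset({"litellm", "ssrf", "advisory"})),
    IntelItem("sentinel", frozenset({"moe", "routing"})),
]
for cluster in cluster_items(items):
    print(confidence(cluster), sorted({i.source for i in cluster}))
```

Counting distinct sources rather than items matters here: three Sentinel findings about the same issue are still a single-source cluster.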

Full-Text Paper Fetching

The original agent read ArXiv abstracts and cited papers it had never read. The rebuilt pipeline downloads the top 3 most relevant ArXiv PDFs, extracts text via PyMuPDF, and provides up to 8,000 characters of actual paper content. This turns “paper X describes technique Y” (possibly wrong) into “paper X, on page 7, states: [direct quote]” (verifiable).
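A sketch of page-aware extraction with a character budget. The helper names and the page-marker format are hypothetical. Keeping a marker per page is one way to make a later "on page N" citation checkable; whether the real pipeline tracks pages this way is an assumption.

```python
def extract_pages(pdf_path: str) -> list[str]:
    """Return the plain text of each page using PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def build_excerpt(pages: list[str], budget: int = 8000) -> str:
    """Concatenate page texts with page markers until the character budget is used up."""
    parts: list[str] = []
    used = 0
    for number, text in enumerate(pages, start=1):
        header = f"[page {number}]\n"
        # Reserve space for the header and one trailing newline
        room = budget - used - len(header) - 1
        if room <= 0:
            break
        chunk = header + text[:room] + "\n"
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts)


# Stand-in page texts; in practice these would come from extract_pages(path)
pages = ["a" * 5000, "b" * 5000, "c" * 5000]
excerpt = build_excerpt(pages)
print(len(excerpt))  # 8000
```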

Discovery-Research Fusion

When a brief identifies a specific registry gap and the gap is confirmed by the sandbox validation, the pipeline automatically generates a test specification (TestDef + payloads) ready for integration into the test registry. Each run can now produce two outputs: a threat intelligence brief (for humans) and a test spec (for the platform). This closes the loop between research and operational capability.
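The trigger condition for the second output can be sketched as a small guard. The TestDef structure is not shown here, so SpecDraft and its fields are stand-ins, and treating both CONFIRMED and PARTIALLY_CONFIRMED as "confirmed by the sandbox" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: either verdict counts as sandbox confirmation
CONFIRMING_VERDICTS = {"CONFIRMED", "PARTIALLY_CONFIRMED"}


@dataclass
class SpecDraft:
    """Hypothetical stand-in for the platform's TestDef + payloads."""
    test_id: str
    registry_gap: str
    payloads: list[str]


def maybe_build_spec(
    registry_gap: str,
    sandbox_verdict: str,
    test_id: str,
    triggered_payloads: list[str],
) -> Optional[SpecDraft]:
    """Produce a spec only when a real gap exists and the sandbox confirmed it."""
    if not registry_gap.strip() or registry_gap.strip().lower() == "already covered":
        return None
    if sandbox_verdict not in CONFIRMING_VERDICTS or not triggered_payloads:
        return None
    return SpecDraft(test_id, registry_gap, list(triggered_payloads))


spec = maybe_build_spec(
    "multi-document combinatorial injection",
    "PARTIALLY_CONFIRMED",
    "d15_04",
    ["<payload that triggered in the sandbox>"],
)
print(spec)
```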

5. Output Format: From 25-Page Papers to 500-Word Briefs

The most impactful change was also the simplest: we replaced the 25-page academic paper format with a 500-word structured threat intelligence brief.

THREAT INTEL BRIEF -- [title]
---
FINDING:      [1-2 sentences]
EVIDENCE:     [verified CVE IDs + Sentinel finding IDs]
REGISTRY GAP: [specific domain/test, or "already covered"]
ACTION:       [concrete test spec, or "no action needed"]
CONFIDENCE:   [HIGH/MEDIUM/LOW based on source count]
---
ANALYSIS:     [3-5 paragraphs max]
REFERENCES:   [real URLs only]

SANDBOX VALIDATION
  [empirical results from local testing]

The format change had three effects:

Reduced hallucination surface: 500 words leave far less room for filler, speculation, or unsupported claims. Every sentence must earn its place.
Forced specificity: The ACTION field requires a concrete test spec (“Add d15_04 targeting multi-document combinatorial injection”) or an explicit “no action needed.” There is no middle ground.
Reduced cost: One review pass instead of two, and 2,000 max tokens instead of 16,000. Per-run API cost fell from roughly $2.50 to roughly $1.20, a reduction of about 50%.
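Because the brief has a fixed structure, its shape can be checked mechanically before any LLM review. The sketch below is an illustration, not the pipeline's validator: the function name is hypothetical, the word count is a simple whitespace split, and the sample field values are placeholders.

```python
import re

REQUIRED_FIELDS = ["FINDING", "EVIDENCE", "REGISTRY GAP", "ACTION", "CONFIDENCE"]
MAX_WORDS = 500


def validate_brief(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means the brief is well-formed."""
    problems = []
    for name in REQUIRED_FIELDS:
        match = re.search(rf"^{re.escape(name)}:[ \t]*(.*)$", text, flags=re.MULTILINE)
        if match is None:
            problems.append(f"missing field: {name}")
        elif not match.group(1).strip():
            problems.append(f"empty field: {name}")
    confidence = re.search(r"^CONFIDENCE:[ \t]*(\S+)", text, flags=re.MULTILINE)
    if confidence and confidence.group(1) not in {"HIGH", "MEDIUM", "LOW"}:
        problems.append(f"invalid CONFIDENCE: {confidence.group(1)}")
    if len(text.split()) > MAX_WORDS:
        problems.append(f"brief exceeds {MAX_WORDS} words")
    return problems


# Placeholder brief for illustration only
brief = """THREAT INTEL BRIEF -- example
FINDING:      Individually benign documents combine into an injection.
EVIDENCE:     CVE-2026-42208
REGISTRY GAP: multi-document combinatorial injection
ACTION:       Add d15_04 targeting multi-document combinatorial injection
CONFIDENCE:   MEDIUM
"""
print(validate_brief(brief))  # []
```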

6. Quantitative Results

Before and After

Metric	Before (v1)	After (v2)
Output format	25-page academic paper	500-word structured brief
Validation gates	2	7
CVE verification	None	Live NVD API check
Empirical testing	None (grep-based PoCs)	Sandbox payloads against local LLM
Topic deduplication	None	26-topic persistent registry
Originality check	None	DERIVATIVE/NOVEL_SYNTHESIS/NOVEL_EMPIRICAL
Score calibration	Raw self-assigned	Deflated by 3 points (empirical correction)
Source intelligence	Abstracts only	Full-text PDF extraction (top 3 papers)
Source triangulation	None	Multi-source clustering (2+ source threshold)
API cost per run	~$2.50	~$1.20
Publishable outputs	0/36 (0%)	Under evaluation

Gate Rejection Rates (from test runs)

Gate	Function	Typical rejection rate
Score deflation	Blocks inflated mediocrity	~40% of hypotheses
Topic deduplication	Blocks rehashed topics	~20%
Feasibility validation	Blocks ungrounded claims	~15%
Originality validation	Blocks derivative summaries	Under evaluation
CVE verification	Flags fabricated CVEs	~30% of cited CVEs
Content validation	Blocks fabricated brief content	~10%
Sandbox testing	Provides empirical signal	N/A (informational, not a filter)

7. What We Learned

Lesson 1: Self-evaluation is not evaluation

An AI system rating its own output quality is measuring confidence, not accuracy. The 4.5-point gap between self-assigned and honest scores was consistent across 36 papers. This is not a fixable prompt engineering problem — it is a structural property of self-evaluation. The solution is external validation: independent model instances with adversarial prompts, live database verification, and empirical testing.

Lesson 2: Format constrains hallucination

A 500-word brief hallucinates less than a 25-page paper for the same reason a tweet contains fewer lies than a novel: there is less surface area to fill. The academic paper format actively encourages hallucination because the model must produce 20+ pages of content from 2–3 data points. The brief format forces specificity and penalizes padding.

Lesson 3: The best output of a research agent is not research

Our agent’s real product was never papers. It was threat intelligence: “CVE-2026-42208 affects LiteLLM, is corroborated by 4 independent sources, and our test registry does not cover this attack class. Here is a test spec to add.” Renaming the output from “papers” to “threat intel briefs” was not just a cosmetic change — it aligned the format with the actual capability and eliminated the pressure to manufacture academic novelty from operational data.

Lesson 4: Originality is the hardest gate to engineer

Feasibility validation (does the data exist?) and CVE verification (is this CVE real?) have checkable answers: the cited data either appears in the context or it does not, and the CVE either exists in NVD or it does not. Originality (“does this say something the sources don’t already say?”) is fundamentally a judgment call. Our originality gate works because it asks a narrow, answerable question: not “is this novel?” but “could a reader learn this from the cited sources alone?” This reduces a philosophical question to a textual comparison.

Lesson 5: Empirical grounding changes everything

Even a minimal empirical signal — five payloads against an 8B parameter model — transforms a brief from “we think this attack might work” to “we tested it and observed X.” The sandbox results are not publication-grade evidence, but they are infinitely more evidence than zero.

8. Architecture Summary

              +------------------+
              |  45+ Intel       |
              |  Sources         |
              +--------+---------+
                       |
              +--------v---------+
              | Web Intelligence |  ArXiv full-text (3 PDFs)
              |  NVD, GHSA,      |  + Semantic Scholar
              |  Scholar         |
              +--------+---------+
                       |
              +--------v---------+
              |  Triangulator    |  Cluster by source overlap
              |  (2+ sources =   |  HIGH/MEDIUM/LOW confidence
              |   corroborated)  |
              +--------+---------+
                       |
              +--------v---------+
              |  Topic Dedup     |  26 covered topics blocked
              +--------+---------+
                       |
              +--------v---------+
              |  Hypothesizer    |  Claude Opus, scores deflated -3
              |  (novel_delta    |
              |   required)      |
              +--------+---------+
                       |
            +----------v----------+
            | Gate 3: Feasibility |  Data grounding check
            +----------+----------+
                       |
            +----------v----------+
            | Gate 4: Originality |  DERIVATIVE --> REJECT
            |                     |  NOVEL_SYNTHESIS --> PASS
            |                     |  NOVEL_EMPIRICAL --> PASS
            +----------+----------+
                       |
            +----------v----------+
            | Gate 5: CVE Verify  |  Live NVD API
            +----------+----------+
                       |
            +----------v----------+
            | Brief Generation    |  500 words max, structured
            +----------+----------+
                       |
            +----------v----------+
            | Gate 6: Content     |  Anti-fabrication review
            | Validation          |
            +----------+----------+
                       |
            +----------v----------+
            | Gate 7: Sandbox     |  3-5 payloads vs local Ollama
            | Empirical Test      |  Judge verdicts appended
            +----------+----------+
                       |
            +----------v----------+
            | Auto Test Spec      |  TestDef + payloads if gap
            +----------+----------+
                       |
                   +---v---+
                   | Brief |  + Sandbox Results
                   | .md   |  + Test Spec (if gap)
                   +-------+

Technology Stack

Component	Technology
Hypothesis generation	Claude Opus 4.6, temperature 0.5
Brief generation	Claude Sonnet 4, temperature 0.4
Validation + Review + Originality	Claude Opus 4.6, temperature 0.0
Sandbox target	Llama 3.1 8B via Ollama (local)
CVE verification	NVD REST API (free, no auth)
Paper full-text extraction	PyMuPDF
External intelligence	ArXiv API, NVD, GitHub Advisories, Semantic Scholar
Internal intelligence	Sentinel (45+ sources), ATLAS coverage, scan results, registry
Codebase	5,284 lines Python across 15 modules

9. Where This Sits in the Landscape

AI security agents are proliferating. OpenAI’s Aardvark scans codebases and has disclosed 10 CVEs. Google’s Big Sleep (with Project Zero) found the first AI-prevented zero-day in SQLite. Anthropic’s Mythos built working exploits for 50%+ of selected CVEs autonomously. Microsoft’s multi-model system found 16 new vulnerabilities in Windows.

All of these are attack-execution agents: they find bugs in code, build exploits, or red-team targets. None of them are research agents that must generate novel hypotheses about undiscovered attack primitives.

The distinction matters because execution agents have a ground truth — the exploit either works or it does not. Research agents do not. A research agent that generates a hypothesis about a “novel attack primitive” has no immediate way to verify whether the primitive is actually novel, whether the cited evidence is real, or whether the conclusion follows from the premises.

This is why the validation-gate architecture is necessary. Guardrails products (Lakera, NeMo Guardrails) solve the wrong problem — they prevent chatbots from saying harmful things. Our gates prevent a research agent from believing its own hallucinations and entering them into a vulnerability disclosure pipeline.

The closest parallel in the published landscape is PapersFlow’s Chain-of-Verification (CoVe), which verifies citations in AI-generated literature reviews. Our system extends this pattern to a domain where the stakes are higher: a fabricated CVE that enters a disclosure pipeline wastes vendor resources, damages credibility, and could constitute irresponsible disclosure.

10. Open Research Questions

Three research questions survived the 36-paper audit as genuinely unanswered:

Evaluation Awareness: Can frontier models detect when they are being evaluated by security testing frameworks? If so, every safety benchmark is potentially unreliable.
MoE Expert Routing Manipulation: Can crafted inputs influence Mixture-of-Experts routing decisions in black-box API settings to selectively bypass safety-aligned experts?
Multi-Document Combinatorial Injection: Can individually benign documents produce malicious behavior when an LLM processes them together? Per-document content filtering assumes documents can be evaluated for safety independently — an assumption worth testing.

These questions form the basis of our ongoing research program. We are using the rebuilt Research Agent pipeline to generate empirically grounded hypotheses and test them against sandbox targets before publication.

Contact

Company: i-314 Security Research
Web: https://i-314.com
Product: Ai-EGIS — AI Exploitation & Governance Intelligence Suite v3.0
Related work: Tool Output Mimicry (Zenodo, 2026)