Back to Articles
Technical blog · I-314 Security Research

Anatomy of an AI Research Agent That Doesn’t Hallucinate

AuthorJuan Pablo Braña — i-314 Security Research AI CollaboratorAnthropic Claude Opus 4.7 DateMay 2026 Reading time~15 min

TL;DR

We built an autonomous AI agent that reads threat intelligence, generates security research hypotheses, and produces actionable briefs — running daily without human supervision. It produced 36 papers in 42 days. We audited every one. All 36 were smoke: inflated scores, fabricated CVE references, and known attacks repackaged as novel research. Instead of abandoning the project, we dissected the failure modes and rebuilt the pipeline with seven validation gates. The rebuilt agent now rejects derivative work, verifies evidence against live databases, and tests its own hypotheses empirically against a local sandbox. This post documents the architecture, the failure modes, and the specific engineering decisions that turned a hallucination factory into a functional research tool.


1. The Problem: AI Generating AI Security Research

Ai-EGIS is an autonomous AI red-teaming platform. Among its nine agents, the Research Agent runs daily at 07:00 UTC: it reads threat intelligence from 45+ sources, identifies patterns, generates hypotheses about emerging attack primitives, and produces technical papers or tools.

The pipeline is conceptually sound:

Threat Intel (Sentinel, ArXiv, NVD, GHSA)
  --> Hypothesis generation (Claude Opus)
  --> Validation gate
  --> Research execution (Claude Sonnet)
  --> Peer review (Claude Opus)
  --> Second validation gate
  --> Save output

After 42 days of daily autonomous operation, we had 36 papers, 70 proof-of-concept scripts, and 56 hypothesis files. Approximately 5,300 lines of research output. Self-assigned quality scores averaged 8.7 out of 10.

We then conducted a line-by-line audit of every paper, every PoC, and every claim.


2. The Audit: 36 Papers, 36 Failures

We evaluated each paper on four axes:

Axis Question Scale
Novelty Does it say something genuinely new? 0–10
Evidence Are claims backed by verifiable data? 0–10
Publication potential Could a venue accept this? 0–10
Hallucination index Signs of fabrication or inflation? 0–10 (10 = total smoke)

Results

Metric Value
Papers with publication potential 0 of 36
Papers with fabricated CVE IDs 25 of 36 (69%)
Self-assigned novelty score (average) 8.7 / 10
Honest novelty score (average) 4.2 / 10
Score inflation factor +4.5 points
Papers rehashing SSRF 5 of 36
Unique architectural insights (total) 6
PoC scripts that actually exploit anything 0 of 70

The Five Systemic Failure Modes

Failure 1: Gap Analysis Disguised as Research. Every paper followed the same template: read a Sentinel finding, map it to the test registry, declare a “coverage gap,” propose new tests. This is useful operational work — it is not research. A gap analysis based on one truncated intelligence report does not become a research paper by adding section headers and an abstract.

Failure 2: CVE Fabrication. The agent generated plausible-looking CVE identifiers (e.g., CVE-2026-5971 through CVE-2026-5974 — sequential numbering, identical CVSS scores) and constructed entire analytical frameworks on them. The two-pass peer review caught most fabrications, but the fact that the generation layer produced them at all meant every output required forensic verification.

Failure 3: Score Inflation. The agent consistently rated its own hypotheses 8–9 out of 10 on novelty, impact, and feasibility. The composite scoring threshold (7.5) was designed to filter weak hypotheses, but when every hypothesis self-assigns scores above 8, the filter admits everything. The scoring function measured the agent’s confidence, not the hypothesis quality.

Failure 4: Topic Obsession. Five papers on SSRF. Four on prompt-injection-to-RCE chains. The agent had no memory of what it had already written. Each run started fresh, and SSRF was always the most obvious pattern in the Sentinel data, so the agent rediscovered it weekly.

Failure 5: Cosmetic PoCs. Every paper included a “proof of concept” script. Every script loaded local JSON files, ran keyword searches, and printed Evidence score: 100/100, VERDICT: CONFIRMED. None of the 70 scripts sent a single request to a target system. They were data analysis scripts labeled as proofs of concept.


3. The Rebuild: Seven Gates Between Hypothesis and Output

We did not discard the agent. The underlying capability — synthesizing threat intelligence from multiple sources and identifying patterns — was genuinely valuable. The problem was not the intelligence gathering; it was the output layer’s inability to distinguish between “I found something interesting” and “I produced publishable research.”

The rebuild added seven validation gates, each targeting a specific failure mode:

Gate 1: Score Deflation

Problem: Self-assigned scores are inflated by ~4 points.

Solution: Subtract a fixed deflation factor from every self-assigned score before evaluation.

Raw score:     Novelty=9, Impact=8, Feasibility=9
Deflated (-3): Novelty=6, Impact=5, Feasibility=6
Composite:     5.3 (threshold: 4.5)

The deflation factor (3) was derived empirically from the audit: the median gap between self-assigned and honest scores across 36 papers was 4.5. We chose 3 as a conservative correction. The recalibrated thresholds:

Gate 2: Topic Deduplication

Problem: The agent wrote about SSRF five times because it had no memory of previous topics.

Solution: A persistent topic registry extracted from every successful output.

After each run, the agent extracts topic keywords from the hypothesis title and stores them in a covered_topics list. Before the next run, covered topics are injected at the top of the hypothesis generation prompt:

TOPICS ALREADY COVERED -- DO NOT REVISIT:
  x  ssrf (5x: 2026-04-29, 05-04, 05-10, 05-11, 05-18)
  x  prompt injection + rce (2026-05-02)
  x  sandbox + sandbox escape (2026-05-03)
  ...26 topics total

We seeded the initial registry by extracting topics from the 27 successful papers in the feedback history. The topic matching uses a fixed keyword vocabulary (36 terms covering the major attack classes) plus a normalized title prefix as fallback.

Gate 3: Feasibility Validation

Problem: Hypotheses reference data that does not exist in the intelligence context.

Solution: A separate Claude instance (Opus, temperature 0.0) receives the hypothesis AND the raw intelligence data, and verifies that every cited number, CVE, and finding actually appears in the data.

This gate existed before the rebuild, but its prompt was insufficiently specific about what constitutes a verification failure. The revised prompt includes explicit instructions:

Gate 4: Originality Validation

Problem: The agent reads an ArXiv paper and produces a summary of it as “original research.”

Solution: A dedicated originality gate that classifies each hypothesis into one of three categories:

Category Definition Verdict
DERIVATIVE The core claim already appears in the cited sources REJECT
NOVEL_SYNTHESIS Combines sources in a way none of them do individually PASS
NOVEL_EMPIRICAL Tests something nobody has tested before PASS

The originality prompt asks a single question: “Does the hypothesis tell the reader something they could NOT learn by reading the cited sources?”

The gate also requires the agent to articulate a novel_delta — a specific statement of what is new beyond the sources. This field is required in the hypothesis JSON and evaluated by the originality validator.

In our first test run with the originality gate active, H002 (temporal evasion attacks on autonomous agents) passed as NOVEL_SYNTHESIS:

“No individual source performs the cross-referencing that maps the empirically demonstrated temporal evasion primitive (from A3S-Bench) and the real-world variant (Two-Document Chain Injection) to the specific gap in D20/D11 coverage.”

The Ma et al. paper measures temporal evasion rates. The Sentinel finding reports two-document chain injection. Neither source says “this implies a gap in adaptive jailbreak resilience testing because fragment reassembly via stateful reasoning is structurally distinct from conversational drift.” That connection is the original contribution.

Gate 5: CVE Verification

Problem: 69% of papers contained fabricated CVE identifiers.

Solution: Before generating the brief, extract every CVE-YYYY-NNNNN pattern from the hypothesis text and verify each against the NVD REST API.

# Simplified version of the verification logic
async def verify_cves_against_nvd(text: str) -> dict:
    cve_ids = re.findall(r'CVE-\d{4}-\d{4,7}', text)
    verified, unverified = [], []
    for cve_id in cve_ids:
        resp = await client.get(NVD_API, params={"cveId": cve_id})
        if resp.json().get("totalResults", 0) > 0:
            verified.append(cve_id)
        else:
            unverified.append(cve_id)
    return {"verified": verified, "unverified": unverified}

Unverified CVEs trigger a confidence downgrade: the brief’s CONFIDENCE field is forced to LOW, and a warning is injected into the generation context telling the model not to cite unverified CVEs as confirmed evidence.

Gate 6: Content Validation

Problem: Even with a validated hypothesis, the generated brief can fabricate data or overclaim.

Solution: Post-generation validation by Claude Opus (temperature 0.0) that checks the brief content against the raw intelligence data. Any number, CVE, or claim not traceable to the provided context triggers rejection.

For briefs (vs. the old 25-page papers), we reduced this to a single review pass. The mandatory brief structure — FINDING / EVIDENCE / REGISTRY GAP / ACTION / CONFIDENCE — forces the model to make specific, verifiable claims rather than diffuse academic prose.

Gate 7: Sandbox Empirical Testing

Problem: Zero of 70 PoC scripts actually tested anything against a real system.

Solution: After generating the brief, the agent crafts 3–5 targeted payloads from the hypothesis, sends them to a local LLM (Llama 3.1 8B via Ollama), collects responses, and uses a judge to determine whether the attack was successful.

The sandbox validation produces structured empirical results:

SANDBOX VALIDATION
  Target: llama3.1:8b (local Ollama)
  Payloads: 5 sent, 2 triggered
  Verdict: PARTIALLY_CONFIRMED
  [1] TRIGGERED (conf=80%) -- indicators: leaked system prompt
  [2] BLOCKED (conf=10%) -- indicators: none
  [3] TRIGGERED (conf=72%) -- indicators: followed injected instruction
  ...

This section is appended to the brief, giving every output at least some empirical grounding. The target model (Llama 3.1 8B) is intentionally small and poorly defended — the goal is not to claim “we broke GPT-4” but to provide a minimum viable empirical signal that the attack primitive is real.


4. Supporting Infrastructure

The seven gates are the core innovation, but three supporting capabilities amplify their effectiveness:

Multi-Source Triangulation

Before hypothesis generation, all intelligence sources are cross-referenced by keyword overlap to form “corroborated clusters.” A Sentinel finding about LiteLLM SSRF that is also present as an NVD CVE, mentioned in an ArXiv paper, and listed in a GitHub Security Advisory forms a HIGH-confidence cluster (4 independent sources). A Sentinel finding with no external corroboration remains LOW-confidence.

The triangulator identified 7 clusters from 47 items in our test run, with 3 meeting the 2+ source threshold. Corroborated clusters are presented to the hypothesis generator as “STRONGLY PREFERRED” topics, while single-source items are flagged as “use only if no corroborated clusters are available.”

Full-Text Paper Fetching

The original agent read ArXiv abstracts and cited papers it had never read. The rebuilt pipeline downloads the top 3 most relevant ArXiv PDFs, extracts text via PyMuPDF, and provides up to 8,000 characters of actual paper content. This turns “paper X describes technique Y” (possibly wrong) into “paper X, on page 7, states: [direct quote]” (verifiable).

Discovery-Research Fusion

When a brief identifies a specific registry gap and the gap is confirmed by the sandbox validation, the pipeline automatically generates a test specification (TestDef + payloads) ready for integration into the test registry. Each run can now produce two outputs: a threat intelligence brief (for humans) and a test spec (for the platform). This closes the loop between research and operational capability.


5. Output Format: From 25-Page Papers to 500-Word Briefs

The most impactful change was also the simplest: we replaced the 25-page academic paper format with a 500-word structured threat intelligence brief.

THREAT INTEL BRIEF -- [title]
---
FINDING:      [1-2 sentences]
EVIDENCE:     [verified CVE IDs + Sentinel finding IDs]
REGISTRY GAP: [specific domain/test, or "already covered"]
ACTION:       [concrete test spec, or "no action needed"]
CONFIDENCE:   [HIGH/MEDIUM/LOW based on source count]
---
ANALYSIS:     [3-5 paragraphs max]
REFERENCES:   [real URLs only]

SANDBOX VALIDATION
  [empirical results from local testing]

The format change had three effects:

  1. Reduced hallucination surface: 500 words leaves no room for filler, speculation, or unsupported claims. Every sentence must earn its place.
  2. Forced specificity: The ACTION field requires a concrete test spec (“Add d15_04 targeting multi-document combinatorial injection”) or an explicit “no action needed.” There is no middle ground.
  3. Reduced cost: One review pass instead of two. 2,000 max tokens instead of 16,000. Approximately 60% reduction in per-run API cost.

6. Quantitative Results

Before and After

Metric Before (v1) After (v2)
Output format 25-page academic paper 500-word structured brief
Validation gates 2 7
CVE verification None Live NVD API check
Empirical testing None (grep-based PoCs) Sandbox payloads against local LLM
Topic deduplication None 26-topic persistent registry
Originality check None DERIVATIVE/NOVEL_SYNTHESIS/NOVEL_EMPIRICAL
Score calibration Raw self-assigned Deflated by -3 (empirical correction)
Source intelligence Abstracts only Full-text PDF extraction (top 3 papers)
Source triangulation None Multi-source clustering (2+ source threshold)
API cost per run ~$2.50 ~$1.20
Publishable outputs 0/36 (0%) Under evaluation

Gate Rejection Rates (from test runs)

Gate Function Typical rejection rate
Score deflation Blocks inflated mediocrity ~40% of hypotheses
Topic deduplication Blocks rehashed topics ~20%
Feasibility validation Blocks ungrounded claims ~15%
Originality validation Blocks derivative summaries Under evaluation
CVE verification Flags fabricated CVEs ~30% of cited CVEs
Content validation Blocks fabricated brief content ~10%
Sandbox testing Provides empirical signal N/A (informational, not a filter)

7. What We Learned

Lesson 1: Self-evaluation is not evaluation

An AI system rating its own output quality is measuring confidence, not accuracy. The 4.5-point gap between self-assigned and honest scores was consistent across 36 papers. This is not a fixable prompt engineering problem — it is a structural property of self-evaluation. The solution is external validation: independent model instances with adversarial prompts, live database verification, and empirical testing.

Lesson 2: Format constrains hallucination

A 500-word brief hallucinates less than a 25-page paper for the same reason a tweet contains fewer lies than a novel: there is less surface area to fill. The academic paper format actively encourages hallucination because the model must produce 20+ pages of content from 2–3 data points. The brief format forces specificity and penalizes padding.

Lesson 3: The best output of a research agent is not research

Our agent’s real product was never papers. It was threat intelligence: “CVE-2026-42208 affects LiteLLM, is corroborated by 4 independent sources, and our test registry does not cover this attack class. Here is a test spec to add.” Renaming the output from “papers” to “threat intel briefs” was not just a cosmetic change — it aligned the format with the actual capability and eliminated the pressure to manufacture academic novelty from operational data.

Lesson 4: Originality is the hardest gate to engineer

Feasibility validation (does the data exist?) and CVE verification (is this CVE real?) are deterministic checks. Originality (“does this say something the sources don’t already say?”) is fundamentally a judgment call. Our originality gate works because it asks a narrow, answerable question — not “is this novel?” but “could a reader learn this from the cited sources alone?” This reduces a philosophical question to a textual comparison.

Lesson 5: Empirical grounding changes everything

Even a minimal empirical signal — five payloads against an 8B parameter model — transforms a brief from “we think this attack might work” to “we tested it and observed X.” The sandbox results are not publication-grade evidence, but they are infinitely more evidence than zero.


8. Architecture Summary

              +------------------+
              |  45+ Intel       |
              |  Sources         |
              +--------+---------+
                       |
              +--------v---------+
              |  Web Intelligence |  ArXiv full-text (3 PDFs)
              |  NVD, GHSA,      |  + Semantic Scholar
              |  Scholar          |
              +--------+---------+
                       |
              +--------v---------+
              |  Triangulator    |  Cluster by source overlap
              |  (2+ sources =   |  HIGH/MEDIUM/LOW confidence
              |   corroborated)  |
              +--------+---------+
                       |
              +--------v---------+
              |  Topic Dedup     |  26 covered topics blocked
              +--------+---------+
                       |
              +--------v---------+
              |  Hypothesizer    |  Claude Opus, scores deflated -3
              |  (novel_delta    |
              |   required)      |
              +--------+---------+
                       |
            +----------v----------+
            | Gate 3: Feasibility |  Data grounding check
            +----------+----------+
                       |
            +----------v----------+
            | Gate 4: Originality |  DERIVATIVE --> REJECT
            |                     |  NOVEL_SYNTHESIS --> PASS
            |                     |  NOVEL_EMPIRICAL --> PASS
            +----------+----------+
                       |
            +----------v----------+
            | Gate 5: CVE Verify  |  Live NVD API
            +----------+----------+
                       |
            +----------v----------+
            | Brief Generation    |  500 words max, structured
            +----------+----------+
                       |
            +----------v----------+
            | Gate 6: Content     |  Anti-fabrication review
            | Validation          |
            +----------+----------+
                       |
            +----------v----------+
            | Gate 7: Sandbox     |  5 payloads vs Ollama local
            | Empirical Test      |  Judge verdicts appended
            +----------+----------+
                       |
            +----------v----------+
            | Auto Test Spec      |  TestDef + payloads if gap
            +----------+----------+
                       |
                   +---v---+
                   | Brief |  + Sandbox Results
                   | .md   |  + Test Spec (if gap)
                   +-------+

Technology Stack

Component Technology
Hypothesis generation Claude Opus 4.6, temperature 0.5
Brief generation Claude Sonnet 4, temperature 0.4
Validation + Review + Originality Claude Opus 4.6, temperature 0.0
Sandbox target Llama 3.1 8B via Ollama (local)
CVE verification NVD REST API (free, no auth)
Paper full-text extraction PyMuPDF
External intelligence ArXiv API, NVD, GitHub Advisories, Semantic Scholar
Internal intelligence Sentinel (45 sources), ATLAS coverage, scan results, registry
Codebase 5,284 lines Python across 15 modules

9. Where This Sits in the Landscape

AI security agents are proliferating. OpenAI’s Aardvark scans codebases and has disclosed 10 CVEs. Google’s Big Sleep (with Project Zero) found the first AI-prevented zero-day in SQLite. Anthropic’s Mythos built working exploits for 50%+ of selected CVEs autonomously. Microsoft’s multi-model system found 16 new vulnerabilities in Windows.

All of these are attack-execution agents: they find bugs in code, build exploits, or red-team targets. None of them are research agents that must generate novel hypotheses about undiscovered attack primitives.

The distinction matters because execution agents have a ground truth — the exploit either works or it does not. Research agents do not. A research agent that generates a hypothesis about a “novel attack primitive” has no immediate way to verify whether the primitive is actually novel, whether the cited evidence is real, or whether the conclusion follows from the premises.

This is why the validation-gate architecture is necessary. Guardrails products (Lakera, NeMo Guardrails) solve the wrong problem — they prevent chatbots from saying harmful things. Our gates prevent a research agent from believing its own hallucinations and entering them into a vulnerability disclosure pipeline.

The closest parallel in the published landscape is PapersFlow’s Chain-of-Verification (CoVe), which verifies citations in AI-generated literature reviews. Our system extends this pattern to a domain where the stakes are higher: a fabricated CVE that enters a disclosure pipeline wastes vendor resources, damages credibility, and could constitute irresponsible disclosure.


10. Open Research Questions

Three research questions emerged from the 36-paper audit that survived as genuinely unanswered:

  1. Evaluation Awareness: Can frontier models detect when they are being evaluated by security testing frameworks? If so, every safety benchmark is potentially unreliable.
  2. MoE Expert Routing Manipulation: Can crafted inputs influence Mixture-of-Experts routing decisions in black-box API settings to selectively bypass safety-aligned experts?
  3. Multi-Document Combinatorial Injection: Can individually benign documents produce malicious behavior when an LLM processes them together? Per-document content filtering assumes documents can be evaluated for safety independently — an assumption worth testing.

These questions form the basis of our ongoing research program. We are using the rebuilt Research Agent pipeline to generate empirically grounded hypotheses and test them against sandbox targets before publication.


Contact