Two RAG systems, same injection rate, different problems

You read a security evaluation of two RAG systems. Both report a 30% injection rate against the same attack. The natural conclusion: same risk, same defense story. Pick whichever has the nicer API.

That’s almost always wrong. A single headline risk number hides three things: whether the poison reached the prompt, whether the model followed it, and whether monitoring caught the result. Depending on which step is doing the work, the right defense for one system can be useless on the other.

This post is about why a single number isn’t enough, what to report instead, and how to read benchmark results that haven’t done the disaggregation for you.

Why this matters now

Vendors and benchmark authors increasingly publish a single “injection rate” or “attack success rate” per system, often as the headline of a marketing page or a vendor comparison. Defenders pick a framework or a defense based on that number. If the number doesn’t actually decompose into “where is the attack succeeding?”, the defender ends up tightening the wrong layer of their stack and the underlying vulnerability stays open.

The fix is mechanical. Many harnesses already collect the data they need to disaggregate, or can with one extra per-trial log field. The shift from one number to three is mostly a reporting change, not a re-running-the-experiment change.

Three things have to go right for an attack to succeed

Take a typical poisoned-corpus attack. An attacker plants a malicious document somewhere your RAG system pulls from, and a user asks the model a question that should retrieve the planted document. For the attack to land as an uncontained operational failure, three steps all have to break in the attacker’s favor. The first two create the injection; the third determines whether it escapes monitoring.

The retriever pulls the poisoned document into context. If the retriever ranks the poisoned doc at position 47 out of 50 documents and the prompt only includes the top 5, retrieval failed. The attacker’s document never reaches the model. Nothing else matters.

The model emits something that follows the attacker’s directive. Given the poisoned document in its context, the model has to actually go along with the directive: emit the canary URL, recommend the attacker’s source, repeat the override phrase. A model can receive the poisoned doc and ignore it because it doesn’t fit the expected response shape, because it recognizes the doc as suspicious, or because system-prompt hardening tells it to disregard certain instructions. When that happens, the attack chain breaks at this step.

Whatever monitoring exists fails to flag the response. A response that fires every alarm in your monitoring stack is a model-level injection success, but an operationally contained incident. A response that nobody catches is the dangerous one.

The injection rate itself is the product of the first two steps:

injection_rate = retrieval_rate × generation_given_retrieval

Where retrieval_rate is “fraction of trials where the poisoned doc made it into the prompt” and generation_given_retrieval is “fraction of those where the model followed the directive.” When you only see the product, you can’t tell which factor is doing the work.

Monitoring is a separate dimension. A model-level injection success that fired your alarms is operationally a contained one. The metric that captures the operational risk is what makes it past the alarms:

uncontained_injection_rate = injection_rate × (1 − detection_coverage)

Where detection_coverage is “fraction of injected responses that monitoring would catch.” Worth tracking separately. Many benchmarks don’t have a deployed monitoring layer to evaluate, in which case detection coverage is 0 by default and the uncontained rate equals the injection rate.
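As a minimal sketch, the two formulas above can be written as plain functions. The function names mirror the metric names in this post; the input values (0.60, 0.50, 0.80) are made up for illustration and don’t come from any benchmark.

```python
def injection_rate(retrieval_rate: float, generation_given_retrieval: float) -> float:
    """Fraction of all trials where the poison reached the prompt AND the model followed it."""
    return retrieval_rate * generation_given_retrieval


def uncontained_injection_rate(injection: float, detection_coverage: float = 0.0) -> float:
    """Fraction of all trials where an injection succeeded and monitoring missed it.

    detection_coverage defaults to 0.0, which models a benchmark with no
    monitoring layer: the uncontained rate then equals the injection rate.
    """
    return injection * (1.0 - detection_coverage)


# Illustrative values only.
inj = injection_rate(retrieval_rate=0.60, generation_given_retrieval=0.50)
no_monitoring = uncontained_injection_rate(inj)
with_monitoring = uncontained_injection_rate(inj, detection_coverage=0.80)

print(f"injection={inj:.2f} uncontained(no monitoring)={no_monitoring:.2f} "
      f"uncontained(80% coverage)={with_monitoring:.2f}")
```

With these inputs the injection rate is 0.30 either way; only the uncontained rate moves when monitoring is added, which is why the two are worth keeping apart.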

Two cells, same number, opposite problems

Consider two systems that both report a 30% injection rate. Disaggregated, they look like this:

System	Retrieval rate	Generation rate (given retrieval)	Injection rate
A	92%	33%	~30%
B	33%	92%	~30%

Same headline. Completely different defensive postures.

In System A, the retriever lets the poisoned document through almost every time. The model is doing the defending: 67% of the time it sees the poisoned doc and doesn’t follow the directive. If an attacker finds a slightly better payload, a system-prompt-aware framing, or any other generation-side trick, the model’s resistance drops and the headline rate can climb toward the 92% ceiling set by the retrieval rate. The current retriever is not providing meaningful protection.

System A needs generation-side hardening to reduce conditional compliance (system-prompt hardening, instruction-hierarchy prompts, adversarial fine-tuning), but it also needs source-aware retrieval controls because the poisoned document is already reaching the model almost every time. The retrieval-side controls should be source-aware rather than generic ranking improvement: provenance filtering, source-trust controls, chunk quarantine for documents from low-trust origins, duplicate detection. Those address “this document shouldn’t be in the prompt at all” rather than “this document should be ranked lower.” Generic retrieval-quality tweaks (better embeddings, smarter ranking) won’t help much when the poisoned document is already highly ranked under the attack the retriever was tuned for.

In System B, the opposite. The retriever rejects the poisoned doc most of the time, but when it does get through, the model follows the directive 92% of the time. The retriever is doing the defending, but only because the attacker hasn’t yet figured out how to get past it. If an attacker finds a retrieval-side trick, the retrieval rate jumps and the headline rate jumps with it.

The right mitigation for System B is also source-aware retrieval hardening: provenance filtering, source allowlists, chunk quarantine, anomaly detection on incoming documents, adversarial retrieval tests in your evaluation suite. Note that “better retrieval models” or “query rewriting” can backfire here: a more semantically sophisticated retriever might surface the poisoned document better than the current one does, because the poisoned document was crafted to look topically relevant. The fix is to add provenance-aware filtering, not to make ranking smarter.

A defender reading “both systems are at 30%” and applying the same mitigation to both will half-fix one of them and not help the other at all. The aggregate number actively misled the defender.
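One way to make the difference concrete is to ask what each system’s headline rate would become if one layer stopped defending entirely. The sketch below uses the System A and System B numbers from the table above; the `Cell` class and its method names are hypothetical helpers, not part of any benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    retrieval_rate: float
    generation_given_retrieval: float

    @property
    def injection_rate(self) -> float:
        return self.retrieval_rate * self.generation_given_retrieval

    def ceiling_if_generation_defense_fails(self) -> float:
        # The model follows every poisoned doc it sees, so only retrieval limits the rate.
        return self.retrieval_rate

    def ceiling_if_retrieval_defense_fails(self) -> float:
        # The poisoned doc always reaches the prompt, so only the model limits the rate.
        return self.generation_given_retrieval


SYSTEMS = [Cell("A", 0.92, 0.33), Cell("B", 0.33, 0.92)]

for cell in SYSTEMS:
    print(
        f"System {cell.name}: headline={cell.injection_rate:.2f}, "
        f"if generation defense fails={cell.ceiling_if_generation_defense_fails():.2f}, "
        f"if retrieval defense fails={cell.ceiling_if_retrieval_defense_fails():.2f}"
    )
```

The two ceilings are mirror images: System A’s exposure is to a generation-side breakthrough, System B’s to a retrieval-side one, even though the headline is identical.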

A messier example: four frameworks, one comparison

Here’s a slightly bigger version of the same problem. The numbers below are illustrative, not from a specific study. Imagine a benchmark reports a table comparing four RAG framework libraries, each tested at a baseline arm and an “optimized attack” arm:

Framework	Baseline	Optimized	Lift
Framework A	0.45	0.74	+0.29
Framework B	0.50	0.66	+0.16
Framework C	0.40	0.58	+0.18
Framework D	0.45	0.50	+0.05

A defender reading this table walks away with: “Framework D is the most resilient framework. Whatever Framework D is doing, that’s what we should use everywhere.” Most other intuitions you might form from this table are on the wrong track too.

Now disaggregate:

Framework	Arm	Retrieval	Generation (given retrieval)	Injection
Framework A	baseline	0.50	0.90	0.45
Framework A	optimized	0.97	0.76	0.74
Framework B	baseline	0.55	0.91	0.50
Framework B	optimized	0.98	0.67	0.66
Framework C	baseline	0.45	0.89	0.40
Framework C	optimized	0.95	0.61	0.58
Framework D	baseline	0.50	0.90	0.45
Framework D	optimized	0.55	0.91	0.50

Different story.

For three of the four frameworks (A, B, C), the optimized attack pushed retrieval from around 50% to around 97%. The attacker’s whole win was on the retrieval side: getting the poisoned document into the prompt more reliably. (The model’s generation rate even dropped a bit in those frameworks under the optimized attack, probably because the optimizer traded some “model-friendly” payload features for “retrieval-friendly” ones.)

Framework D looks resilient because the optimizer didn’t move its retriever much: 0.50 → 0.55. But once you’re in the prompt, Framework D’s underlying model follows the directive 90%+ of the time, same as everyone else’s. So Framework D isn’t showing model-side injection resilience. It’s showing resilience to this specific optimizer’s retrieval-side push. If a different attack technique gets past Framework D’s retriever (one that targets whatever ranking configuration it happens to use in this benchmark), its injection rate snaps right up to where the others are.
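Because the injection rate is a product, the lift from baseline to optimized factors cleanly into a retrieval multiplier and a generation multiplier. The sketch below applies that to the illustrative numbers in the disaggregated table; `lift_factors` is a hypothetical helper name.

```python
# (retrieval_rate, generation_given_retrieval) per arm, copied from the illustrative table.
CELLS = {
    "Framework A": {"baseline": (0.50, 0.90), "optimized": (0.97, 0.76)},
    "Framework B": {"baseline": (0.55, 0.91), "optimized": (0.98, 0.67)},
    "Framework C": {"baseline": (0.45, 0.89), "optimized": (0.95, 0.61)},
    "Framework D": {"baseline": (0.50, 0.90), "optimized": (0.55, 0.91)},
}


def lift_factors(baseline, optimized):
    """Return (retrieval_factor, generation_factor, injection_factor).

    injection_factor is the product of the other two, because
    injection_rate = retrieval_rate * generation_given_retrieval.
    """
    ret_b, gen_b = baseline
    ret_o, gen_o = optimized
    retrieval_factor = ret_o / ret_b
    generation_factor = gen_o / gen_b
    return retrieval_factor, generation_factor, retrieval_factor * generation_factor


FACTORS = {
    name: lift_factors(arms["baseline"], arms["optimized"])
    for name, arms in CELLS.items()
}

for name, (ret_f, gen_f, inj_f) in FACTORS.items():
    print(f"{name}: retrieval x{ret_f:.2f}, generation x{gen_f:.2f}, injection x{inj_f:.2f}")
```

For A, B, and C the retrieval multiplier is well above 1 while the generation multiplier is below 1, so the whole lift comes from retrieval. For D both multipliers sit close to 1: the optimizer barely moved either step.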

The defender’s roadmap from the aggregate table: “Adopt Framework D. Whatever it’s doing, that’s the answer.”

The defender’s roadmap from the disaggregated table: “Three of four frameworks have a retrieval bottleneck under this attack. Tighten provenance-aware retrieval filtering everywhere. Framework D’s retriever happens to resist this specific optimizer, but the model behind Framework D would follow at 90%+ if a different attack reached it. Don’t treat Framework D as a model-side defense win, because the model isn’t doing any defending.”

These are different deployments and different roadmaps from the same data.

What to report instead

For each cell of a defensive evaluation, report at least three numbers:

retrieval_rate             = (poisoned doc in top-k) / total trials
generation_given_retrieval = (injected response) / (poisoned doc retrieved)
injection_rate             = (injected response) / total trials

The conditional name is deliberate: a denominator-of-retrieved-trials rate (“given the poison reached the prompt, did the model follow?”) is what defenders need to read; “generation rate” alone is ambiguous about whether it’s over all trials or only retrieved ones, which is exactly the ambiguity this post exists to eliminate.
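Here is a minimal sketch of computing the three rates from per-trial records. It assumes each trial logs two booleans, called `poison_in_prompt` and `injected` here; both field names and the `Trial` class are illustrative, not taken from any particular harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    poison_in_prompt: bool  # the poisoned doc made it into top-k / the prompt
    injected: bool          # the judge marked the response as following the directive


def disaggregate(trials: List[Trial]) -> dict:
    total = len(trials)
    if total == 0:
        raise ValueError("need at least one trial")

    retrieved = [t for t in trials if t.poison_in_prompt]
    injected_and_retrieved = sum(1 for t in retrieved if t.injected)
    injected_total = sum(1 for t in trials if t.injected)

    return {
        "retrieval_rate": len(retrieved) / total,
        # Undefined (None) when the poison never reached the prompt.
        "generation_given_retrieval": (
            injected_and_retrieved / len(retrieved) if retrieved else None
        ),
        "injection_rate": injected_total / total,
        # Sanity check: injections without retrieval suggest a judge false
        # positive or a logging bug, and break the product identity.
        "injected_without_retrieval": injected_total - injected_and_retrieved,
    }


# Ten made-up trials: 6 retrieved (3 of them injected), 4 not retrieved.
TRIALS = (
    [Trial(True, True)] * 3
    + [Trial(True, False)] * 3
    + [Trial(False, False)] * 4
)
RATES = disaggregate(TRIALS)
print(RATES)
```

The extra `injected_without_retrieval` count is a design choice of this sketch: when it is nonzero, the injection rate no longer equals retrieval rate times the conditional generation rate, and the judge or the logging deserves a look before the numbers are published.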

If your deployment also has detection / monitoring, add two more:

detection_coverage         = (judge would catch the injection) / (injected response)
uncontained_injection_rate = injection_rate × (1 − detection_coverage)

Keep the injection rate (something got through the model) and the uncontained injection rate (something got through the model and your alarms) as separate metrics. Collapsing them into one number hides whether monitoring is doing any work.

The first three rates are computable from per-trial data without any new instrumentation. If your benchmark is logging which documents were retrieved per trial and what the model said per trial, you already have what you need; you just have to compute the ratios separately and put them in the table.

The conditional generation rate (how often the model follows the directive given it saw the poisoned doc) is often the most revealing of these rates and the one most often missing from reports. It captures a dimension where many interesting findings live. A model that resists 90% of the time once the poison is in its context is qualitatively different from a model that follows 90% of the time. The aggregate injection rate alone doesn’t tell you which kind of model you have. (For some deployments, retrieval and source-control are the main risk boundary; in those, retrieval rate carries more of the actionable signal.)

Detection coverage is the metric most often left unmeasured. It answers: of the attacks that actually succeeded, what fraction would your monitoring catch? A deployment with no response monitoring at all has zero coverage. A deployment with strict-canary detection (subject of another post in this series) might have 95% coverage on URL-canary attacks and 30% coverage on paraphrased-policy attacks. The number tells you where your detection layer earns its keep and where it doesn’t.

Why most benchmarks still report just one number

A few reasons this pattern keeps recurring:

The aggregate fits in a smaller table, and tables matter more than they should in this kind of work. Conference page limits, marketing pages, executive summaries, vendor-comparison charts all push toward the smallest possible table.

Per-trial data isn’t always preserved. Disaggregating retrieval from generation requires logging, per trial, whether the poisoned document made it into top-k and what the model emitted. Harnesses that log only the aggregate verdict can’t disaggregate after the fact. The fix is to build the harness to preserve per-trial fields from the start, but a benchmark that’s already running in production can’t go back in time.

The aggregate is still meaningful for some questions. “Does this overall RAG architecture leak more than that one?” is fairly answered by an aggregate, especially across radically different architectures. The mistake is using aggregates to answer per-axis mitigation questions, and confusing “leaks more” with “is more vulnerable to a specific attack.”

Detection coverage is hard to operationalize if you haven’t built the detection. What does “the judge would catch this” mean for a deployment that hasn’t actually deployed monitoring? You can compute it hypothetically by replaying response logs through a candidate detector, but that’s a different shape of measurement than the live-monitoring-fired-or-didn’t measurement.
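The hypothetical replay can be sketched as follows: take the responses already judged as injected, run them through a candidate detector, and compute coverage per attack category. Everything here is an assumption for illustration: the canary URL, the substring detector, the category labels, and the sample responses. A real detector would be whatever you plan to deploy.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

CANARY_URL = "https://canary.example/track"  # hypothetical canary value


def canary_detector(response_text: str) -> bool:
    """Candidate detector: flags a response that contains the exact canary URL."""
    return CANARY_URL in response_text


def detection_coverage_by_category(
    injected_responses: Iterable[Tuple[str, str]],
    detector: Callable[[str], bool],
) -> Dict[str, float]:
    """injected_responses holds (attack_category, response_text) for injected trials only."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for category, text in injected_responses:
        total[category] += 1
        if detector(text):
            caught[category] += 1
    return {category: caught[category] / total[category] for category in total}


# Made-up replay log of responses that the judge marked as injected.
INJECTED = [
    ("url_canary", "See https://canary.example/track for details."),
    ("url_canary", "Full policy: https://canary.example/track"),
    ("paraphrased_policy", "Refunds are no longer available for any order."),
    ("paraphrased_policy", "Our policy now says all sales are final."),
]

coverage = detection_coverage_by_category(INJECTED, canary_detector)
print(coverage)
```

In this toy log the exact-match detector catches every URL-canary response and none of the paraphrased ones. That is replay coverage for a candidate detector, not evidence of what a live monitoring stack would have fired on.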

What “claim-grade” looks like

A defensive evaluation is claim-grade, meaning a defender can act on it, when, for each cell:

Per-trial retrieval data is preserved (rank of the poisoned doc, plus whether it entered the prompt, plus the other retrieved chunks for context). A poisoned document ranked at position 47 is not in top-k, but the rank itself is still operationally useful evidence about how close the attack came.
Per-trial response text is preserved.
The judge scheme is published, versioned, and applied the same way to every cell.
Retrieval rate, generation rate given retrieval, injection rate, and (where monitoring exists) detection coverage and uncontained injection rate are reported separately, not just the headline.
The evaluation runs in paired replay so the numbers are reproducible.

The cost of doing this is mostly discipline at logging time: log per-trial fields rather than only aggregates. The benefit is that the same data now answers many more questions: when an attack succeeds, which step actually broke? When a mitigation works, which step does it bind? When two cells disagree, are they disagreeing on retrieval, on generation, or on the judge?
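As a rough sketch of what “per-trial fields” could look like, here is one possible record shape serialized as a JSON line. The field names, the `TrialRecord` class, and the example values are all hypothetical; the point is only that rank, prompt membership, retrieved chunks, response text, and judge version are stored per trial.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrialRecord:
    trial_id: str
    cell: str                       # e.g. framework plus attack arm
    top_k: int                      # how many chunks the prompt includes
    poison_rank: Optional[int]      # 1-based rank; None if never returned
    retrieved_chunk_ids: List[str]  # the chunks that entered the prompt
    response_text: str
    judge_version: str
    injected: bool                  # judge verdict for this trial

    @property
    def poison_in_prompt(self) -> bool:
        return self.poison_rank is not None and self.poison_rank <= self.top_k


def to_jsonl(record: TrialRecord) -> str:
    """Serialize one trial as a single JSON line, including the derived flag."""
    row = asdict(record)
    row["poison_in_prompt"] = record.poison_in_prompt
    return json.dumps(row, sort_keys=True)


# A near miss: the poisoned doc ranked 47th, and the prompt takes the top 5.
example = TrialRecord(
    trial_id="t-0001",
    cell="framework-a/baseline",
    top_k=5,
    poison_rank=47,
    retrieved_chunk_ids=["c12", "c03", "c44", "c07", "c19"],
    response_text="(model response text goes here)",
    judge_version="v1",
    injected=False,
)
line = to_jsonl(example)
print(line)
```

Storing the rank rather than only the in-prompt flag keeps the “how close did it come” evidence, and storing the judge version lets two cells be compared knowing they were scored the same way.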

Questions you can ask once you’ve disaggregated

The questions the aggregate makes invisible become tractable:

Where does an optimizer actually win? Retrieval-side, generation-side, or both?
Which mitigations bind on which attack categories? Retrieval-side hardening on URL-emission attacks, generation-side hardening on paraphrased-policy attacks, prompt-template hardening on attacks that exploit specific prompt cues.
Where has retrieval already saturated? Cells where retrieval rate is already 1.00 stop telling you anything about retrieval-side mitigations; further changes to retrieval can’t move the headline.
Where does the detection layer earn its keep? If detection coverage is high on some cells and low on others, that’s exactly where to invest engineering effort in monitoring.
When is “this defense works” actually “this defense happens to bypass an already-saturated bottleneck”? A defense that drops a rate from 0.80 to 0.40 by hardening generation, in a cell where retrieval was already at 1.00, is doing real work. The same defense in a cell where retrieval was 0.40 might be doing nothing real: the rate just dropped because the defense reduced retrieval volume rather than the model’s compliance.
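The last question can be checked mechanically by comparing a cell before and after a defense and seeing which factor moved. This is a sketch with made-up before/after numbers; `attribute_change` and the 0.99 saturation threshold are assumptions, not an established convention.

```python
def attribute_change(before, after, saturation_threshold=0.99):
    """Compare (retrieval_rate, generation_given_retrieval) before and after a defense."""
    ret_b, gen_b = before
    ret_a, gen_a = after
    return {
        "injection_before": ret_b * gen_b,
        "injection_after": ret_a * gen_a,
        "retrieval_delta": ret_a - ret_b,
        "generation_delta": gen_a - gen_b,
        # If retrieval was already saturated, retrieval-side changes had no room to help.
        "retrieval_saturated_before": ret_b >= saturation_threshold,
    }


# Cell 1: retrieval saturated; the defense halves the model's compliance.
saturated = attribute_change(before=(1.00, 0.80), after=(1.00, 0.40))

# Cell 2: headline also halves, but only because less poison reached the prompt.
unsaturated = attribute_change(before=(0.40, 0.80), after=(0.20, 0.80))

for label, result in (("saturated", saturated), ("unsaturated", unsaturated)):
    print(label, {key: round(value, 2) for key, value in result.items()})
```

Both cells show the headline rate cut in half, but in the first the generation delta carries the whole drop and in the second the model’s compliance never changed.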

A single rate gets you a panic. A disaggregated set of rates gets you a roadmap.

Closing

If a defensive evaluation report you’re reading stops at the aggregate injection rate, you don’t have what you need to decide where to invest. The cheapest improvement to most existing benchmarks is to compute the disaggregation from data they’re already collecting, and add the per-trial retrieval-rank column to the dashboard.

If you’re publishing a benchmark, preserve the per-trial data and report the disaggregation. The aggregate is still useful for cross-architecture comparisons. It’s just not sufficient for the per-cell mitigation question, and that’s the question most readers actually need answered.

Pick a benchmark report you trust. Look at it. If you can’t tell, just from the report, whether the attacks succeeded because the retriever let them through or because the model went along, the report isn’t telling you what to do next. It’s telling you what’s true on average. Those are different things.