A reading list and a monthly digest.
Curated work shaping how I think about offensive security, ML supply chain, and the seams between them. The reading list is evergreen; the digest goes out monthly when there's something worth saying.
Reading list
Pushes provably-secure linguistic steganography toward higher embedding capacity by maintaining a list of candidate decodings rather than a single one. Directly relevant to the entropy-budget question in any LLM-mediated covert-channel design.
Style-level (not token- or syntax-level) backdoor triggers, generated by an LLM as a poisoned-sample synthesizer. Adds an auxiliary target loss to stabilize payload injection during fine-tuning. Evaluated against seven model families.
The canonical reference for byte-level steganography in float32 weight tensors. Explicitly defers the decoder to a separately-deployed loader, which is the substantive limitation when read against a channel/decoder/substrate framework.
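A minimal sketch of what byte-level embedding in float32 weights can look like, assuming one payload byte is written into the least-significant mantissa byte of each weight. The names `embed` and `extract` are hypothetical and this is not the referenced work's implementation. It only illustrates why the perturbation stays small and why the decoder has to exist somewhere outside the weights.

```python
import struct


def embed(weights, payload):
    """Hide one payload byte in the low mantissa byte of each float32 weight."""
    if len(payload) > len(weights):
        raise ValueError("payload needs one weight per byte")
    out = []
    for i, w in enumerate(weights):
        # Little-endian float32: raw[0] holds the 8 lowest mantissa bits.
        raw = bytearray(struct.pack("<f", w))
        if i < len(payload):
            raw[0] = payload[i]
        out.append(struct.unpack("<f", bytes(raw))[0])
    return out


def extract(weights, n_bytes):
    """Separate decoder: read the low mantissa byte back out of each weight."""
    return bytes(struct.pack("<f", w)[0] for w in weights[:n_bytes])


# Toy weights (illustrative values, not taken from any real model).
weights = [0.1234, -0.5678, 0.9012, 0.3456]
stego = embed(weights, b"hi")
recovered = extract(stego, 2)

# Largest relative change introduced by the embedding.
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(weights, stego))
```

Because only the lowest 8 of 23 mantissa bits change, each weight moves by less than 2^-15 of its magnitude. Nothing in the tensor itself runs `extract`, which is the deferred-decoder limitation noted above.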
Origin paper for trigger-based co-trained backdoors. The decoder and the channel are baked into the network's weights together, which is why detection has to be behavioral rather than static.
Poisoned samples form isolated latent-space clusters early in training because triggers dominate their feature representation. CSC exploits this: cluster, segregate, relabel to a virtual class, fine-tune. Near-zero ASR across four datasets and twelve attack variants with minimal clean accuracy loss. The trigger design that makes poisoned samples distinctive to the model also makes them distinctive to an auditor.
DLR-Lock replaces each pretrained MLP with a deep low-rank residual network that increases activation memory during fine-tuning and creates architectural mismatches that frustrate standard optimization. Tested against adaptive attackers with full knowledge of the defense. Framing the goal as locking rather than watermarking shifts the objective from post-hoc detection to resistance, which is a different threat model.
Trellis-coded quantization encodes model weights using error-correcting code structure; the Viterbi decoding step is non-differentiable, which breaks gradient flow during QAT. BCJR-QAT replaces Viterbi with the forward-backward sum-product algorithm to restore end-to-end differentiability. Trellis-coded weight encoding is the quantization scheme most directly analogous to steganographic embedding; this advances its training-side infrastructure.
Replaces random combination of crowdsourced jailbreak ingredients with a contextual-bandit learner that scores combinations based on prior success. Roughly 2,200-parameter bandit on top of SBERT embeddings. Transfers across models without retraining.
Distributes adversarial intent across stateless turns, evading moderation that evaluates each turn independently. Notable for showing that the threat model "single-turn safety classifier" is incomplete against an attacker LLM operating across sessions.
Poisoning 0.1% of pre-training data is enough for three of four backdoor objectives (DoS, belief manipulation, jailbreaking) to survive post-training. DoS persists at 0.001%. The supply-chain layer the threat model has to start at.
28 jailbreaks across five benchmarks, Claude models from Haiku 4.5 to Opus 4.6. The capability tax scales inversely with model strength: Haiku 4.5 loses 33.1% average capability under jailbreaks, Opus 4.6 loses 7.7%. The most sophisticated attacks approach zero degradation. Safety assessments that rely on capability loss as a self-limiting mechanism are working from a false premise.
Four attack methods targeting the visual processing path of VLMs: encoding harmful instructions as visual symbols with a legend, object substitution, text replacement in images, and visual puzzles. A visual cipher achieved 40.9% ASR on Claude-Haiku-4.5 vs. 10.7% for the text equivalent. The gap quantifies how little text-focused safety training transfers to vision.
DR-Smoothing adds a two-stage prompt processing scheme: disrupt the input, then rectify it back toward the distribution of normal prompts before passing it to the model. Prior disrupt-only methods left the model seeing out-of-distribution prompts. The rectification step removes that problem while preserving disruption as the defense layer. Includes theoretical bounds on defense success probability.
Rather than training a standalone jailbreak classifier, this uses embedding disruption to re-activate the model's own internal safeguards for detection. The defense reuses what the model already knows. Effective against adaptive attacks in evaluation. Same authors as DR-Smoothing: this paper covers detection, while DR-Smoothing covers the defense-side response.
Adds cross-attention in the final eighth of transformer layers to give system prompts a structurally distinct processing path rather than treating them identically to user input. At 8B scale: +7.4% on IFEval, +16.3% on multi-turn instruction adherence, -13% many-shot jailbreaking ASR. That the approach is architectural rather than tuning-based is what makes the gains scale-consistent.
Recasts jailbreaking as inference-time policy optimization in an adversarial decision process. A self-evolving metacognitive loop diagnoses the target's defense logic and refines the attack trajectory through structured feedback. 89.2% average ASR across 10 models including 76.0% on O1 and 78.0% on GPT-5-chat, at 8.2x lower compute than prior methods.
819 test cases for evaluating the safety of LLM agents running in real operating systems. Identifies Execution Hallucination: agents verbally refuse a request while the harmful OS-level action completes undetected. Running in real OS environments rather than simulation is what exposes this gap between stated refusal and actual execution.
Conversational interfaces that translate natural language to SQL queries inherit SQL injection as a threat class. The proposed defense stacks three layers: front-end prompt sanitization, a behavioral/semantic anomaly detector, and a signature layer for known patterns. The LLM translation layer that creates the vulnerability is also what makes the sanitization non-trivial.
Six-class pitfall taxonomy (P1–P6) split into statically-checkable (Tier-1) and trace/dataflow-dependent (Tier-2) classes. Three workflow challenges (email, document, crypto) with hardened-vs-baseline server pairs and three attack families: tool-metadata poisoning, puppet servers, image-to-tool chains.
First end-to-end empirical evaluation of attacks against MCP. Four attack categories: tool poisoning, puppet attacks, rug pull, and exploitation via malicious external resources. Useful as the lay-of-the-land paper before any MCP-specific work.
Concrete demonstration of cross-server data exfiltration in MCP. The barrier-to-entry argument matters: this is not a sophisticated attack class, which is the point.
ML-centered threat modeling applied to an agentic browser. Four prompt-injection techniques against the AI assistant, all chained to exfiltrate Gmail data. The methodology, TRAIL, is more transferable than any individual finding.
8,861 repositories, 100,011 tools from MCP and Skills platforms. 60% of high-Jaccard candidates and 85% of high-ssdeep candidates are manually verified clones. Vulnerable code in a cloned tool propagates to all downstream repositories automatically. Benchmark splits built from "diverse" tool datasets may be evaluating heavily duplicated code. A supply-chain amplification pattern that pre-dates LLMs.
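A minimal sketch of the Jaccard side of clone screening, assuming tools are compared as sets of identifier-like tokens. The tokenizer and the example snippets are illustrative assumptions, not the study's pipeline, and the ssdeep fuzzy-hash comparison is not shown.

```python
import re


def tokens(source):
    """Split source text into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", source))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty inputs are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical tool bodies: a lightly renamed copy and an unrelated snippet.
original = "def fetch(url): return requests.get(url).text"
renamed = "def fetch_page(url): return requests.get(url).text"
unrelated = "class Parser: pass"

clone_score = jaccard(original, renamed)
other_score = jaccard(original, unrelated)
```

A high score only nominates a candidate pair. As the verification rates above show, a candidate still needs manual review before it is counted as a clone.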
Differs from prompt injection by targeting the knowledge graph data agents reason over rather than their instructions. Six attack scenarios against a production knowledge graph with 42M nodes across nine models from three providers. All models accepted fabricated security claims 100% of the time under directed queries. GPT-5.1 showed 0% trust in inline evaluation but 100% under actual tool use, which is the key finding: the delivery channel changes the attack surface.
ASPO integrates LLM reasoning with deterministic enforcement inside a MAPE-K control loop. LLM agents propose mitigations; an optimization engine ensures proposals are conflict-free and resource-feasible before acting. 100% conflict-free activation on a 500-1000 decision testbed. The separation between reasoning and enforcement is the design principle worth generalizing.
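The reasoning/enforcement split can be sketched as a deterministic gate that sits between proposals and activation. This is a simplified greedy filter under assumed inputs, not ASPO's optimization engine: the `Proposal` fields, the priority ordering, and the single scalar budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    name: str
    cost: int                           # resource units the mitigation needs
    conflicts: frozenset = frozenset()  # names it cannot run alongside
    priority: int = 0                   # higher is considered first


def enforce(proposals, budget):
    """Deterministically pick a conflict-free, within-budget subset."""
    accepted, used = [], 0
    # Sorting by (priority, name) keeps the outcome independent of input order.
    for p in sorted(proposals, key=lambda p: (-p.priority, p.name)):
        clashes = any(
            p.name in a.conflicts or a.name in p.conflicts for a in accepted
        )
        if clashes or used + p.cost > budget:
            continue
        accepted.append(p)
        used += p.cost
    return [p.name for p in accepted]


# Hypothetical proposals, standing in for what an LLM agent might suggest.
proposed = [
    Proposal("restart_service", cost=1, priority=2),
    Proposal("isolate_host", cost=3, priority=3,
             conflicts=frozenset({"restart_service"})),
    Proposal("rate_limit", cost=2, priority=1),
    Proposal("block_subnet", cost=4, priority=0),
]
activated = enforce(proposed, budget=5)
```

Whatever the proposer suggests, only the subset that passes the gate is activated, so the conflict and resource guarantees do not depend on the model's output being well-behaved.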
Black-box drift detection using cosine similarity between user prompts and behavioral anchor texts, aggregated by weighted top-k mean over BGE-m3 embeddings. ROC AUC 0.83 on real session traces. Available as a Claude Code plugin and MCP server with Merkle-chained audit logging. The ~30-point gap below white-box methods is the explicit cost of not touching model weights.
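A minimal sketch of the scoring step, assuming embeddings are already computed. The geometric rank weights (`decay`), the value of `k`, and the two-dimensional toy vectors are assumptions for illustration. The actual weighting scheme is not described above, and real BGE-m3 embeddings are much higher-dimensional.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_topk_mean(scores, k, decay=0.5):
    """Average the k highest scores, weighting rank r by decay**r."""
    top = sorted(scores, reverse=True)[:k]
    if not top:
        return 0.0
    weights = [decay ** r for r in range(len(top))]
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)


def anchor_score(prompt_vec, anchor_vecs, k=2):
    """Aggregate a prompt's similarity to a set of behavioral anchors."""
    return weighted_topk_mean([cosine(prompt_vec, a) for a in anchor_vecs], k)


# Toy stand-ins for anchor and prompt embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near = anchor_score([1.0, 0.0], anchors)   # prompt close to the anchors
far = anchor_score([-1.0, 0.0], anchors)   # prompt pointing away from them
```

How the aggregate is thresholded into a drift decision is left open here. The point is that the whole pipeline needs only text embeddings, which is what makes it black-box.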
Machine unlearning without modifying model parameters: a prompt generator trained via reinforcement learning collaborates with the LLM to suppress target knowledge while preserving general capabilities. Works on closed-source models. Reversal is possible by revoking the prompt rather than retraining, which matters for unlearning as a compliance tool when legal hold periods end.
Two CVEs (CVE-2026-25905, CVE-2026-25904) in a popular MCP server template. The class of bug is a useful pattern: trusting that a Deno sandbox plus a containerized Python runner will hold under MCP-style invocation.
Two remotely-exploitable memory-corruption bugs (CVE-2025-23310, CVE-2025-23311) in Triton's HTTP request handling, surfaced via static analysis plus chunked-encoding probing. The reminder: production inference servers are still C/C++ network services with all the attendant historical bug classes, and authentication is off by default.
A multi-stage vulnerability chain in the Triton Python backend, starting from a minor information leak about shared-memory region names and escalating to unauthenticated RCE. Useful as a case study in chaining low-severity primitives into a takeover.
Reference for CVE-2024-34359 (the chat-template Jinja RCE in llama-cpp-python) and the broader question of when loading a GGUF model can lead to server-side template injection. The case study for why loader extensions need the same threat-modeling rigor as the loader itself.
GateScope audits third-party LLM API gateways across response content, multi-turn quality, billing accuracy, and latency. Measurement of 10 commercial gateways found undisclosed model substitutions, degraded conversation memory, pricing deviations, and inconsistent latency. The billing-accuracy finding has the clearest actionability: gateways charging for model calls that differ from what was advertised.
Three-layer architecture: static base model, composable domain-expert LoRA adapters, and removable per-user proxy artifacts. Removing a user's artifacts returns outputs to baseline and prevents cross-user leakage, tested on Phi-3.5-mini and Llama-3.1-8B. Reframes machine unlearning as deterministic deletion of a separable artifact rather than expensive parameter updates.
If-then rules linking source presence and absence to RAG output behavior, with Apriori-like pruning to avoid brute-force source-combination search. The provenance angle extends beyond explainability: if you can determine which source combinations produce which outputs, you can identify which sources a poisoning attack needs to control.
Malicious documents crafted for the same attack-targeted question exhibit high semantic similarity to each other. CleanBase builds a similarity graph and detects cliques, with a statistically-determined threshold and theoretical error-rate bounds. A working CleanBase deployment reduces the poisoned corpus before retrieval, making retrieval rate a function of detection coverage rather than a fixed attack property.
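The graph-and-clique idea can be sketched on toy vectors as follows. The fixed `threshold`, the minimum clique size, and the example embeddings are assumptions. CleanBase derives its threshold statistically, which is not reproduced here, and the enumeration below is plain Bron-Kerbosch without pivoting, suitable only for small graphs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(vectors, threshold):
    """Connect every pair of documents whose similarity meets the threshold."""
    adj = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy embeddings: documents 0-2 are near-duplicates, 3 and 4 are unrelated.
docs = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.3]]
graph = similarity_graph(docs, threshold=0.95)
suspicious = [c for c in maximal_cliques(graph) if len(c) >= 3]
```

Documents in a flagged clique would be removed or quarantined before indexing, which is how detection coverage ends up bounding the attacker's retrieval rate.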
Modeling Okta in BloodHound Enterprise alongside AD, Entra, and GitHub. The argument: identity boundaries between platforms are where attack paths actually live, and treating any single platform in isolation underrepresents real risk.
Long-form correction of decades of incorrect documentation around AD's AdminSDHolder mechanism. The kind of historical-grounding piece that's useful before doing anything privileged-account-related on AD engagements.
The AD CS paper that opened up the modern wave of ADCS work. Still the cleanest framing of what a tooling-up problem looks like before any tools exist.
Infrastructure-side measurements of AI adoption: 81% of cloud environments use managed AI services, 90% run self-hosted, 80% have MCP servers. The framing, AI as accumulated rather than adopted, is a useful governance lens.
An authenticated git push achieves RCE on GitHub's backend through a delimiter-based internal protocol. Notable also as one of the first critical vulnerabilities the team credits to AI-augmented reverse engineering.
Tree-sitter plus rustworkx, packaged as Claude Code skills for blast-radius and taint-propagation analysis. Useful as a reference for how graph reasoning composes with LLM agents in a security workflow.
Monthly digest
Monthly digest pending. First issue when there's something worth saying.