Signal

A reading list and a monthly digest.

Curated work shaping how I think about offensive security, ML supply chain, and the seams between them. The reading list is evergreen; the digest goes out monthly when there's something worth saying.

Reading list

Curated · 41 entries · 6 categories
ML supply chain & steganography
Provably Secure Steganography Based on List Decoding
Pang & Bai (Tsinghua University)

Pushes provably-secure linguistic steganography toward higher embedding capacity by maintaining a list of candidate decodings rather than a single one. Directly relevant to the entropy-budget question in any LLM-mediated covert-channel design.

arxiv.org · April 2026
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers (BadStyle)
Wei et al.

Style-level (not token- or syntax-level) backdoor triggers, generated by an LLM as a poisoned-sample synthesizer. Adds an auxiliary target loss to stabilize payload injection during fine-tuning. Evaluated against seven model families.

arxiv.org · April 2026
EvilModel: Hiding Malware Inside of Neural Network Models
Wang, Liu, Cui

The canonical reference for byte-level steganography in float32 weight tensors. Explicitly defers the decoder to a separately-deployed loader, which is the substantive limitation when read against a channel/decoder/substrate framework.

arxiv.org · 2021
BadNets: Identifying Vulnerabilities in the ML Supply Chain
Gu, Dolan-Gavitt, Garg

Origin paper for trigger-based co-trained backdoors. The decoder and the channel are baked into the network's weights together, which is why detection has to be behavioral rather than static.

arxiv.org · 2017
CSC: Turning the Adversary's Poison against Itself
Shi, Guo, Chen, Zhu, Liu & Zhou

Poisoned samples form isolated latent-space clusters early in training because triggers dominate their feature representation. CSC exploits this: cluster, segregate, relabel to a virtual class, fine-tune. Near-zero ASR across four datasets and twelve attack variants with minimal clean accuracy loss. The trigger design that makes poisoned samples distinctive to the model also makes them distinctive to an auditor.

arxiv.org · April 2026
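
The cluster, segregate, relabel steps in the CSC note above can be sketched on toy data. This is a minimal illustration, not the paper's implementation: the 2-means routine, the "smaller cluster is suspect" rule, and the names `two_means` and `segregate_and_relabel` are all assumptions made for the example, and the final fine-tuning step is left out.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def two_means(points, iters=20):
    """Tiny deterministic 2-means; returns a 0/1 cluster id per point."""
    a = tuple(points[0])
    # Seed the second cluster with the point farthest from the first seed.
    b = tuple(max(points, key=lambda p: dist(p, a)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [0 if dist(p, a) <= dist(p, b) else 1 for p in points]
        group_a = [p for p, g in zip(points, assign) if g == 0]
        group_b = [p for p, g in zip(points, assign) if g == 1]
        if not group_a or not group_b:
            break  # degenerate data: everything fell into one cluster
        new_a, new_b = centroid(group_a), centroid(group_b)
        if (new_a, new_b) == (a, b):
            break  # converged
        a, b = new_a, new_b
    return assign


def segregate_and_relabel(features, labels, virtual_class):
    """Relabel the smaller cluster to `virtual_class` (hypothetical heuristic)."""
    assign = two_means(features)
    if len(set(assign)) < 2:
        return list(labels)  # nothing to segregate
    suspect = 1 if assign.count(1) <= assign.count(0) else 0
    return [virtual_class if g == suspect else y for g, y in zip(assign, labels)]


# Toy latent features for one class: five clean samples, two trigger-dominated.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.0, 0.0),
            (9.8, 10.1), (10.2, 9.9)]
labels = [1] * 7
relabeled = segregate_and_relabel(features, labels, virtual_class=7)
print(relabeled)  # [1, 1, 1, 1, 1, 7, 7]
```

On clean data this sketch would still split the class in two, so a real pipeline needs some test that the small cluster is genuinely isolated before relabeling it.
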
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
Sakamoto, Ablin, Danieli & Cuturi

DLR-Lock replaces each pretrained MLP with a deep low-rank residual network that increases activation memory during fine-tuning and creates architectural mismatches that frustrate standard optimization. Tested against adaptive attackers with full knowledge of the defense. Framing the goal as locking rather than watermarking shifts from post-hoc detection to resistance, which is a different threat model.

arxiv.org · May 2026
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
Venugopalan Iyengar

Trellis-coded quantization encodes model weights using error-correcting code structure; the Viterbi decoding step is non-differentiable, which breaks gradient flow during QAT. BCJR-QAT replaces Viterbi with the forward-backward sum-product algorithm to restore end-to-end differentiability. Trellis-coded weight encoding is the quantization scheme most directly analogous to steganographic embedding; this advances its training-side infrastructure.

arxiv.org · May 2026
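
The differentiability point in the BCJR-QAT note can be illustrated with the generic relaxation behind sum-product decoding: swap the hard `min` in the Viterbi recursion for a smooth log-sum-exp. The sketch below is only that generic idea on a two-state toy trellis with made-up costs. It runs the forward pass alone, works on path costs rather than posteriors, and is not the paper's algorithm; `softmin`, `trellis_cost`, and the temperature `tau` are names assumed for the example.

```python
from math import exp, log


def softmin(values, tau):
    """Smooth lower bound on min(values); approaches min as tau -> 0."""
    m = min(values)
    return m - tau * log(sum(exp(-(v - m) / tau) for v in values))


def trellis_cost(emissions, transitions, reduce):
    """Forward recursion over a trellis; `reduce` merges incoming path costs.

    emissions[t][s]     : cost of being in state s at step t
    transitions[s][s2]  : cost of moving from state s to state s2
    """
    n_states = len(emissions[0])
    cost = list(emissions[0])
    for step in emissions[1:]:
        cost = [
            reduce([cost[s] + transitions[s][s2] for s in range(n_states)]) + step[s2]
            for s2 in range(n_states)
        ]
    return reduce(cost)


emissions = [[0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
transitions = [[0.0, 1.0], [1.0, 0.0]]

hard = trellis_cost(emissions, transitions, min)  # Viterbi-style hard minimum
smooth = trellis_cost(emissions, transitions, lambda v: softmin(v, 1.0))
sharp = trellis_cost(emissions, transitions, lambda v: softmin(v, 0.01))
print(hard, smooth, sharp)
```

With the hard `min`, only the winning path receives gradient. The smooth version spreads gradient over every path, and lowering `tau` moves it back toward the hard result.
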
LLM red-teaming & jailbreaks
Adaptive Instruction Composition for Automated LLM Red-Teaming
Zymet et al. (Capital One AI Foundations)

Replaces random combination of crowdsourced jailbreak ingredients with a contextual-bandit learner that scores combinations based on prior success. Roughly 2,200-parameter bandit on top of SBERT embeddings. Transfers across models without retraining.

arxiv.org · April 2026
Transient Turn Injection: Stateless Multi-Turn Vulnerabilities
Rayhan & Jahan

Distributes adversarial intent across stateless turns, evading moderation that evaluates each turn independently. Notable for showing that the threat model "single-turn safety classifier" is incomplete against an attacker LLM operating across sessions.

arxiv.org · April 2026
Persistent Pre-Training Poisoning of LLMs
Zhang, Rando, Evtimov, Carlini, Tramèr et al. (Meta · ETH Zürich · CMU · Google DeepMind)

Poisoning 0.1% of pre-training data is enough for three of the four backdoor objectives studied (DoS, belief manipulation, and prompt stealing, but not jailbreaking) to survive post-training. DoS persists at 0.001%. This is the supply-chain layer where the threat model has to start.

arxiv.org · 2024
Jailbroken Frontier Models Retain Their Capabilities
Zhu, Wang, Bao & Wei

28 jailbreaks across five benchmarks, Claude models from Haiku 4.5 to Opus 4.6. The capability tax scales inversely with model strength: Haiku 4.5 loses 33.1% average capability under jailbreaks, Opus 4.6 loses 7.7%. The most sophisticated attacks approach zero degradation. Safety assessments that rely on capability loss as a self-limiting mechanism are working from a false premise.

arxiv.org · May 2026
Jailbreaking Vision-Language Models Through the Visual Modality
Azulay, Dubiński, Li, Mittal & Gandelsman

Four attack methods targeting the visual processing path of VLMs: encoding harmful instructions as visual symbols with a legend, object substitution, text replacement in images, and visual puzzles. A visual cipher achieved 40.9% ASR on Claude-Haiku-4.5 vs. 10.7% for the text equivalent. The gap quantifies how little text-focused safety training transfers to vision.

arxiv.org · May 2026
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
Lin, Niu, Ji & Gao

DR-Smoothing adds a two-stage prompt processing scheme: disrupt the input, then rectify it back toward the distribution of normal prompts before passing it to the model. Prior disrupt-only methods left the model seeing out-of-distribution prompts. The rectification step removes that problem while preserving disruption as the defense layer. Includes theoretical bounds on defense success probability.

arxiv.org · May 2026
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Lin, Niu, Ji, Huang & Gao

Rather than training a standalone jailbreak classifier, this uses embedding disruption to re-activate the model's own internal safeguards for detection. The defense reuses what the model already knows. Effective against adaptive attacks in evaluation. Same authors as DR-Smoothing; this paper covers detection, DR-Smoothing covers the defense-side response.

arxiv.org · May 2026
CALYREX: Cross-Attention Layer Extended Transformers for System Prompt Anchoring
Li Lixing

Adds cross-attention in the final eighth of the transformer layers to give system prompts a structurally distinct processing path rather than treating them identically to user input. At 8B scale: +7.4% on IFEval, +16.3% on multi-turn instruction adherence, -13% many-shot jailbreaking ASR. Because the change is architectural rather than a tuning recipe, the gains are scale-consistent.

arxiv.org · May 2026
Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization
Zhou, Zhao, Zhong, Liang, Chen et al.

Recasts jailbreaking as inference-time policy optimization in an adversarial decision process. A self-evolving metacognitive loop diagnoses the target's defense logic and refines the attack trajectory through structured feedback. 89.2% average ASR across 10 models including 76.0% on O1 and 78.0% on GPT-5-chat, at 8.2x lower compute than prior methods.

arxiv.org · May 2026
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
Zhang, Yang, Jiang, Zhang, Zhao et al.

819 test cases for evaluating safety in LLM agents operating in actual operating systems. Identifies Execution Hallucination: agents verbally refuse a request while the harmful OS-level action completes undetected. Running in real OS environments rather than simulation is what exposes this gap between stated refusal and actual execution.

arxiv.org · May 2026
When Prompts Become Payloads: SQL Injection via LLM-Driven Natural Language Interfaces
Motlagh, Hajizadeh, Majd, Najafi, Cheng & Meinel

Conversational interfaces that translate natural language to SQL queries inherit SQL injection as a threat class. The proposed defense stacks three layers: front-end prompt sanitization, a behavioral/semantic anomaly detector, and a signature layer for known patterns. The LLM translation layer that creates the vulnerability is also what makes the sanitization non-trivial.

arxiv.org · May 2026
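
As a rough sketch of the third layer in the note above (signatures for known patterns), the check below screens SQL text produced by the translation step before it reaches the database. The pattern set, the names `SIGNATURES` and `signature_hits`, and the SELECT-only rule are assumptions for illustration, not the paper's rule set.

```python
import re

# Hypothetical signatures for well-known injection shapes.
SIGNATURES = {
    "stacked statement": re.compile(r";\s*\S"),
    "comment sequence": re.compile(r"--|/\*"),
    "tautology": re.compile(r"\bor\b\s+(\d+)\s*=\s*\1\b", re.I),
    "union select": re.compile(r"\bunion\b\s+(all\s+)?select\b", re.I),
}


def signature_hits(sql):
    """Return the names of every signature the generated SQL trips."""
    hits = [name for name, pattern in SIGNATURES.items() if pattern.search(sql)]
    if not re.match(r"\s*select\b", sql, re.I):
        hits.append("not a SELECT")  # assumed policy: read-only queries only
    return hits


print(signature_hits("SELECT name FROM users WHERE id = 7"))
# []
print(signature_hits("SELECT name FROM users WHERE id = 7 OR 1=1"))
# ['tautology']
print(signature_hits("SELECT 1; DROP TABLE users"))
# ['stacked statement']
```

Signatures only catch shapes someone has already written down, which is presumably why the paper places this layer beneath prompt sanitization and anomaly detection rather than relying on it alone.
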
MCP & agentic security
MCP Pitfall Lab: Developer Pitfalls in MCP Tool Server Security
Hao & Tan

Six-class pitfall taxonomy (P1–P6) split into statically-checkable (Tier-1) and trace/dataflow-dependent (Tier-2) classes. Three workflow challenges (email, document, crypto) with hardened-vs-baseline server pairs and three attack families: tool-metadata poisoning, puppet servers, image-to-tool chains.

arxiv.org · April 2026
Beyond the Protocol: Attack Vectors in the MCP Ecosystem
Song et al.

First end-to-end empirical evaluation of attacks against MCP. Four attack categories: tool poisoning, puppet attacks, rug pull, and exploitation via malicious external resources. Useful as the lay-of-the-land paper before any MCP-specific work.

arxiv.org · 2025
Trivial Trojans: Cross-Tool Exfiltration via Minimal MCP Servers
Croce & South

Concrete demonstration of cross-server data exfiltration in MCP. The barrier-to-entry argument matters: this is not a sophisticated attack class, which is the point.

arxiv.org · July 2025
Threat modeling and prompt injection in Comet
Trail of Bits

ML-centered threat modeling applied to an agentic browser. Four prompt-injection techniques against the AI assistant, all chained to exfiltrate Gmail data. The methodology (TRAIL) is more transferable than any individual finding.

blog.trailofbits.com · February 2026
Evaluating Tool Cloning in Agentic-AI Ecosystems
Kim, Jiang, Hu, Jia & Gong

8,861 repositories, 100,011 tools from MCP and Skills platforms. 60% of high-Jaccard candidates and 85% of high-ssdeep candidates are manually verified clones. Vulnerable code in a cloned tool propagates to all downstream repositories automatically. Benchmark splits built from "diverse" tool datasets may be evaluating heavily duplicated code. A supply-chain amplification pattern that pre-dates LLMs.

arxiv.org · May 2026
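
The Jaccard side of the clone screening described above can be sketched over token shingles. The tokenizer, the shingle width `k`, and the function names are assumptions for the example; the paper's exact preprocessing is not given here, and the ssdeep comparison is not reproduced.

```python
import re


def shingles(source, k=3):
    """Set of k-token windows over a crude, whitespace-insensitive tokenization."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "def add(x, y):\n    return x + y"
reformatted = "def add(x,y): return x+y"
unrelated = "import os\nprint(os.getcwd())"

print(jaccard(shingles(original), shingles(reformatted)))  # 1.0
print(jaccard(shingles(original), shingles(unrelated)))    # 0.0
```

A high score is only a candidate signal. The 60% figure in the note is the share of high-Jaccard candidates that survived manual verification, so any threshold needs a human pass behind it.
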
Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning
Kereopa-Yorke, Diaz, Wright, Johnston, Del Rosario & Lynar

Differs from prompt injection by targeting the knowledge graph data agents reason over rather than their instructions. Six attack scenarios against a production knowledge graph with 42M nodes across nine models from three providers. All models accepted fabricated security claims at 100% under directed queries. GPT-5.1 showed 0% trust in inline evaluation but 100% under actual tool-use, which is the key finding: the delivery channel changes the attack surface.

arxiv.org · May 2026
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
Jamshidi, Khomh, Fung & Nafi

ASPO integrates LLM reasoning with deterministic enforcement inside a MAPE-K control loop. LLM agents propose mitigations; an optimization engine ensures proposals are conflict-free and resource-feasible before acting. 100% conflict-free activation on a testbed of 500 to 1,000 decisions. The separation between reasoning and enforcement is the design principle worth generalizing.

arxiv.org · May 2026
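
The reasoning/enforcement split in the ASPO note can be sketched as a deterministic gate sitting behind whatever proposes the mitigations. The greedy selection below stands in for the paper's optimization engine, which is not described in enough detail here to reproduce; `Proposal`, `enforce`, and the cost and priority fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """A mitigation suggested by the reasoning layer (fields are assumptions)."""
    name: str
    cost: int
    priority: int
    conflicts: frozenset = frozenset()


def enforce(proposals, budget):
    """Accept proposals by priority, skipping conflicts and budget overruns."""
    accepted, used = [], 0
    for p in sorted(proposals, key=lambda p: (-p.priority, p.name)):
        clash = any(p.name in a.conflicts or a.name in p.conflicts for a in accepted)
        if clash or used + p.cost > budget:
            continue  # the enforcement layer, not the LLM, has the final say
        accepted.append(p)
        used += p.cost
    return [p.name for p in accepted]


proposals = [
    Proposal("rate-limit", cost=3, priority=5, conflicts=frozenset({"open-relay"})),
    Proposal("open-relay", cost=1, priority=4),
    Proposal("deep-inspect", cost=4, priority=3),
    Proposal("audit-log", cost=2, priority=1),
]
print(enforce(proposals, budget=6))  # ['rate-limit', 'audit-log']
```

Whatever the proposer outputs, the activated set is conflict-free and within budget by construction, which is the property the note singles out.
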
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Chunxiao Wang

Black-box drift detection using cosine similarity between user prompts and behavioral anchor texts, aggregated by weighted top-k mean over BGE-m3 embeddings. ROC AUC 0.83 on real session traces. Available as a Claude Code plugin and MCP server with Merkle-chained audit logging. The ~30-point gap below white-box methods is the explicit cost of not touching model weights.

arxiv.org · May 2026
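
The scoring step in the Nautilus Compass note (cosine similarity against anchor texts, aggregated by a weighted top-k mean) can be sketched on precomputed vectors. The embeddings are assumed to come from elsewhere, no BGE-m3 call is made, and the `1 / (rank + 1)` weighting and the function names are assumptions rather than the tool's actual scheme.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_topk_mean(scores, k=3):
    """Weighted mean of the k highest scores, with weight 1/(rank+1)."""
    if not scores:
        raise ValueError("need at least one score")
    top = sorted(scores, reverse=True)[:k]
    weights = [1.0 / (rank + 1) for rank in range(len(top))]
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)


def anchor_score(prompt_vec, anchor_vecs, k=3):
    """Aggregate similarity of one prompt embedding to a set of anchor embeddings."""
    return weighted_topk_mean([cosine(prompt_vec, a) for a in anchor_vecs], k)


score = anchor_score((1.0, 0.0), [(1.0, 0.0), (0.0, 1.0)], k=2)
print(round(score, 4))  # 0.6667
```
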
CAP: Controllable Alignment Prompting for Unlearning in LLMs
Wang, Guo, Pu, Pu, Yang et al.

Machine unlearning without modifying model parameters: a prompt generator trained via reinforcement learning collaborates with the LLM to suppress target knowledge while preserving general capabilities. Works on closed-source models. Reversal is possible by revoking the prompt rather than retraining, which matters for unlearning as a compliance tool when legal hold periods end.

arxiv.org · April 2026
Inference infrastructure & ML platforms
mcp-run-python: lack of isolation, MCP takeover, Deno SSRF
Natan Nehorai (JFrog)

Two CVEs (CVE-2026-25905, CVE-2026-25904) in a popular MCP server template. The class of bug is a useful pattern: trusting that a Deno sandbox plus a containerized Python runner will hold under MCP-style invocation.

research.jfrog.com · February 2026
Uncovering memory corruption in NVIDIA Triton (as a new hire)
Will Vandevanter (Trail of Bits)

Two remotely-exploitable memory-corruption bugs (CVE-2025-23310, CVE-2025-23311) in Triton's HTTP request handling, surfaced via static analysis plus chunked-encoding probing. The reminder: production inference servers are still C/C++ network services with all the attendant historical bug classes, and authentication is off by default.

blog.trailofbits.com · August 2025
Breaking NVIDIA Triton: CVE-2025-23319 vulnerability chain to RCE
Wiz Research

A multi-stage vulnerability chain in the Triton Python backend, starting from a minor information leak about shared-memory region names and escalating to unauthenticated RCE. Useful as a case study in chaining low-severity primitives into a takeover.

wiz.io · August 2025
GGUF-SSTI: Llama-Drama and the Jinja template attack surface
JFrog Security Research

Reference for CVE-2024-34359 (the chat-template Jinja RCE in llama-cpp-python) and the broader question of when loading a GGUF model can lead to server-side template injection. The case study for why loader extensions need the same threat-modeling rigor as the loader itself.

research.jfrog.com · 2024
Behavioral Consistency and Transparency Analysis on LLM API Gateways
Lin, Wan, Pei, Xu, Xu & Xue

GateScope audits third-party LLM API gateways across response content, multi-turn quality, billing accuracy, and latency. Measurement of 10 commercial gateways found undisclosed model substitutions, degraded conversation memory, pricing deviations, and inconsistent latency. The billing-accuracy finding is the most actionable: gateways charge for model calls that differ from what was advertised.

arxiv.org · April 2026
Separable Expert Architecture: Privacy-Preserving LLM Personalization via Composable Adapters
Schneider, Schoenegger & Bariach

Three-layer architecture: static base model, composable domain-expert LoRA adapters, and removable per-user proxy artifacts. Removing a user's artifacts returns outputs to baseline and prevents cross-user leakage, tested on Phi-3.5-mini and Llama-3.1-8B. Reframes machine unlearning as deterministic deletion of a separable artifact rather than expensive parameter updates.

arxiv.org · April 2026
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems
Rorseth, Godfrey, Golab, Srivastava & Szlichta

If-then rules linking source presence and absence to RAG output behavior, with Apriori-like pruning to avoid brute-force source-combination search. The provenance angle extends beyond explainability: if you can determine which source combinations produce which outputs, you can identify which sources a poisoning attack needs to control.

arxiv.org · October 2025
CleanBase: Detecting Malicious Documents in RAG Knowledge Databases
Jin, Wang, Zou, Jia & Gong

Malicious documents crafted for the same attack-targeted question exhibit high semantic similarity to each other. CleanBase builds a similarity graph and detects cliques, with a statistically-determined threshold and theoretical error-rate bounds. A working CleanBase deployment reduces the poisoned corpus before retrieval, making retrieval rate a function of detection coverage rather than a fixed attack property.

arxiv.org · May 2026
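
The similarity-graph and clique step in the CleanBase note can be sketched as follows. The sketch takes a precomputed pairwise similarity matrix, uses a fixed `threshold` where the paper determines one statistically, and enumerates maximal cliques with plain Bron-Kerbosch; the function names and the `min_size` cutoff are assumptions.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Collect all maximal cliques of the graph `adj` into `out`."""
    if not p and not x:
        out.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


def suspicious_cliques(sim, threshold=0.9, min_size=3):
    """Indices of document groups whose members are all mutually similar."""
    n = len(sim)
    adj = {
        i: {j for j in range(n) if j != i and sim[i][j] >= threshold}
        for i in range(n)
    }
    cliques = []
    bron_kerbosch(set(), set(range(n)), set(), adj, cliques)
    return sorted(sorted(c) for c in cliques if len(c) >= min_size)


# Toy corpus: documents 0-2 are near-duplicates of each other, 3 and 4 are not.
sim = [
    [1.0 if i == j else (0.95 if i < 3 and j < 3 else 0.1) for j in range(5)]
    for i in range(5)
]
print(suspicious_cliques(sim))  # [[0, 1, 2]]
```

Bron-Kerbosch is exponential in the worst case, so a corpus-scale deployment would need pruning or an approximate method; this sketch only shows the shape of the detection step.
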

Monthly digest

First of the month

Monthly digest pending. First issue when there's something worth saying.