m00dy.sh / Signal

A reading list and a monthly digest.

Curated work shaping how I think about offensive security, ML supply chain, and the seams between them. The reading list is evergreen; the digest goes out monthly when there's something worth saying.

Reading list

53 entries · 6 shelves

ML supply chain & steganography

ShadowPickle: Evading Machine Learning Model Scanners via Stealthy Pickle Deserialization Attacks

Pradhan, Nambiar & Soremekun

Three pickle deserialization attacks that abuse the Pickle VM external module import mechanism to execute payloads during load. The Overwritten variant evades 63 percent of scanners, up to 50 percent better than prior attacks, across ten state-of-the-art scanners and four model hosting platforms. Read this against any argument that scanning solves pickle: the format keeps the execution path, so the scanner is left guessing at intent.

arxiv.org · July 2026
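
A defender-side sketch of the point made in the ShadowPickle entry above: Python's standard `pickletools` can list the opcodes that import or call objects during load without executing anything, but the opcode stream alone says nothing about intent. The helper name `import_opcodes` and the opcode shortlist are assumptions for illustration, not the logic of any scanner evaluated in the paper.

```python
import datetime
import pickle
import pickletools

# Opcodes that import a callable or invoke one while unpickling.
# This shortlist is an illustrative assumption, not a complete policy.
SUSPECT = {"GLOBAL", "STACK_GLOBAL", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX", "REDUCE"}


def import_opcodes(data: bytes) -> list[str]:
    """Return the names of import/call opcodes found in a pickle stream.

    pickletools.genops only disassembles; nothing in the stream is executed.
    """
    return [op.name for op, _arg, _pos in pickletools.genops(data) if op.name in SUSPECT]


plain = pickle.dumps({"a": 1, "b": [1, 2]}, protocol=4)
dated = pickle.dumps(datetime.date(2020, 1, 1), protocol=4)

print(import_opcodes(plain))  # plain containers need no imports
print(import_opcodes(dated))  # an ordinary stdlib object already imports and calls
```

A harmless `datetime.date` uses the same import-then-call opcodes a payload would, which is why a scanner ends up judging what is imported rather than whether something is imported.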

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

Sakamoto, Ablin, Danieli & Cuturi

DLR-Lock replaces each pretrained MLP with a deep low-rank residual network that increases activation memory during fine-tuning and creates architectural mismatches that frustrate standard optimization. Tested against adaptive attackers with full knowledge of the defense. Framing the goal as locking rather than watermarking shifts from post-hoc detection to resistance, which is a different threat model.

arxiv.org · May 2026

BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

Venugopalan Iyengar

Trellis-coded quantization encodes model weights using error-correcting code structure; the Viterbi decoding step is non-differentiable, which breaks gradient flow during QAT. BCJR-QAT replaces Viterbi with the forward-backward sum-product algorithm to restore end-to-end differentiability. Trellis-coded weight encoding is the quantization scheme most directly analogous to steganographic embedding; this advances its training-side infrastructure.

arxiv.org · May 2026

Undetectable Backdoors in Model Parameters: Hiding Sparse Secrets in High Dimensions

Choudhary, Patlan, Palumbo, Hooda, Fawaz & Jha

Sparse Backdoor injects a structured sparse perturbation along a randomly chosen direction into a small subset of columns per fully connected layer, then masks it with isotropic Gaussian dither. The real contribution is the hardness argument: detection reduces to Sparse PCA, so under the standard hardness assumptions for that problem no polynomial-time detector finds it. A weight-space channel with a complexity-theoretic floor under it is a much stronger claim than the usual empirical stealth result.

arxiv.org · May 2026

Provably Secure Steganography Based on List Decoding

Pang & Bai (Tsinghua University)

Pushes provably-secure linguistic steganography toward higher embedding capacity by maintaining a list of candidate decodings rather than a single one. Directly relevant to the entropy-budget question in any LLM-mediated covert-channel design.

arxiv.org · April 2026

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers (BadStyle)

Wei et al.

Style-level (not token- or syntax-level) backdoor triggers, generated by an LLM as a poisoned-sample synthesizer. Adds an auxiliary target loss to stabilize payload injection during fine-tuning. Evaluated against seven model families.

arxiv.org · April 2026

CSC: Turning the Adversary's Poison against Itself

Shi, Guo, Chen, Zhu, Liu & Zhou

Poisoned samples form isolated latent-space clusters early in training because triggers dominate their feature representation. CSC exploits this: cluster, segregate, relabel to a virtual class, fine-tune. Near-zero ASR across four datasets and twelve attack variants with minimal clean accuracy loss. The trigger design that makes poisoned samples distinctive to the model also makes them distinctive to an auditor.

arxiv.org · April 2026

Inference-Time Backdoors via Chat Templates: From LLM Supply Chains to Agentic System Compromise

Fogel, Hofman, Cohen & Vainshtein

Chat templates are Jinja2 programs that run on every inference call, which makes them a code channel that ships alongside the weights without touching them. Triggered backdoors dropped factual accuracy from 90 percent to 15 percent across eighteen models and hijacked tool use across 3,868 agent episodes, and the poisoned artifacts passed every scan on the largest open model distribution platform. Same attack surface as the Llama-Drama template work, aimed at distribution rather than at a parser bug.

arxiv.org · February 2026

EvilModel: Hiding Malware Inside of Neural Network Models

Wang, Liu & Cui

The canonical reference for byte-level steganography in float32 weight tensors. Explicitly defers the decoder to a separately-deployed loader, which is the substantive limitation when read against a channel/decoder/substrate framework.

arxiv.org · 2021
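
A minimal sketch of the byte-level idea in the EvilModel entry above, assuming a scheme that overwrites only the lowest-order byte of each float32 weight; the paper's actual byte budget may differ. The helper names `embed` and `extract` are hypothetical, and the payload is a plain test string. Note that `extract` is exactly the piece the paper leaves to a separately deployed loader.

```python
import struct


def embed(weights: list[float], payload: bytes) -> list[float]:
    """Overwrite the lowest mantissa byte of each float32 with one payload byte."""
    if len(payload) > len(weights):
        raise ValueError("payload needs one weight per byte")
    out = []
    for i, w in enumerate(weights):
        raw = bytearray(struct.pack("<f", w))  # little-endian: raw[0] is the low mantissa byte
        if i < len(payload):
            raw[0] = payload[i]
        out.append(struct.unpack("<f", bytes(raw))[0])
    return out


def extract(weights: list[float], n: int) -> bytes:
    """Read back the low byte of the first n float32 weights."""
    return bytes(struct.pack("<f", w)[0] for w in weights[:n])


# Toy weights: nonzero, finite, normal float32 values (an assumption of this sketch).
weights = [0.5, -1.25, 3.1415927, 0.001, -42.0]
payload = b"hello"
stego = embed(weights, payload)

# Compare against float32-rounded originals, since packing rounds to float32.
rounded = [struct.unpack("<f", struct.pack("<f", w))[0] for w in weights]
max_rel_err = max(abs(s - r) / abs(r) for s, r in zip(stego, rounded))

print(extract(stego, len(payload)))
print(max_rel_err < 2 ** -15)  # 8 of 23 mantissa bits changed at most
```

The perturbation stays below 2^-15 relative to each weight, which is the reason the substrate tolerates the channel; nothing in the tensor itself can run the payload.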

BadNets: Identifying Vulnerabilities in the ML Supply Chain

Gu, Dolan-Gavitt & Garg

Origin paper for trigger-based co-trained backdoors. The decoder and the channel are baked into the network's weights together, which is why detection has to be behavioral rather than static.

arxiv.org · 2017

LLM red-teaming & jailbreaks

PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

He, Miculicich, Sharma, Fox, Lee, Tang, Pfister & Le

Generates realistic source-aware test cases for indirect prompt injection and evolves them through feedback-driven exploration, instead of sampling from a fixed library of known payloads. Localizes which injected source caused the failure rather than only reporting that the agent failed. It stays effective with defenses deployed, which is the evaluation condition most red-teaming papers quietly skip.

arxiv.org · June 2026

NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Control Rooms

Lee, Choi, Kim, Park & Kim

Puts LLM operator agents in a simulated nuclear power plant control room and runs adaptive multi-turn attacks against the operator team. Between 8.7 and 12.1 percent of attack sessions across four models ended in loss of a critical safety function, and the vulnerabilities were distinct per model rather than overlapping. The transferable part is the harness rather than the scenario: single-turn refusal rates say almost nothing about how an agent holds up over a session.

arxiv.org · June 2026

Jailbroken Frontier Models Retain Their Capabilities

Zhu, Wang, Bao & Wei

28 jailbreaks across five benchmarks, Claude models from Haiku 4.5 to Opus 4.6. The capability tax scales inversely with model strength: Haiku 4.5 loses 33.1% average capability under jailbreaks, Opus 4.6 loses 7.7%. The most sophisticated attacks approach zero degradation. Safety assessments that rely on capability loss as a self-limiting mechanism are working from a false premise.

arxiv.org · May 2026

Jailbreaking Vision-Language Models Through the Visual Modality

Azulay, Dubiński, Li, Mittal & Gandelsman

Four attack methods targeting the visual processing path of VLMs: encoding harmful instructions as visual symbols with a legend, object substitution, text replacement in images, and visual puzzles. A visual cipher achieved 40.9% ASR on Claude-Haiku-4.5 vs. 10.7% for the text equivalent. The gap quantifies how little text-focused safety training transfers to vision.

arxiv.org · May 2026

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

Lin, Niu, Ji & Gao

DR-Smoothing adds a two-stage prompt processing scheme: disrupt the input, then rectify it back toward the distribution of ordinary prompts before passing it to the model. Prior disrupt-only methods left the model seeing out-of-distribution prompts. The rectification step removes that problem while preserving disruption as the defense layer. Includes theoretical bounds on defense success probability.

arxiv.org · May 2026

Re-Triggering Safeguards within LLMs for Jailbreak Detection

Lin, Niu, Ji, Huang & Gao

Rather than training a standalone jailbreak classifier, this uses embedding disruption to re-activate the model's own internal safeguards for detection. The defense reuses what the model already knows. Effective against adaptive attacks in evaluation. Same authors as DR-Smoothing: this paper covers detection, while DR-Smoothing covers the defense-side response.

arxiv.org · May 2026

CALYREX: Cross-Attention Layer Extended Transformers for System Prompt Anchoring

Li Lixing

Adds cross-attention at the final eighth of transformer layers to give system prompts a structurally distinct processing path rather than treating them identically to user input. At 8B scale: +7.4% on IFEval, +16.3% on multi-turn instruction adherence, -13% many-shot jailbreaking ASR. Taking an architectural approach rather than a tuning one is what makes the gains scale-consistent.

arxiv.org · May 2026

Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Zhou, Zhao, Zhong, Liang, Chen et al.

Recasts jailbreaking as inference-time policy optimization in an adversarial decision process. A self-evolving metacognitive loop diagnoses the target's defense logic and refines the attack trajectory through structured feedback. 89.2% average ASR across 10 models including 76.0% on O1 and 78.0% on GPT-5-chat, at 8.2x lower compute than prior methods.

arxiv.org · May 2026

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Zhang, Yang, Jiang, Zhang, Zhao et al.

819 test cases for evaluating safety in LLM agents operating in actual operating systems. Identifies Execution Hallucination: agents verbally refuse a request while the harmful OS-level action completes undetected. Running in real OS environments rather than simulation is what exposes this gap between stated refusal and actual execution.

arxiv.org · May 2026

When Prompts Become Payloads: SQL Injection via LLM-Driven Natural Language Interfaces

Motlagh, Hajizadeh, Majd, Najafi, Cheng & Meinel

Conversational interfaces that translate natural language to SQL queries inherit SQL injection as a threat class. The proposed defense stacks three layers: front-end prompt sanitization, a behavioral/semantic anomaly detector, and a signature layer for known patterns. The LLM translation layer that creates the vulnerability is also what makes the sanitization non-trivial.

arxiv.org · May 2026

Adaptive Instruction Composition for Automated LLM Red-Teaming

Zymet et al. (Capital One AI Foundations)

Replaces random combination of crowdsourced jailbreak ingredients with a contextual-bandit learner that scores combinations based on prior success. Roughly 2,200-parameter bandit on top of SBERT embeddings. Transfers across models without retraining.

arxiv.org · April 2026

Transient Turn Injection: Stateless Multi-Turn Vulnerabilities

Rayhan & Jahan

Distributes adversarial intent across stateless turns, evading moderation that evaluates each turn independently. Notable for showing that a threat model built around a single-turn safety classifier is incomplete against an attacker LLM operating across sessions.

arxiv.org · April 2026

Persistent Pre-Training Poisoning of LLMs

Zhang, Rando, Evtimov, Carlini, Tramèr et al. (Meta · ETH Zürich · CMU · Google DeepMind)

Poisoning 0.1% of pre-training data is enough for three of four backdoor objectives (DoS, belief manipulation, jailbreaking) to survive post-training. DoS persists at 0.001%. The supply-chain layer the threat model has to start at.

arxiv.org · 2024

MCP & agentic security

MCP Auto-Execution: From Git Clone to Cloud Compromise in Amazon Q VS Code Extension

Maor Dokhanian (Wiz)

The extension loaded MCP server configs from the workspace root with no consent prompt and no workspace trust check, and the servers it spawned inherited the full user environment. Opening a cloned repository was the entire exploit chain, straight through to AWS credentials, cloud CLI tokens, and the SSH agent. CVE-2026-12957, remediated in language server 1.65.0. Workspace files are untrusted input, and every agentic IDE that auto-loads tool configuration is making the same bet this one lost.

wiz.io · June 2026

Evaluating Tool Cloning in Agentic-AI Ecosystems

Kim, Jiang, Hu, Jia & Gong

8,861 repositories, 100,011 tools from MCP and Skills platforms. 60% of high-Jaccard candidates and 85% of high-ssdeep candidates are manually verified clones. Vulnerable code in a cloned tool propagates to all downstream repositories automatically. Benchmark splits built from "diverse" tool datasets may be evaluating heavily duplicated code. A supply-chain amplification pattern that pre-dates LLMs.

arxiv.org · May 2026
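
To make the Jaccard candidate step in the tool-cloning entry above concrete, here is a small sketch that compares two tool sources by the overlap of their token shingles. The tokenizer, the shingle size `k=3`, and the function names are assumptions for illustration; the paper's preprocessing and thresholds are not reproduced here.

```python
import re


def shingles(source: str, k: int = 3) -> set[tuple[str, ...]]:
    """Tokenize into identifiers, numbers, and punctuation; return the set of k-token windows."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity |A & B| / |A | B| over k-token shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two inputs with no shingles are treated as identical
    return len(sa & sb) / len(sa | sb)


original = "def fetch(url): return requests.get(url).text"
clone = "def fetch(url): return requests.get(url).text  # copied"
other = "class Parser: pass"

print(jaccard(original, original))
print(jaccard(original, clone) > jaccard(original, other))
```

A high score only nominates a pair as a candidate; per the entry, the clone rates come from manual verification of those candidates.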

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Kereopa-Yorke, Diaz, Wright, Johnston, Del Rosario & Lynar

Differs from prompt injection by targeting the knowledge-graph data that agents reason over rather than their instructions. Six attack scenarios against a production knowledge graph with 42M nodes across nine models from three providers. All models accepted fabricated security claims at 100% under directed queries. GPT-5.1 showed 0% trust in inline evaluation but 100% under actual tool use, which is the key finding: the delivery channel changes the attack surface.

arxiv.org · May 2026

Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems

Jamshidi, Khomh, Fung & Nafi

ASPO integrates LLM reasoning with deterministic enforcement inside a MAPE-K control loop. LLM agents propose mitigations; an optimization engine ensures proposals are conflict-free and resource-feasible before acting. 100% conflict-free activation on a testbed of 500 to 1,000 decisions. The separation between reasoning and enforcement is the design principle worth generalizing.

arxiv.org · May 2026

Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

Chunxiao Wang

Black-box drift detection using cosine similarity between user prompts and behavioral anchor texts, aggregated by weighted top-k mean over BGE-m3 embeddings. ROC AUC 0.83 on real session traces. Available as a Claude Code plugin and MCP server with Merkle-chained audit logging. The ~30-point gap below white-box methods is the explicit cost of not touching model weights.

arxiv.org · May 2026
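
A rough sketch of the scoring step described in the Nautilus Compass entry above: cosine similarity between one prompt embedding and each anchor embedding, aggregated as a weighted mean over the top k. The 1/rank weights, the toy 3-dimensional vectors, and the function names are all assumptions; the real system uses BGE-m3 embeddings, and its weighting and decision threshold are not reproduced here.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topk_weighted_mean(prompt: list[float], anchors: list[list[float]], k: int = 2) -> float:
    """Weighted mean of the k highest prompt-to-anchor similarities."""
    if not anchors:
        raise ValueError("at least one anchor embedding is required")
    sims = sorted((cosine(prompt, a) for a in anchors), reverse=True)[:k]
    weights = [1.0 / (rank + 1) for rank in range(len(sims))]  # assumed 1/rank decay
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


# Hypothetical stand-ins for embedded anchor texts and two embedded prompts.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
near_anchor = [1.0, 0.0, 0.0]
far_from_anchors = [-1.0, -1.0, 0.0]

score_near = topk_weighted_mean(near_anchor, anchors)
score_far = topk_weighted_mean(far_from_anchors, anchors)
print(score_near, score_far)
```

Everything here operates on embeddings of text the operator already sees, which is what keeps the method black-box.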

MCP Pitfall Lab: Developer Pitfalls in MCP Tool Server Security

Hao & Tan

Six-class pitfall taxonomy (P1–P6) split into statically-checkable (Tier-1) and trace/dataflow-dependent (Tier-2) classes. Three workflow challenges (email, document, crypto) with hardened-vs-baseline server pairs and three attack families: tool-metadata poisoning, puppet servers, image-to-tool chains.

arxiv.org · April 2026

CAP: Controllable Alignment Prompting for Unlearning in LLMs

Wang, Guo, Pu, Pu, Yang et al.

Machine unlearning without modifying model parameters: a prompt generator trained via reinforcement learning collaborates with the LLM to suppress target knowledge while preserving general capabilities. Works on closed-source models. Reversal is possible by revoking the prompt rather than retraining, which matters for unlearning as a compliance tool when legal hold periods end.

arxiv.org · April 2026

Threat modeling and prompt injection in Comet

Trail of Bits

ML-centered threat modeling applied to an agentic browser. Four prompt-injection techniques against the AI assistant, all chained to exfiltrate Gmail data. The methodology, TRAIL, is more transferable than any individual finding.

blog.trailofbits.com · February 2026

Trivial Trojans: Cross-Tool Exfiltration via Minimal MCP Servers

Croce & South

Concrete demonstration of cross-server data exfiltration in MCP. The barrier-to-entry argument matters: this is not a sophisticated attack class, which is the point.

arxiv.org · July 2025

Beyond the Protocol: Attack Vectors in the MCP Ecosystem

Song et al.

First end-to-end empirical evaluation of attacks against MCP. Four attack categories: tool poisoning, puppet attacks, rug pull, and exploitation via malicious external resources. Useful as the lay-of-the-land paper before any MCP-specific work.

arxiv.org · 2025

Inference infrastructure & ML platforms

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

Luo, Chang, Wei, Wu, Gao, Qiu, Yu & Liu

Intermediate embeddings exchanged between participants in distributed multimodal inference leak the input image, not just the text prompt studied previously. Embedding extraction reaches near-total accuracy across almost all layers, then two attacks reconstruct from it: patch assembly for pixel-level output, embedding-guided diffusion for semantic. Evaluated on Gemma 3, Phi 4 Multimodal, Qwen 2.5 VL, and Llama 4 Scout. Worth reading if you treat a disaggregated inference deployment as a trust boundary, which most deployments do not.

arxiv.org · June 2026

CleanBase: Detecting Malicious Documents in RAG Knowledge Databases

Jin, Wang, Zou, Jia & Gong

Malicious documents crafted for the same attack-targeted question exhibit high semantic similarity to each other. CleanBase builds a similarity graph and detects cliques, with a statistically-determined threshold and theoretical error-rate bounds. A working CleanBase deployment reduces the poisoned corpus before retrieval, making retrieval rate a function of detection coverage rather than a fixed attack property.

arxiv.org · May 2026
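
The similarity-graph idea in the CleanBase entry above can be sketched as follows: connect documents whose embeddings are near-duplicates, then look for cliques. This sketch swaps in a fixed threshold of 0.99 and a plain Bron-Kerbosch enumeration as stand-ins; the paper determines its threshold statistically, and its detection procedure and error bounds are not reproduced here. The toy 2-dimensional vectors, `min_size`, and function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(vectors, threshold):
    """Adjacency sets: an edge joins i and j when their cosine similarity >= threshold."""
    adj = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting; fine for small graphs)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Documents 0-2 mimic texts crafted for the same target question; 3 and 4 are unrelated.
docs = [[1.0, 0.02], [1.0, 0.0], [1.0, -0.02], [0.0, 1.0], [-1.0, 0.3]]
min_size = 3  # assumed minimum clique size worth flagging

graph = similarity_graph(docs, threshold=0.99)
flagged = [sorted(c) for c in maximal_cliques(graph) if len(c) >= min_size]
print(flagged)
```

Flagged documents would be removed before indexing, which is how detection coverage ends up governing what retrieval can surface.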

Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing (GRIEF)

Zhao, Zhao, Zhang, Liu & Mazurek

A greybox fuzzer for LLM inference engines that treats timed multi-request traces as first-class inputs, which is the right abstraction for bugs that only exist under concurrency. Fifteen findings across vLLM and SGLang, ten confirmed by developers and two assigned CVEs, spanning KV-cache isolation failures, cross-request performance degradation, and crashes. The closest published work to what Crucible does, pointed at the serving layer instead of the parser: the state is shared and the input is a schedule rather than a file.

arxiv.org · May 2026

Behavioral Consistency and Transparency Analysis on LLM API Gateways

Lin, Wan, Pei, Xu, Xu & Xue

GateScope audits third-party LLM API gateways across response content, multi-turn quality, billing accuracy, and latency. Measurement of 10 commercial gateways found undisclosed model substitutions, degraded conversation memory, pricing deviations, and inconsistent latency. The billing-accuracy finding is the most actionable: gateways charging for model calls that differ from what was advertised.

arxiv.org · April 2026

Separable Expert Architecture: Privacy-Preserving LLM Personalization via Composable Adapters

Schneider, Schoenegger & Bariach

Three-layer architecture: static base model, composable domain-expert LoRA adapters, and removable per-user proxy artifacts. Removing a user's artifacts returns outputs to baseline and prevents cross-user leakage, tested on Phi-3.5-mini and Llama-3.1-8B. Reframes machine unlearning as deterministic deletion of a separable artifact rather than expensive parameter updates.

arxiv.org · April 2026

mcp-run-python: lack of isolation, MCP takeover, Deno SSRF

Natan Nehorai (JFrog)

Two CVEs (CVE-2026-25905, CVE-2026-25904) in a popular MCP server template. The class of bug is a useful pattern: trusting that a Deno sandbox plus a containerized Python runner will hold under MCP-style invocation.

research.jfrog.com · February 2026

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

Rorseth, Godfrey, Golab, Srivastava & Szlichta

If-then rules linking source presence and absence to RAG output behavior, with Apriori-like pruning to avoid brute-force source-combination search. The provenance angle extends beyond explainability: if you can determine which source combinations produce which outputs, you can identify which sources a poisoning attack needs to control.

arxiv.org · October 2025

Uncovering memory corruption in NVIDIA Triton (as a new hire)

Will Vandevanter (Trail of Bits)

Two remotely-exploitable memory-corruption bugs (CVE-2025-23310, CVE-2025-23311) in Triton's HTTP request handling, surfaced via static analysis plus chunked-encoding probing. The reminder: production inference servers are still C/C++ network services with all the attendant historical bug classes, and authentication is off by default.

blog.trailofbits.com · August 2025

Breaking NVIDIA Triton: CVE-2025-23319 vulnerability chain to RCE

Wiz Research

A multi-stage vulnerability chain in the Triton Python backend, starting from a minor information leak about shared-memory region names and escalating to unauthenticated RCE. Useful as a case study in chaining low-severity primitives into a takeover.

wiz.io · August 2025

GGUF-SSTI: Llama-Drama and the Jinja template attack surface

JFrog Security Research

Reference for CVE-2024-34359 (the chat-template Jinja RCE in llama-cpp-python) and the broader question of when loading a GGUF model can lead to server-side template injection. The case study for why loader extensions need the same threat-modeling rigor as the loader itself.

research.jfrog.com · 2024

Identity, AD, and lateral movement

Clustered Points of Failure

Garrett Foster (SpecterOps)

Cluster Name Objects and Virtual Cluster Objects store encrypted resource credentials on every node so that failover works, which means compromising one node hands you the credentials for the whole cluster. Clusters host SQL Server, ADFS, Configuration Manager, and Exchange, so cluster compromise reads as domain compromise. 82 percent of BloodHound Enterprise clients had at least one cluster, averaging 141 per environment. Availability engineering and credential isolation are in direct tension, and availability won.

specterops.io · July 2026

There and Back Again: An Operators Guide on NTLM Relaying Egress

Logan Goins (SpecterOps)

Coerce SMB or WebDAV authentication out of the target network to a cloud-hosted VM, forward it back through red team infrastructure, and relay it into ADCS web enrollment or LDAP without relay protections. The point is that relay tradecraft still works when local relay is off the table, which is the normal condition from a low-privilege C2 agent behind a host firewall. Mitigations are the familiar four: LDAP signing and channel binding, EPA on ADCS endpoints, outbound SMB egress filtering, and host firewall rules.

specterops.io · July 2026

Attack Paths Don't Stop at Identity Providers

Jared Atkinson (SpecterOps)

Modeling Okta in BloodHound Enterprise alongside AD, Entra, and GitHub. The argument: identity boundaries between platforms are where attack paths actually live, and treating any single platform in isolation underrepresents real risk.

specterops.io · March 2026

AdminSDHolder: Misconceptions, Misconfigurations, and Myths

Jim Sykora (SpecterOps)

Long-form correction of decades of incorrect documentation around AD's AdminSDHolder mechanism. The kind of historical-grounding piece that's useful before doing anything privileged-account-related on AD engagements.

specterops.io · October 2025

Certified Pre-Owned

Will Schroeder & Lee Christensen

The AD CS paper that opened up the modern wave of ADCS work. Still the cleanest framing of what a tooling-up problem looks like before any tools exist.

specterops.io · 2021

AI x security writing

Atlas: Wiz's autonomous AI agent for vulnerability research

Nir Ohfeld & Yuval Avrahami (Wiz)

Four stages: threat model the attack surface with code property graphs, fan out independent agents to test hypotheses in parallel, contest every finding adversarially before accepting it, then prove it with a working exploit in a real execution environment. 90.9 percent on CyberGym and over 200 previously unknown vulnerabilities in heavily audited projects including Kubernetes, gRPC, and the Linux kernel. The design claim matches my own experience: the durable advantage is the harness around the model, not the model.

wiz.io · July 2026

Fast Remediation Is the New Trust Model: JFrog and OpenAI Collaboration on Zero-Day Security Findings

Yoav Landman (JFrog)

OpenAI models autonomously found previously unknown vulnerabilities in self-hosted Artifactory during a capability evaluation, and the fixes shipped in 7.161 and above. Landman's argument is that the trust model has to move to remediation speed, because a model-discovered zero-day sitting in a vendor queue is pure attacker upside. Read next to the Atlas post: this is the month AI-discovered vulnerabilities became an operational question for defenders instead of a thought experiment.

jfrog.com · July 2026

State of AI in the Cloud 2026

Wiz Research

Infrastructure-side measurements of AI adoption: 81% of cloud environments use managed AI services, 90% run self-hosted, 80% have MCP servers. The framing of AI as accumulated rather than adopted is a useful governance lens.

wiz.io · April 2026

GitHub RCE via X-Stat header injection (CVE-2026-3854)

Wiz Research

An authenticated git push achieves RCE on GitHub's backend through a delimiter-based internal protocol. Also notable as one of the first critical vulnerabilities the team credits to AI-augmented reverse engineering.

wiz.io · April 2026

Trailmark: turning code into security-analysis graphs

Trail of Bits

Tree-sitter plus rustworkx, packaged as Claude Code skills for blast-radius and taint-propagation analysis. Useful as a reference for how graph reasoning composes with LLM agents in a security workflow.

blog.trailofbits.com · April 2026

Monthly digest

First of the month

Monthly digest pending. First issue when there's something worth saying.