The Format That Got It Right

Over the past several months I ran structure-aware fuzzing campaigns against eight ML inference projects. The targets were llama.cpp, whisper.cpp, stable-diffusion.cpp, PyTorch/libtorch, TFLite, Apple MLX, Ollama, and gguf-tools. Across those targets the campaigns produced more than 60 security findings, including heap overflows, stack overflows, divisions by zero, reachable assertions, and one RCE rated CVSS 9.8.

After a known implementation bug was fixed, one format produced zero crashes across 38 million inputs in the reference implementation and another 9.7 million inputs in an independent C++ loader.

That format is SafeTensors.

This is not a story about one lucky run. It is a story about what happens when a model format treats untrusted input as untrusted input, and what we can learn from the difference.

The numbers

Crucible is a structure-aware fuzzer I built for ML model parsers. The workflow is simple. Build the target with AddressSanitizer and UBSan, write a harness that feeds libFuzzer-generated inputs directly into the parser entry point, run for hours or days, collect crashes, and minimize proof-of-concept inputs.
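The harness contract in that workflow can be sketched in a few lines. This is a pure-Python stand-in, not the Crucible harness and not a libFuzzer entry point: classify_input and toy_parse are hypothetical names, and the toy parser carries a deliberate bug so the difference between a clean rejection and a finding is visible.

```python
def classify_input(parse, data: bytes) -> str:
    """Run one fuzz input through `parse`, a hypothetical parser entry point."""
    try:
        parse(data)
    except ValueError:
        # A clean rejection of malformed input is expected and is not a finding.
        return "rejected"
    # Any other exception propagates to the caller; a fuzzer would record it as a crash.
    return "accepted"


def toy_parse(data: bytes) -> int:
    """Toy parser with a deliberate bug, for illustration only."""
    if len(data) < 8:
        raise ValueError("input shorter than the 8-byte prefix")
    return data[8]  # bug: IndexError when the input is exactly 8 bytes long


print(classify_input(toy_parse, b""))           # rejected
print(classify_input(toy_parse, b"\x00" * 9))   # accepted
```

An input of exactly 8 bytes raises IndexError out of classify_input. In a sanitizer-instrumented native build, the analogous event would be an ASan or UBSan report.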

For GGUF, the format used by llama.cpp, whisper.cpp, stable-diffusion.cpp, and Ollama, the first campaign found bugs within minutes. The parser made direct assumptions about input the spec did not enforce. Key lengths were taken at face value. Allocation sizes came from the file header. Assertions stood in for proper error returns. Across the broader campaign, the same pattern kept showing up in formats that treated model files as trusted input.

I ran two campaigns against SafeTensors. One targeted the reference Rust implementation. The other targeted Apple MLX’s C++ loader.

Target	Inputs	Runtime	Result
SafeTensors Rust	38,441,267	1,801 seconds	0 crashes
Apple MLX loader	9,706,835	1,801 seconds	0 crashes

The MLX campaign instrumented 893,186 program counter addresses. Its corpus grew from 55 to 1,529 inputs during the run, and coverage was still expanding at the end. The harness covered the parsing path, metadata validation, and downstream tensor access calls. This was not a campaign that missed the surface. The GGUF campaigns found bugs at execution counts an order of magnitude smaller than this. If SafeTensors had the same class of shallow parser bugs, I would expect them to appear.

The design choices made a difference.

What SafeTensors does differently

SafeTensors was designed in 2022 to avoid the problems with pickle-based model serialization. Its safety properties are documented in the spec and visible in the implementations I tested.

Header size is bounded before allocation. The first 8 bytes of the file encode the header length. Both the Rust and MLX implementations reject files with a header length above 100 MB before reading the header body. A crafted large-length field cannot force a large heap allocation because the gate is hit first.
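A minimal sketch of that gate, assuming the 8-byte prefix is a little-endian unsigned 64-bit integer and using a cap taken from the 100 MB figure above. The function name read_header_bytes and the constant MAX_HEADER_BYTES are illustrative, not copied from either implementation.

```python
import io
import struct

# Assumed cap, based on the 100 MB figure in the text; real loaders define their own constant.
MAX_HEADER_BYTES = 100_000_000


def read_header_bytes(stream, file_size: int) -> bytes:
    """Return the raw JSON header, or raise ValueError before any large read."""
    prefix = stream.read(8)
    if len(prefix) != 8:
        raise ValueError("file too short for the 8-byte length prefix")
    (header_len,) = struct.unpack("<Q", prefix)  # little-endian unsigned 64-bit
    # Gate 1: fixed cap, checked before any buffer is sized from the field.
    if header_len > MAX_HEADER_BYTES:
        raise ValueError("header length exceeds the cap")
    # Gate 2: the header must fit inside the file that was actually supplied.
    if header_len > file_size - 8:
        raise ValueError("header length exceeds the file size")
    body = stream.read(header_len)
    if len(body) != header_len:
        raise ValueError("truncated header")
    return body


good = struct.pack("<Q", 2) + b"{}"
print(read_header_bytes(io.BytesIO(good), len(good)))  # b'{}'
```

A file whose prefix claims 2**63 bytes fails at the first gate, so the attacker-controlled length never reaches a read or an allocation.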

Offsets are checked before tensors exist. Each tensor metadata entry includes a data_offsets field with start and end byte positions in the data region. The parser checks that the end offset is within the file and that the span equals the dtype's element size in bytes multiplied by product(shape) before constructing a tensor object. A file that claims a 10 GB tensor inside a 1 KB artifact fails before allocation.
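The offset check can be sketched as follows. This assumes each metadata entry is a dict with dtype, shape, and data_offsets keys; validate_entry is a hypothetical helper, and DTYPE_BYTES is an illustrative subset of dtype widths rather than the full list.

```python
import math

# Byte widths for a few dtype names; an illustrative subset only.
DTYPE_BYTES = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I64": 8, "I32": 4, "U8": 1}


def validate_entry(entry: dict, data_len: int) -> tuple:
    """Check one tensor entry against the data region before building a tensor."""
    dtype = entry["dtype"]
    shape = entry["shape"]
    start, end = entry["data_offsets"]
    if dtype not in DTYPE_BYTES:
        raise ValueError(f"unknown dtype {dtype!r}")
    if not all(isinstance(d, int) and d >= 0 for d in shape):
        raise ValueError("shape must contain non-negative integers")
    if not (isinstance(start, int) and isinstance(end, int)):
        raise ValueError("offsets must be integers")
    # The span must lie inside the data region that is actually present.
    if not (0 <= start <= end <= data_len):
        raise ValueError("offsets fall outside the data region")
    # The span must equal element size times element count.
    expected = DTYPE_BYTES[dtype] * math.prod(shape)
    if end - start != expected:
        raise ValueError("span does not match dtype size * product(shape)")
    return start, end


ok = {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]}
print(validate_entry(ok, data_len=24))  # (0, 24)
```

Python integers do not overflow, so the multiplication here is safe as written. A C++ or Rust loader would need checked arithmetic for the same product, since an overflowing shape could otherwise wrap around to a plausible size.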

Input bytes do not drive pointer arithmetic before validation. The JSON header is parsed as a complete unit, validated, then acted on. There are no variable-length fields whose lengths feed pointer movement before the parser has checked the surrounding structure.

The file contains no executable content. Pickle's GLOBAL and REDUCE opcodes can import and invoke Python callables during deserialization. TorchScript embeds a source archive. SafeTensors stores tensor bytes and metadata. Loading the file does not create an execution path.
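The pickle behavior is easy to see with a harmless stand-in. In this sketch the Payload class is hypothetical and uses the builtin len as the callable; an attacker would pick something far less benign. Protocol 2 is chosen only so that the GLOBAL opcode appears by name, since newer protocols encode the same lookup as STACK_GLOBAL.

```python
import pickle
import pickletools


class Payload:
    """Hypothetical object whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # "To rebuild me, call this callable with these arguments."
        return (len, ("abc",))  # harmless stand-in for an attacker-chosen callable


blob = pickle.dumps(Payload(), protocol=2)

# List the opcode names that the byte stream contains.
opcodes = {op.name for op, _arg, _pos in pickletools.genops(blob)}
print("GLOBAL" in opcodes, "REDUCE" in opcodes)  # True True

# Loading runs len("abc"); the result is whatever the callable returned.
result = pickle.loads(blob)
print(result)  # 3
```

Nothing about the Payload class survives the round trip. The loader only sees an instruction to look up a callable and invoke it, which is the execution path a tensor-bytes-plus-metadata format does not have.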

None of these choices is exotic. They are ordinary input-validation decisions made at format-design time. The contrast with GGUF is sharp. GGUF reads key lengths from the file, value sizes from the file, and tensor counts from the file, then acts on them before all of the consistency checks have run. That architecture produces bugs. SafeTensors makes those bugs harder to write.

Even safe formats can be misimplemented

One earlier finding is worth mentioning. It predates the clean campaigns, and the later runs confirmed that the fix held.

In an earlier implementation path, a crafted SafeTensors file with a mismatched data_offsets and shape pair could trigger an oversized allocation before the right bounds check fired. The format’s safety properties are not magic properties of the wire format alone. They depend on validators running in the right order. That implementation initially got one path wrong. Current HEAD gets it right.

The 38-million-input campaign is a test of the current validator. It is not proof that SafeTensors is impossible to misimplement. It is evidence that the current implementations are doing the important things in the right order.

What this means when choosing a format

The ML ecosystem is converging on SafeTensors for a reason. The Hugging Face ecosystem has largely standardized on it over pickle-based checkpoints. llama.cpp added a SafeTensors loader. The security story is part of the adoption story, even when people do not say that part out loud.

The lesson for anyone building or adopting a model format is plain. The format choice is a security decision.

GGUF is expressive, compact, and well supported. It has also produced 27 findings across model-loading and inference-adjacent surfaces in a few months of fuzzing. SafeTensors has structural limits. It has no executable loader behavior, no rich object graph, no arbitrary Python objects, and no cross-file reference mechanism. Those limits are not incidental. They are the source of the safety properties.

You can have a format that tries to do everything, or you can have a format that is hard to exploit. SafeTensors made the narrower choice. The fuzzing results are the receipt.

The broader read

The interesting lesson is not that Rust is better than C++ or that SafeTensors is flawless. The MLX result matters because an independent C++ loader also survived. The result points back to the format contract.

Good parser security starts before implementation. It starts with a format that bounds lengths before allocation, validates offsets before materialization, separates metadata from execution, and gives implementers fewer dangerous choices.

SafeTensors got that right.