FP4, sparsity, and the petaFLOP claim

The headline figure for the DGX Spark is “up to 1 petaFLOP of AI performance.” That number is real but specific, and it helps to know exactly what it measures.

It is an FP4 number

The petaFLOP figure is FP4 throughput: four-bit floating point. Blackwell’s 5th-generation Tensor Cores added native support for very low-precision formats, and FP4 is the lowest. Lower precision means each operation moves and computes on fewer bits, so the same hardware completes far more operations per second than it does at FP8 or FP16.

This is not sleight of hand. Low-precision inference is genuinely how modern models are served efficiently. But it does mean the petaFLOP is the ceiling for FP4 work, not a general-purpose compute rating. The same chip doing FP16 training math delivers far fewer operations per second.

It assumes sparsity

NVIDIA’s number is the theoretical peak with the sparsity feature enabled. Structured sparsity lets the Tensor Cores skip weights that are zero in a fixed, regular pattern, roughly doubling throughput on models that have been trained or pruned to fit that pattern. A dense model that does not use structured sparsity will not reach the sparse peak.
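To make "structured" concrete, here is a minimal sketch of N:M pruning in plain Python. It assumes a 2:4 pattern (keep the two largest-magnitude weights in every group of four), which is the pattern NVIDIA has documented for earlier sparse Tensor Cores; whether FP4 on this chip uses exactly that pattern is an assumption here. The function names are hypothetical, and real pruning tools also retrain or fine-tune to recover accuracy.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude values in each group of m.

    Illustrative only: n=2, m=4 is an assumed pattern, not a spec quote.
    """
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def is_n_of_m_sparse(weights, n=2, m=4):
    """True if every group of m contains at least m - n zeros."""
    return all(
        sum(1 for w in weights[s:s + m] if w == 0) >= m - n
        for s in range(0, len(weights), m)
    )


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_n_of_m(row)
print(sparse_row)                    # half of each group of four is now zero
print(is_n_of_m_sparse(sparse_row))  # the pruned row satisfies the pattern
print(is_n_of_m_sparse(row))         # the original dense row does not
```

The point of the fixed pattern is that the hardware knows in advance how many weights per group it can skip, which is what makes the speedup predictable. Arbitrary scattered zeros in a dense model do not qualify.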

So “1 petaFLOP” expands to “up to 1 petaFLOP, FP4, with structured sparsity, theoretical peak.” Read it as the optimistic ceiling for the most favorable workload, the way you read any vendor peak-FLOP figure.
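As a rough illustration of how to unpack the headline, the sketch below backs out a dense FP4 ceiling from the sparse one. The factor of 2 comes from the "roughly doubling" description above, so treat the result as an approximation rather than a published spec; the names are hypothetical.

```python
HEADLINE_PFLOPS = 1.0    # the marketing figure: FP4, sparse, theoretical peak
SPARSITY_SPEEDUP = 2.0   # assumption: "roughly doubling", taken as exactly 2x


def dense_peak_pflops(sparse_peak, sparsity_speedup=SPARSITY_SPEEDUP):
    """Estimate the dense ceiling by removing the assumed sparsity speedup."""
    return sparse_peak / sparsity_speedup


dense_fp4 = dense_peak_pflops(HEADLINE_PFLOPS)
print(f"Dense FP4 ceiling: ~{dense_fp4:.1f} PFLOP ({dense_fp4 * 1000:.0f} TFLOPS)")
```

This is still a theoretical peak. Real kernels land below it, and higher precisions land lower again.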

Why FP4 matters for what you can run

Precision interacts directly with the unified memory story. Lower precision shrinks the weights:

Precision	Bytes per parameter	70B model weights
FP16	2	~140 GB
FP8	1	~70 GB
FP4	0.5	~35 GB

FP4 does double duty: it makes the Tensor Cores faster and it halves the memory footprint versus FP8, so larger models fit in the 128 GB pool. The cost is reduced numerical precision, and with it some model accuracy, which good quantization schemes (and formats like NVFP4) work hard to minimize. NVIDIA ships an NVFP4 quantization playbook for exactly this.
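The table is just parameters times bytes per parameter, which is easy to redo for other model sizes. The sketch below is a minimal version with hypothetical names. It counts weights only, so it ignores the KV cache, activations, any per-block scale factors a quantization format stores, and whatever the operating system takes from the shared pool.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}
POOL_GB = 128  # unified memory pool size from the text


def weight_gb(params_billions, precision):
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billions * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    gb = weight_gb(70, precision)
    verdict = "fits" if gb <= POOL_GB else "does not fit"
    print(f"70B at {precision}: ~{gb:.0f} GB, {verdict} in {POOL_GB} GB (weights only)")
```

In practice you want meaningful headroom beyond the weights, so a model whose weights alone come close to 128 GB should be treated as not fitting.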

The takeaway

The petaFLOP claim is a fair description of the chip’s FP4 ceiling. When you are reasoning about real workloads, anchor on two things instead: how much memory your model needs at the precision you will actually run, and the fact that token generation is bandwidth-bound, because each generated token requires reading the model’s weights from memory. Those two facts predict your experience far better than the peak-FLOP headline.
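A back-of-the-envelope way to apply the bandwidth-bound point: if every generated token has to stream all the weights from memory once, memory bandwidth divided by weight size gives an upper bound on single-stream tokens per second. The sketch below assumes a dense model and batch size 1. The bandwidth value is an assumption that does not come from this section, so substitute the figure for your own hardware.

```python
def decode_tokens_per_second_ceiling(weights_gb, bandwidth_gb_per_s):
    """Upper bound on tokens/s when each token reads all weights once.

    Assumes a dense model and batch size 1; ignores KV-cache traffic and
    compute time, so real throughput will be lower.
    """
    return bandwidth_gb_per_s / weights_gb


ASSUMED_BANDWIDTH_GB_S = 273.0  # assumption, not stated in this section

# Weights-only sizes for a 70B model, taken from the table above.
for precision, gb in [("FP8", 70.0), ("FP4", 35.0)]:
    ceiling = decode_tokens_per_second_ceiling(gb, ASSUMED_BANDWIDTH_GB_S)
    print(f"70B at {precision}: at most ~{ceiling:.1f} tokens/s")
```

Note what this implies: halving the bytes per parameter doubles the ceiling, regardless of how many FLOPS the Tensor Cores can deliver. That is why precision and memory, not the peak-FLOP figure, dominate the generation speed you actually see.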