Unified memory, and why it changes the math
The single most important thing about the DGX Spark is how it handles memory. Understanding it explains most of what the box is good at and where it falls short.
The normal GPU memory model
On a conventional workstation, the CPU has its own system RAM and the GPU has its own VRAM. They are separate pools joined by a PCIe bus. When you load a model onto the GPU, every byte of weights and KV cache has to fit inside the GPU’s VRAM. If the model is bigger than VRAM, you have three options, all unpleasant: quantize the model down until it fits, split it across multiple GPUs, or offload layers to CPU RAM and pay a steep PCIe-transfer penalty on every token.
That VRAM number is a hard wall. A 24 GB card simply cannot hold a 70B model at FP16, no matter how much system RAM the machine has.
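The arithmetic behind that wall is simple enough to sketch. The snippet below is a rough estimate only: it counts weights alone (parameter count times bytes per parameter) and ignores KV cache, activations, and runtime overhead, so real usage is higher. The function names and the precision table are illustrative, not taken from any library.

```python
# Approximate weight footprint: parameter count x bytes per parameter.
# Weights only; KV cache, activations, and runtime overhead are ignored.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Weight size in decimal GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9


def fits(params_billions: float, precision: str, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_gb(params_billions, precision) <= memory_gb


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_gb(70, precision)
        ok = fits(70, precision, 24)
        print(f"70B @ {precision}: {size:.0f} GB of weights, fits in 24 GB: {ok}")
```

At FP16 a 70B model needs about 140 GB for weights alone, nearly six times what a 24 GB card offers, and even at 4-bit it still does not fit.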
What the GB10 does instead
The GB10 Grace Blackwell Superchip puts the Grace CPU and the Blackwell GPU in one package and gives them a single shared 128 GB pool of LPDDR5X. The two are joined by NVLink-C2C, a coherent interconnect with roughly five times the bandwidth of PCIe Gen5. “Coherent” is the key word: the GPU and CPU see the same memory with the same addresses, so there is no copy step to move data from “CPU memory” to “GPU memory.” There is just memory.
The practical consequence is that the VRAM wall disappears. The GPU can draw on nearly the entire 128 GB pool, minus whatever the operating system and other processes are using. A 70B model at 4-bit (~35 GB of weights) loads with plenty of room left for a long context window, and you never had to split it across cards or offload layers to make it fit.
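To put a number on "plenty of room," here is a back-of-the-envelope KV cache estimate. It is a sketch under stated assumptions: a hypothetical Llama-style 70B configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128), an FP16 cache, and a single sequence. Other architectures and runtimes will give different figures, and the helper name is illustrative.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in decimal GB for one sequence.

    The leading factor of 2 covers the separate key and value tensors
    stored per layer.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_tokens / 1e9


if __name__ == "__main__":
    POOL_GB = 128.0     # total unified memory; the OS also takes a share
    WEIGHTS_GB = 35.0   # 70B parameters at 4-bit, weights only

    # Assumed (hypothetical) 70B-class config: 80 layers, 8 KV heads, head_dim 128.
    cache = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=32_768)
    print(f"KV cache at 32k tokens: {cache:.1f} GB")
    print(f"Weights + cache:        {WEIGHTS_GB + cache:.1f} GB of {POOL_GB:.0f} GB")
```

Under these assumptions a 32k-token context adds roughly 11 GB on top of the weights, which leaves most of the pool free.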
The catch: bandwidth, not capacity
Capacity is generous; bandwidth is the constraint. The Spark’s LPDDR5X delivers about 273 GB/s, well below the GDDR7 on a high-end discrete card or the HBM stacks on a datacenter GPU. Token generation is fundamentally a memory-bandwidth-bound operation: to produce each token the model must stream its weights through the compute units. Slower memory means fewer tokens per second.
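The bandwidth-bound argument gives a simple upper limit: if every generated token requires reading all the weights once, tokens per second cannot exceed memory bandwidth divided by weight size. The sketch below assumes a dense model, batch size 1, and no KV cache traffic, so measured throughput will be lower. The 273 GB/s figure is NVIDIA's published number for the Spark; the 1000 GB/s card is a hypothetical comparison point, not a specific product.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumes each token streams the full set of weights through the
    compute units exactly once, and that nothing else limits throughput.
    """
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    WEIGHTS_GB = 35.0  # 70B parameters at 4-bit

    systems = {
        "DGX Spark (273 GB/s, published)": 273.0,
        "hypothetical discrete card (1000 GB/s)": 1000.0,
    }
    for name, bandwidth in systems.items():
        ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
        print(f"{name}: at most {ceiling:.1f} tokens/s")
```

The comparison only applies when the model fits on the faster card in the first place; if it does not, the ceiling for that card is moot.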
So the honest framing of the Spark is this: it will hold models that a single discrete card cannot, and it will run them at moderate speed rather than blazing speed. For local development, prototyping, fine-tuning, and personal-scale serving, holding the model at all is the thing that matters, and moderate speed is fine. For maximum-throughput production serving of a model that already fits in a fast card’s VRAM, a discrete card wins.
Why this is the right tradeoff for a desk
The whole point of the Spark is to put genuinely large models within reach of one person at a desk, quietly, without a cloud bill or a server room. Trading peak bandwidth for the ability to load a 200B-parameter model locally (quantized to 4-bit, roughly 100 GB of weights) is exactly the bargain that makes that possible.