Hardware at a glance

Spec sheet

Component	Spec
Superchip	NVIDIA GB10 Grace Blackwell (CPU + GPU in one package)
CPU	20-core Arm: 10× Cortex-X925 + 10× Cortex-A725 (co-designed with MediaTek)
GPU	Blackwell, 6,144 CUDA cores, 5th-gen Tensor Cores + RT Cores
Compute capability	`sm_121` (matters when building NCCL or custom CUDA)
AI performance	up to 1 petaFLOP FP4 (theoretical, with sparsity)
Memory	128 GB LPDDR5X, unified and coherent CPU↔GPU
CPU↔GPU link	NVLink-C2C, roughly 5× PCIe Gen5 bandwidth
Storage	4 TB NVMe
Networking	dual ConnectX-7 200GbE (QSFP56), plus 10GbE RJ-45 and Wi-Fi
OS	DGX OS (Ubuntu base), ARM64
Single-unit model size	up to ~200B parameters
Two units	up to ~405B parameters over 200GbE
Four units (quad stack)	512 GB aggregate memory (4 × 128 GB), ~4 petaFLOPS FP4 (theoretical, with sparsity)
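
The `sm_121` entry in the table tends to show up in build flags under a few different spellings. As a rough sketch, the helper below derives the common ones from a compute capability of 12.1. The function name `arch_strings` and the dictionary keys are hypothetical, chosen for illustration; the `-gencode` form follows the usual nvcc convention, and the dotted form is the one PyTorch's `TORCH_CUDA_ARCH_LIST` expects. Check your toolchain's documentation before relying on any of them.

```python
def arch_strings(major: int, minor: int) -> dict:
    """Derive common build-flag spellings from a CUDA compute capability.

    Illustrative helper only; the key names are made up for this example.
    """
    digits = f"{major}{minor}"
    return {
        # Architecture name as it appears in the spec table
        "sm": f"sm_{digits}",
        # nvcc-style code generation flag
        "gencode": f"-gencode=arch=compute_{digits},code=sm_{digits}",
        # Dotted form, e.g. for TORCH_CUDA_ARCH_LIST
        "dotted": f"{major}.{minor}",
    }


flags = arch_strings(12, 1)
print(flags["sm"])
print(flags["gencode"])
print(flags["dotted"])
```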

What the numbers mean in practice

The headline “1 petaFLOP” is FP4 throughput with sparsity enabled, which is the right number for low-precision inference but not a figure you will hit on every workload. Think of it as the ceiling for quantized inference, not a general compute rating.

The 128 GB of unified memory is the spec that changes how you work. It is the reason a 70B model at 4-bit (~40 GB) loads with room to spare for KV cache, where a 24 GB discrete card would force more aggressive quantization or multi-card sharding. The deeper discussion lives under unified memory.
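
The arithmetic behind the ~40 GB figure can be sketched as follows. Raw 4-bit weights for 70B parameters come to 35 GB; the `overhead` factor of 1.15 is an assumption standing in for quantization scales and layers kept at higher precision, not a measured value. The function names are hypothetical.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), with no overhead."""
    return params_billion * bits_per_weight / 8


def headroom_gb(memory_gb: float, params_billion: float,
                bits_per_weight: float, overhead: float = 1.15) -> float:
    """Memory left for KV cache and activations after loading weights.

    `overhead` is an assumed fudge factor; a negative result means the
    model does not fit at this precision.
    """
    return memory_gb - weight_gb(params_billion, bits_per_weight) * overhead


raw = weight_gb(70, 4)
print(f"70B at 4-bit, raw weights: {raw:.0f} GB")
print(f"Headroom on 128 GB: {headroom_gb(128, 70, 4):.0f} GB")
print(f"Headroom on 24 GB:  {headroom_gb(24, 70, 4):.0f} GB")
```

A negative headroom on the 24 GB case is the point of the comparison: the weights alone exceed the card's memory before any KV cache is allocated.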

The ARM64 / DGX OS detail matters the first time you pull containers or binaries. Reach for aarch64 / arm64 image tags, not x86 builds.
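
One way to catch an architecture mismatch early is to check the machine type before pulling anything. The sketch below uses Python's standard `platform.machine()` and maps the result to the `linux/arm64` or `linux/amd64` strings that Docker's `--platform` option accepts. The helper name `docker_platform` is hypothetical, and the set of recognized machine names is an assumption that covers the common spellings only.

```python
import platform

# Spellings of the same architectures reported by different systems
ARM64_NAMES = {"aarch64", "arm64"}
X86_64_NAMES = {"x86_64", "amd64"}


def docker_platform(machine: str) -> str:
    """Map a machine string to a Docker --platform value."""
    name = machine.lower()
    if name in ARM64_NAMES:
        return "linux/arm64"
    if name in X86_64_NAMES:
        return "linux/amd64"
    raise ValueError(f"unrecognized machine type: {machine!r}")


if __name__ == "__main__":
    machine = platform.machine()
    try:
        print(f"{machine} -> {docker_platform(machine)}")
    except ValueError as err:
        print(err)
```

On DGX OS the reported machine type should be `aarch64`, so the helper would return `linux/arm64`.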

The ConnectX-7 ports are Ethernet-only (no InfiniBand mode) and are how you build a two- or four-Spark cluster. The networking specifics are covered in multi-Spark networking and the "connect two Sparks" how-to.
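
As a back-of-the-envelope check on when a cluster is needed, the sketch below picks the smallest configuration (one, two, or four units) whose combined memory holds a model's quantized weights. It is a rough sizing aid under stated assumptions: 4-bit weights by default, an assumed 1.15 overhead factor, and no allowance for KV cache or for the cost of sharding over the network. The name `units_needed` is hypothetical.

```python
UNIT_MEMORY_GB = 128          # per-unit memory from the spec sheet
CONFIGURATIONS = (1, 2, 4)    # unit counts listed in the spec sheet


def units_needed(params_billion: float, bits_per_weight: float = 4,
                 overhead: float = 1.15):
    """Smallest listed unit count whose total memory fits the weights.

    Returns None when even four units are not enough. `overhead` is an
    assumed factor; KV cache is deliberately ignored.
    """
    required_gb = params_billion * bits_per_weight / 8 * overhead
    for units in CONFIGURATIONS:
        if units * UNIT_MEMORY_GB >= required_gb:
            return units
    return None


for size in (70, 200, 405):
    print(f"{size}B at 4-bit -> {units_needed(size)} unit(s)")
```

Under these assumptions the results line up with the spec sheet: 200B fits on one unit and 405B needs two.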