Hardware at a glance
Spec sheet
| Component | Spec |
|---|---|
| Superchip | NVIDIA GB10 Grace Blackwell (CPU + GPU in one package) |
| CPU | 20-core Arm: 10× Cortex-X925 + 10× Cortex-A725 (MediaTek IP) |
| GPU | Blackwell, 6,144 CUDA cores, 5th-gen Tensor Cores + RT Cores |
| Compute capability | sm_121 (matters when building NCCL or custom CUDA) |
| AI performance | up to 1 petaFLOP FP4 (theoretical, with sparsity) |
| Memory | 128 GB LPDDR5X, unified and coherent CPU↔GPU |
| CPU↔GPU link | NVLink-C2C, roughly 5× PCIe Gen5 bandwidth |
| Storage | 4 TB NVMe |
| Networking | dual ConnectX-7 200GbE (QSFP56), plus 10GbE RJ-45 and Wi-Fi |
| OS | DGX OS (Ubuntu base), ARM64 |
| Single-unit model size | up to ~200B parameters |
| Two units | up to ~405B parameters over 200GbE |
| Four units (quad stack) | 512 GB aggregate memory (4 × 128 GB), 4,000 TFLOPS FP4 (theoretical, with sparsity) |
What the numbers mean in practice
The headline “1 petaFLOP” is FP4 throughput with sparsity enabled. It is the relevant figure for low-precision inference, but not one you will hit on every workload. Treat it as the ceiling for quantized inference, not a general compute rating.
The 128 GB of unified memory is the spec that changes how you work. It is the reason a 70B model at 4-bit (~40 GB) loads with room to spare for KV cache, where a 24 GB discrete card would force aggressive quantization or multi-card sharding. The deeper discussion lives in unified memory.
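The arithmetic behind those sizes can be sketched in a few lines. This is a rough estimate, not a measurement: the helper names `weight_footprint_gb` and `fits` are hypothetical, and the calculation covers the weights only. Real quantized checkpoints also carry scales and metadata and often keep some layers at higher precision, which is why a 70B model at 4-bit lands nearer ~40 GB than the 35 GB computed here. KV cache and runtime overhead come on top and are passed in as an assumed reserve.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    """
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         memory_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the weights plus an assumed reserve (KV cache, runtime) fit."""
    return weight_footprint_gb(params_billions, bits_per_weight) + reserve_gb <= memory_gb


if __name__ == "__main__":
    # (label, parameters in billions, memory budget in GB)
    cases = [
        ("70B on a 24 GB card", 70, 24),
        ("70B on one unit", 70, 128),
        ("200B on one unit", 200, 128),
        ("405B on two units", 405, 256),
    ]
    for label, params, budget in cases:
        size = weight_footprint_gb(params, 4)
        verdict = "fits" if fits(params, 4, budget) else "does not fit"
        print(f"{label}: {size:.1f} GB of 4-bit weights, {verdict} in {budget} GB")
```

Passing a non-zero `reserve_gb` is the quickest way to see how much headroom a given context length leaves. The right reserve depends on the model and serving stack, so treat any value you plug in as a guess until you measure it.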
The ARM64 / DGX OS detail matters the first time you pull containers or binaries. Reach for aarch64 / arm64 image tags, not x86 builds.
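A small guard in a setup script can catch an architecture mismatch before a pull or install fails. The sketch below uses only the standard library. The helper names `is_arm64` and `docker_platform` are hypothetical, and the mapping covers just the two architectures discussed here.

```python
import platform
from typing import Optional

ARM64_NAMES = {"aarch64", "arm64"}   # Linux reports aarch64, macOS reports arm64
X86_64_NAMES = {"x86_64", "amd64"}


def is_arm64(machine: Optional[str] = None) -> bool:
    """True if the given (or current) machine string names a 64-bit Arm CPU."""
    name = machine if machine is not None else platform.machine()
    return name.lower() in ARM64_NAMES


def docker_platform(machine: Optional[str] = None) -> str:
    """Map a machine string to the value expected by `docker pull --platform`."""
    name = (machine if machine is not None else platform.machine()).lower()
    if name in ARM64_NAMES:
        return "linux/arm64"
    if name in X86_64_NAMES:
        return "linux/amd64"
    raise ValueError(f"unrecognized architecture: {name!r}")


if __name__ == "__main__":
    print(f"machine: {platform.machine()}, arm64: {is_arm64()}")
```

On the device itself, `platform.machine()` should report `aarch64`, so `docker_platform()` would return `linux/arm64`. If an image has no arm64 variant, a pull with that platform value fails up front, before an x86 image can be fetched by mistake.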
The ConnectX-7 ports are Ethernet-only (no InfiniBand mode), and they are how you build a two- or four-Spark cluster. The networking specifics are covered in multi-Spark networking and the connect two Sparks how-to.