Hardware at a glance

| Component | Spec |
| --- | --- |
| Superchip | NVIDIA GB10 Grace Blackwell (CPU + GPU in one package) |
| CPU | 20-core Arm: 10× Cortex-X925 + 10× Cortex-A725 (MediaTek IP) |
| GPU | Blackwell, 6,144 CUDA cores, 5th-gen Tensor Cores + RT Cores |
| Compute capability | sm_121 (matters when building NCCL or custom CUDA) |
| AI performance | up to 1 petaFLOP FP4 (theoretical, with sparsity) |
| Memory | 128 GB LPDDR5X, unified and coherent CPU↔GPU |
| CPU↔GPU link | NVLink-C2C, roughly 6× PCIe Gen5 bandwidth |
| Storage | 4 TB NVMe |
| Networking | dual ConnectX-7 200GbE (QSFP56), plus 1GbE management and Wi-Fi |
| OS | DGX OS (Ubuntu base), ARM64 |
| Single-unit model size | up to ~200B parameters |
| Two units | up to ~405B parameters over 200GbE |
| Four units (quad stack) | 512 GB unified memory, 4,000 TFLOPS FP4 |

The headline “1 petaFLOP” is FP4 throughput with sparsity enabled, which is the right number for low-precision inference but not a figure you will hit on every workload. Think of it as the ceiling for quantized inference, not a general compute rating.

The 128 GB of unified memory is the spec that changes how you work. It is the reason a 70B model at 4-bit (~40 GB) loads with room to spare for KV cache, where a 24 GB discrete card would force aggressive quantization or multi-card sharding. The deeper discussion lives in unified memory.
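As a rough illustration of that sizing argument, the sketch below estimates the footprint of the weights alone from parameter count and bit width. The function names and the 16 GB reserve for the OS and runtime are assumptions made for this example, not measured values. Raw 4-bit weights for a 70B model come to 35 GB; the ~40 GB quoted above is plausibly the same model once quantization scales and any tensors kept at higher precision are counted.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the raw weights in decimal GB (no scales, no KV cache)."""
    return n_params * bits_per_weight / 8 / 1e9


def headroom_gb(
    n_params: float,
    bits_per_weight: float,
    memory_gb: float = 128.0,   # unified memory per unit, from the spec table
    reserve_gb: float = 16.0,   # ASSUMPTION: set aside for OS and runtime
) -> float:
    """Memory left for KV cache and activations; negative means it will not fit."""
    return memory_gb - reserve_gb - weight_footprint_gb(n_params, bits_per_weight)


if __name__ == "__main__":
    for label, params, bits in [
        ("70B @ 4-bit", 70e9, 4),
        ("70B @ 16-bit", 70e9, 16),
        ("200B @ 4-bit", 200e9, 4),
    ]:
        print(f"{label}: weights {weight_footprint_gb(params, bits):.1f} GB, "
              f"headroom {headroom_gb(params, bits):.1f} GB")
```

The estimate is deliberately coarse: it ignores KV cache growth with context length and per-framework overhead, so treat a small positive headroom as "might fit", not as a guarantee.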

The ARM64 / DGX OS detail matters the first time you pull containers or binaries. Reach for aarch64 / arm64 image tags, not x86 builds.
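One way to avoid pulling the wrong build is to derive the architecture from the machine itself instead of hard-coding it. The sketch below maps the value reported by Python's `platform.machine()` to the `arm64` / `amd64` names used in container platform strings. The helper names and the alias table are assumptions for this example; extend the table if your hosts report something else.

```python
import platform
from typing import Optional

# Kernel-reported machine names mapped to container architecture names.
# ASSUMPTION: these four aliases cover the hosts you care about.
ARCH_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
    "amd64": "amd64",
}


def container_arch(machine: Optional[str] = None) -> str:
    """Return 'arm64' or 'amd64' for the given (or current) machine name."""
    name = (machine or platform.machine()).lower()
    if name not in ARCH_ALIASES:
        raise ValueError(f"unrecognized machine architecture: {name!r}")
    return ARCH_ALIASES[name]


def platform_string(machine: Optional[str] = None) -> str:
    """Build an OCI-style platform string such as 'linux/arm64'."""
    return f"linux/{container_arch(machine)}"


if __name__ == "__main__":
    print(platform_string())
```

The resulting string is in the form accepted by `docker pull --platform`, so a setup script can pass it through and fail early on a host it does not recognize.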

The ConnectX-7 ports are Ethernet-only (no InfiniBand mode), and they are how you build a two- or four-Spark cluster. The networking specifics are covered in multi-Spark networking and the connect two Sparks how-to.