Building NCCL for Spark clusters

NCCL (the NVIDIA Collective Communications Library) is what runs the all-reduce, all-gather, and broadcast operations behind distributed training and tensor-parallel inference. On the DGX Spark it needs special handling.

Why build from source

Two reasons:

Blackwell compute capability. The Spark’s GPU is sm_121. A generic NCCL build that does not target this architecture will not give you working, performant collectives. You must compile NCCL with sm_121 support.
Three-node ring support. As of the June 2026 update, the three-Spark ring topology requires a manually built NCCL (2.30u1 or newer). The two-node direct setup is more forgiving, but the ring is not.
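
As a minimal sketch of the build step, assuming an NCCL source tree is already checked out at a version that meets the requirement above and that the installed CUDA toolkit recognizes sm_121: NCCL's Makefile takes its target architectures from the NVCC_GENCODE variable. NCCL_SRC, JOBS, and DRY_RUN are hypothetical helper variables for this example, not NCCL settings, and the script only prints the command unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: compose (and optionally run) an NCCL build that targets sm_121.
# NCCL_SRC, JOBS and DRY_RUN are hypothetical helper variables.

NCCL_SRC="${NCCL_SRC:-./nccl}"   # assumed location of the NCCL source tree
JOBS="${JOBS:-8}"                # parallel make jobs
DRY_RUN="${DRY_RUN:-1}"          # 1 = print the command only, 0 = run it

nccl_gencode() {
    # $1 is the compute capability without the dot, e.g. 121 for sm_121
    printf '%s\n' "-gencode=arch=compute_$1,code=sm_$1"
}

if [ "$DRY_RUN" = "1" ]; then
    echo "make -C $NCCL_SRC -j$JOBS src.build NVCC_GENCODE=\"$(nccl_gencode 121)\""
else
    make -C "$NCCL_SRC" -j"$JOBS" src.build NVCC_GENCODE="$(nccl_gencode 121)"
fi
```

Restricting NVCC_GENCODE to the one architecture you need also shortens the build considerably compared with compiling for every supported GPU.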

Route collectives over the fast fabric

Building NCCL is only half the job. By default, NCCL traffic does not prefer the 200GbE CX-7 interfaces over the 1GbE management network. You must pin it explicitly, and the interface name must be identical on every node:

# find the right interface on each node first
ibdev2netdev

# then set all three variables to the SAME interface name on every node
export NCCL_SOCKET_IFNAME=<cx7-iface>
export UCX_NET_DEVICES=<cx7-iface>
export OMPI_MCA_btl_tcp_if_include=<cx7-iface>

A mismatch here is the single most common reason multi-node tests fail. The interface name differs between documented setups (the ring example uses enP7s7, the two-Spark setup uses enp1s0f1np1), so never copy the literal name; always confirm it with ibdev2netdev on each node.
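
The check described above can be scripted. The sketch below assumes ibdev2netdev prints one line per port in the form "<ibdev> port <n> ==> <netdev> (Up)" or "(Down)"; up_ifaces and pins_agree are hypothetical helper names. Run it on every node and compare the results by eye.

```shell
#!/usr/bin/env bash
# Sketch: list the netdevs ibdev2netdev reports as Up, then confirm that the
# three pinning variables agree. up_ifaces and pins_agree are hypothetical.

up_ifaces() {
    # Assumed line shape: "<ibdev> port <n> ==> <netdev> (Up|Down)"
    awk '$4 == "==>" && $6 == "(Up)" { print $5 }'
}

pins_agree() {
    # Succeeds only if all three variables hold the same non-empty value.
    [ -n "${NCCL_SOCKET_IFNAME:-}" ] &&
    [ "${NCCL_SOCKET_IFNAME:-}" = "${UCX_NET_DEVICES:-}" ] &&
    [ "${NCCL_SOCKET_IFNAME:-}" = "${OMPI_MCA_btl_tcp_if_include:-}" ]
}

# Only touch the real tool when it is installed on this machine.
if command -v ibdev2netdev >/dev/null 2>&1; then
    echo "Up interfaces on $(hostname):"
    ibdev2netdev | up_ifaces
    if pins_agree; then
        echo "pinned to: $NCCL_SOCKET_IFNAME"
    else
        echo "pinning variables are unset or disagree" >&2
    fi
fi
```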

Drive both logical halves

Each 200G CX-7 port is wired as two PCIe Gen5 x4 links, so it appears as two ~100G logical interfaces. A single flow tops out near 100G. Reaching the full ~190-198G per node pair requires parallel sessions across both halves, MTU 9000 end to end, and IPv6 disabled on the CX-7 interfaces (which also keeps the RoCE GID indices consistent). See multi-Spark networking for the full picture.
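
Before running a throughput test, it helps to confirm the two prerequisites on each logical interface. This is a sketch that reads the standard Linux sysfs and procfs entries; check_fabric_iface, SYSFS_ROOT, PROC_ROOT, and CX7_IFACE are hypothetical names, and the two root variables exist only so the check can be pointed at a fake directory tree.

```shell
#!/usr/bin/env bash
# Sketch: verify MTU 9000 and disabled IPv6 on one CX-7 logical interface.
# Typical ways to set them (run as root, once per logical half):
#   ip link set dev <cx7-iface> mtu 9000
#   sysctl -w net.ipv6.conf.<cx7-iface>.disable_ipv6=1

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
PROC_ROOT="${PROC_ROOT:-/proc}"

check_fabric_iface() {
    cfi_iface="$1"
    cfi_rc=0
    cfi_mtu="$(cat "$SYSFS_ROOT/class/net/$cfi_iface/mtu" 2>/dev/null)" || {
        echo "$cfi_iface: no such interface" >&2
        return 2
    }
    # A missing file means IPv6 is off kernel-wide, which counts as disabled.
    cfi_v6="$(cat "$PROC_ROOT/sys/net/ipv6/conf/$cfi_iface/disable_ipv6" 2>/dev/null || echo 1)"
    if [ "$cfi_mtu" != "9000" ]; then
        echo "$cfi_iface: MTU is $cfi_mtu, expected 9000"
        cfi_rc=1
    fi
    if [ "$cfi_v6" != "1" ]; then
        echo "$cfi_iface: IPv6 is still enabled"
        cfi_rc=1
    fi
    return "$cfi_rc"
}

# Example: CX7_IFACE=<cx7-iface> ./check_fabric.sh
if [ -n "${CX7_IFACE:-}" ]; then
    check_fabric_iface "$CX7_IFACE" && echo "$CX7_IFACE: ok"
fi
```

Because MTU must match end to end, a passing check on one node says nothing about its peers or any switch in between; run it for both halves on every node.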

In Docker

When the workload runs in a container, give it host networking and the RoCE device:

docker run --network=host --device=/dev/infiniband --ulimit memlock=-1 ...
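
Host networking does not carry the host's environment into the container, so the three pinning variables still have to be passed in. A sketch of one way to do that, using docker's -e VAR form (which forwards the host's current value); docker_fabric_args is a hypothetical helper, and the image and command are left as placeholders for your own:

```shell
#!/usr/bin/env bash
# Sketch: emit the docker run flags from the text plus -e pass-through for
# the three pinning variables. docker_fabric_args is a hypothetical helper.

docker_fabric_args() {
    printf '%s\n' --network=host --device=/dev/infiniband --ulimit memlock=-1
    for dfa_var in NCCL_SOCKET_IFNAME UCX_NET_DEVICES OMPI_MCA_btl_tcp_if_include; do
        # "-e NAME" with no value forwards the host's value of NAME.
        printf '%s\n' -e "$dfa_var"
    done
}

# Usage (image and command are placeholders):
#   docker run $(docker_fabric_args) <image> <command>
```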

Reference

NVIDIA ships an nccl-two-sparks playbook and (under the multi-node updates) ring-aware examples. Start from the official playbooks and the Sync Cluster Assistant, which automates the network half so you only have to handle the NCCL build.