Building NCCL for Spark clusters

NCCL (the NVIDIA Collective Communications Library) is what runs the all-reduce, all-gather, and broadcast operations behind distributed training and tensor-parallel inference. On the DGX Spark it needs special handling.

Two reasons:

  1. Blackwell compute capability. The Spark’s GPU is sm_121. A generic NCCL build that does not target this architecture will not give you working, performant collectives. You must compile NCCL with sm_121 support.
  2. Three-node ring support. As of the June 2026 update, the three-Spark ring topology requires a manually built NCCL (2.30u1 or newer). The two-node direct setup is more forgiving, but the ring is not.
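As a rough illustration of the first point, the sketch below assembles the architecture flag and the build commands. It assumes the upstream NCCL repository on GitHub and its `src.build` make target with the `NVCC_GENCODE` variable. The helper names `nccl_gencode` and `build_nccl` and the `DRY_RUN` switch are made up for this example, and the release tag is left for you to supply because it depends on which NCCL version your topology needs.

```shell
#!/bin/sh
# Sketch only: build NCCL from source for the Spark's sm_121 GPU.
# nccl_gencode, build_nccl and DRY_RUN are hypothetical names for this example.

# Print the NVCC_GENCODE value for one compute capability (121 -> sm_121).
nccl_gencode() {
    printf '%s\n' "-gencode=arch=compute_$1,code=sm_$1"
}

# Usage: build_nccl <nccl-release-tag> [source-dir]
# Prints the commands when DRY_RUN=1 (the default); runs them when DRY_RUN=0.
build_nccl() {
    tag=${1:-}
    src=${2:-$HOME/nccl}
    if [ -z "$tag" ]; then
        echo "usage: build_nccl <nccl-release-tag> [source-dir]" >&2
        return 2
    fi
    gencode=$(nccl_gencode 121)
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "git clone -b $tag https://github.com/NVIDIA/nccl.git $src"
        echo "make -C $src -j src.build NVCC_GENCODE=\"$gencode\""
    else
        git clone -b "$tag" https://github.com/NVIDIA/nccl.git "$src" &&
            make -C "$src" -j src.build NVCC_GENCODE="$gencode"
    fi
}
```

Running `build_nccl <tag>` with the default `DRY_RUN=1` only prints the two commands, so you can review them before setting `DRY_RUN=0`.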

Building NCCL is only half the job. By default, NCCL traffic does not prefer the 200GbE CX-7 interfaces over the 1GbE management network. You must pin it explicitly, using the same interface name on every node:

Terminal window
# find the right interface on each node first
ibdev2netdev
# then set all three variables to that interface; the name must be the SAME on every node
export NCCL_SOCKET_IFNAME=<cx7-iface>
export UCX_NET_DEVICES=<cx7-iface>
export OMPI_MCA_btl_tcp_if_include=<cx7-iface>

A mismatch here is the single most common reason multi-node tests fail. The interface name differs between documented setups (the ring example uses enP7s7, the two-Spark setup uses enp1s0f1np1), so never copy the literal name. Always confirm it with ibdev2netdev on each node.
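One way to catch a mismatch before launching a job is a small per-node check like the sketch below. `up_ifaces` and `check_pinning` are hypothetical helpers, not part of NCCL, UCX, or Open MPI, and the parser assumes ibdev2netdev prints lines shaped like `<ibdev> port <n> ==> <netdev> (Up)`.

```shell
#!/bin/sh
# Sketch only: sanity-check interface pinning on one node.
# up_ifaces and check_pinning are hypothetical helper names.

# Read ibdev2netdev output on stdin and print the net devices reported as Up.
# Assumes the line shape "<ibdev> port <n> ==> <netdev> (Up)".
up_ifaces() {
    awk '$4 == "==>" && $6 == "(Up)" { print $5 }'
}

# Succeed only if all three variables are set and name the same interface.
# On success, print that interface name so it can be compared across nodes.
check_pinning() {
    if [ -z "${NCCL_SOCKET_IFNAME:-}" ]; then
        echo "NCCL_SOCKET_IFNAME is not set" >&2
        return 1
    fi
    for value in "${UCX_NET_DEVICES:-}" "${OMPI_MCA_btl_tcp_if_include:-}"; do
        if [ "$value" != "$NCCL_SOCKET_IFNAME" ]; then
            echo "pinning mismatch: '$value' != '$NCCL_SOCKET_IFNAME'" >&2
            return 1
        fi
    done
    echo "$NCCL_SOCKET_IFNAME"
}
```

On each node, `ibdev2netdev | up_ifaces` lists the candidate interfaces and `check_pinning` prints the pinned name. The printed name should be the same on every node.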

Each 200G CX-7 port is wired as two PCIe Gen5 x4 links, so it appears as two ~100G logical interfaces. A single flow tops out near 100G; you reach the full ~190-198G per node-pair only by running parallel sessions across both halves, with MTU 9000 end to end and IPv6 disabled on the CX-7 interfaces (which also keeps the RoCE GID indices consistent). See multi-Spark networking for the full picture.
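The MTU and IPv6 conditions can be checked per interface with something like the sketch below. It reads the standard Linux files under /sys/class/net and /proc/sys/net/ipv6/conf. `check_link` and `SYSROOT` are hypothetical names; `SYSROOT` exists only so the function can be pointed at a fake directory tree instead of the live system. Run it once for each logical interface of the port.

```shell
#!/bin/sh
# Sketch only: verify MTU 9000 and disabled IPv6 for one interface.
# check_link and SYSROOT are hypothetical names for this example.
check_link() {
    iface=$1
    root=${SYSROOT:-}
    mtu=$(cat "$root/sys/class/net/$iface/mtu" 2>/dev/null) || {
        echo "$iface: no such interface" >&2
        return 1
    }
    # A missing disable_ipv6 file is treated as "disabled" (assumption: the
    # IPv6 stack is switched off entirely on such a system).
    v6=$(cat "$root/proc/sys/net/ipv6/conf/$iface/disable_ipv6" 2>/dev/null || echo 1)
    status=0
    if [ "$mtu" -ne 9000 ]; then
        echo "$iface: MTU is $mtu, expected 9000" >&2
        status=1
    fi
    if [ "$v6" -ne 1 ]; then
        echo "$iface: IPv6 is still enabled" >&2
        status=1
    fi
    return "$status"
}
```

`check_link <cx7-iface>` returns non-zero and names the failing condition when either setting is off.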

When the workload runs in a container, give it host networking and the RoCE device:

Terminal window
docker run --network=host --device=/dev/infiniband --ulimit memlock=-1 ...
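Host networking does not copy the host's environment into the container, so the pinning variables most likely need to be passed in as well. The sketch below prints a docker command that adds them to the flags above. `spark_docker_cmd` is a hypothetical helper, and the image and command are whatever you pass as arguments.

```shell
#!/bin/sh
# Sketch only: print a docker run command that forwards the pinning variables.
# spark_docker_cmd is a hypothetical helper name.
# Usage: spark_docker_cmd <image> [command...]
spark_docker_cmd() {
    printf 'docker run --network=host --device=/dev/infiniband --ulimit memlock=-1'
    # "-e NAME" with no value tells docker to copy NAME from the calling environment.
    for name in NCCL_SOCKET_IFNAME UCX_NET_DEVICES OMPI_MCA_btl_tcp_if_include; do
        printf ' -e %s' "$name"
    done
    printf ' %s\n' "$*"
}
```

Because the function only prints the command, you can inspect it before running it in a shell where the three variables are already exported.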

NVIDIA ships an nccl-two-sparks playbook and (under the multi-node updates) ring-aware examples. Start from the official playbooks and the Sync Cluster Assistant, which automates the network half so you only have to handle the NCCL build.