Connect three Sparks (ring)

You have three DGX Sparks and want them to act as one machine without buying a switch. Three nodes is the sweet spot for switchless clustering: each Spark has two ConnectX-7 ports, so three of them wire into a ring where every node is directly cabled to the other two.

This was made official in the June 2026 update (NCCL 2.30u1 added three-node ring support), and the combined pool reaches 384 GB of unified memory (3 × 128 GB) for 400B-plus models.

What you need

Three DGX Sparks on DGX OS with NVIDIA drivers.
Three approved QSFP cables (the same Amphenol/Luxshare cables from the two-Spark how-to).
sudo access on all three, with matching usernames across nodes (multi-node tooling and passwordless SSH assume it).

Cable the ring

Each Spark has two CX-7 cages. Call the one nearest the Ethernet jack Port0 and the far one Port1. Wire Port0 of each node to Port1 of the next:

Node1 Port0 → Node2 Port1
Node2 Port0 → Node3 Port1
Node3 Port0 → Node1 Port1

That closes the loop. Every port carries a full 200GbE link, and each physical port presents two logical RoCE interfaces (four per machine), which matters for bandwidth tuning (see multi-Spark networking).
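The wiring rule generalizes to "Port0 of node i goes to Port1 of node i+1, wrapping at the end". The sketch below simply prints that plan so you can check your cabling against it. The `ring_plan` function name is made up for illustration, and the node and port labels are the ones used in this guide, not anything the hardware reports.

```shell
#!/bin/sh
# Print the Port0 -> Port1 cabling plan for an n-node ring.
# Labels (NodeN, Port0, Port1) follow this guide's naming convention only.
ring_plan() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        next=$(( i % n + 1 ))   # the last node wraps back to Node1
        echo "Node${i} Port0 -> Node${next} Port1"
        i=$(( i + 1 ))
    done
}

ring_plan 3
```

Each node appears exactly once on the left (its Port0) and once on the right (its Port1), which is why three cables use all six ports.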

Let the Cluster Assistant do the config (recommended)

The fastest path is the Cluster Assistant in NVIDIA Sync. Starting from devices already enrolled in Sync, it runs a guided workflow that handles the parts that are tedious to do by hand:

system readiness checks (OTA version, sudo access)
CX-7 topology detection (an LLDP/BPDU probe runs on each node in parallel)
IP planning, deconfliction, and netplan application
bandwidth and latency validation with ib_write_bw / ib_write_lat
passwordless SSH between nodes, keyed over the CX-7 fabric

When it finishes, the three nodes have a configured RoCE network and node-to-node SSH, ready for your workload.
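If you want to confirm the end state by hand, a minimal sketch like the one below checks two of the things listed above: key-based SSH and passwordless sudo. It is not part of the Cluster Assistant. The hostnames `spark1` to `spark3`, the `check_nodes` function, and the `SSH_CMD` override are all assumptions made for illustration, so substitute your own node names.

```shell
#!/bin/sh
# Hypothetical hostnames; replace with the names or fabric IPs of your nodes.
NODES="spark1 spark2 spark3"

# BatchMode=yes makes ssh fail instead of prompting for a password.
# SSH_CMD can be overridden, for example with a stub for a dry run.
SSH_CMD="${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

# Return 0 only if every node accepts key-based SSH and non-interactive sudo.
check_nodes() {
    failed=0
    for host in $NODES; do
        # sudo -n never prompts; it fails if a password would be required.
        if $SSH_CMD "$host" sudo -n true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Run `check_nodes` from each of the three nodes in turn, since node-to-node SSH has to work in every direction, not just from one machine.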

Build NCCL for the ring

Besides the sm_121 from-source build, the other recurring footgun is interface-name consistency. The ring example pins the fast interface to a different name than the two-Spark setup (enP7s7 vs enp1s0f1np1), and these three variables must be set to the same value on every node or the collective test fails:

export NCCL_SOCKET_IFNAME=<your-cx7-iface>
export UCX_NET_DEVICES=<your-cx7-iface>
export OMPI_MCA_btl_tcp_if_include=<your-cx7-iface>

Use ibdev2netdev on each node and confirm the name matches before running anything.
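That check can be scripted. The sketch below assumes `ibdev2netdev` prints one line per port in its usual form (`mlx5_0 port 1 ==> enp1s0f0np0 (Up)`); the function names `up_ifaces` and `iface_is_up` are hypothetical helpers, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Read ibdev2netdev-style text on stdin and print the netdev name of every
# link reported as Up. Assumed line format:
#   <ibdev> port <n> ==> <netdev> (Up|Down)
up_ifaces() {
    awk '$4 == "==>" && $6 == "(Up)" { print $5 }' | sort
}

# Succeed only if interface $1 is listed as Up in the text on stdin.
iface_is_up() {
    up_ifaces | grep -qxF -- "$1"
}

# Example use on one node (not executed here):
#   ibdev2netdev | iface_is_up "$NCCL_SOCKET_IFNAME" || echo "interface mismatch"
```

Running the same one-liner on all three nodes before launching a collective catches the case where one node exposes the CX-7 port under a different name, or where the link is down.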

Verify

Run an NCCL all_reduce across the three nodes. A healthy ring lands in the ~190 Gb/s class once you are driving both logical halves of each port (more on that in multi-Spark networking). From here the cluster is ready for distributed inference (vLLM/Ray, TensorRT-LLM) or distributed training.
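When comparing your result against that figure, watch the units. The helpers below assume you are running `all_reduce_perf` from nccl-tests, which reports algorithm bandwidth (algbw) and bus bandwidth (busbw) in GB/s rather than Gb/s; the function names are illustrative. The bus-bandwidth factor for all-reduce, 2(n-1)/n, is the one nccl-tests documents.

```shell
#!/bin/sh
# Convert a bandwidth in GB/s (the unit nccl-tests prints) to Gb/s.
gbytes_to_gbits() {
    LC_ALL=C awk -v x="$1" 'BEGIN { printf "%.1f\n", x * 8 }'
}

# Bus bandwidth for all-reduce on n ranks: algbw * 2 * (n - 1) / n.
# $1 = algbw in GB/s, $2 = number of ranks.
allreduce_busbw() {
    LC_ALL=C awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

gbytes_to_gbits 23.75     # prints 190.0
allreduce_busbw 15 3      # prints 20.00
```

So a reading of roughly 23.75 GB/s corresponds to 190 Gb/s, and with three ranks the bus bandwidth is 4/3 of the algorithm bandwidth.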

Going to four

A four-node cluster generally needs a managed 200GbE RoCE switch, because with two ports per node you run out of ports before you can build a full mesh. That is a different setup with its own tradeoffs, covered in multi-Spark networking.

Why this works: Topology by node count, and the 200G-is-really-two-100G-halves tuning detail.

Build NCCL for the ring: The sm_121 from-source build and the interface-name pinning that trips people up.

The two-Spark case: The simpler direct-cable setup, if you only have a pair.