Connect three Sparks (ring)

You have three DGX Sparks and want them to act as one machine without buying a switch. Three nodes is the sweet spot for switchless clustering: each Spark has two ConnectX-7 ports, so three of them wire into a ring where every node is directly cabled to the other two.

The topology became official in the June 2026 update (NCCL 2.30u1 added three-node ring support), and the combined pool reaches 384 GB of unified memory (3 × 128 GB) for 400B-plus models.

  • Three DGX Sparks on DGX OS with NVIDIA drivers.
  • Three approved QSFP cables (the same Amphenol/Luxshare cables from the two-Spark how-to).
  • sudo on all three, matching usernames across nodes (multi-node tooling and passwordless SSH assume it).

Each Spark has two ConnectX-7 (CX-7) cages. Call the one nearest the Ethernet jack Port0 and the far one Port1. Wire Port0 of each node to Port1 of the next:

  1. Node1 Port0 → Node2 Port1
  2. Node2 Port0 → Node3 Port1
  3. Node3 Port0 → Node1 Port1

That closes the loop. Every port carries a full 200GbE link, and each physical port presents two logical RoCE interfaces (four per machine), which matters for bandwidth tuning (see multi-Spark networking).
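
The wiring rule above (Port0 of each node to Port1 of the next, with the last node wrapping back to the first) can be sketched as a small shell function. This is only an illustration for double-checking cabling against a printed plan; the function name `ring_plan` is made up for this example and is not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ring cabling plan for N nodes.
# Port0 of node i is cabled to Port1 of node i+1; the last node wraps to Node1.
ring_plan() {
  local n="$1" i next
  for ((i = 1; i <= n; i++)); do
    next=$(( i % n + 1 ))   # wrap around after the last node
    echo "Node${i} Port0 -> Node${next} Port1"
  done
}

ring_plan 3
```

With three nodes this prints the same three cable runs as the list above, so each node ends up with one cable in each cage.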

Let the Cluster Assistant do the config (recommended)

The fastest path is the Cluster Assistant in NVIDIA Sync. Starting from devices already enrolled in Sync, it runs a guided workflow that handles the parts that are tedious to do by hand:

  • system readiness checks (OTA version, sudo access)
  • CX-7 topology detection (an LLDP/BPDU probe runs on each node in parallel)
  • IP planning, deconfliction, and netplan application
  • bandwidth and latency validation with ib_write_bw / ib_write_lat
  • passwordless SSH between nodes, keyed over the CX-7 fabric
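
To give a feel for what the IP planning step has to produce, here is a minimal sketch that assigns one point-to-point subnet per cable in the ring. It is not the Cluster Assistant's actual scheme: the `192.168.10x.0/30` ranges and the function name `plan_ips` are placeholders chosen for illustration, and the sketch ignores the fact that each physical port presents two logical interfaces.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: one /30 subnet per ring link.
# Link k joins Node k Port0 to Port1 of the next node (wrapping at the end).
# The 192.168.(100+k).0/30 ranges are placeholders, not what NVIDIA Sync assigns.
plan_ips() {
  local n="$1" k next
  for ((k = 1; k <= n; k++)); do
    next=$(( k % n + 1 ))
    echo "link${k}: Node${k} Port0 192.168.$((100 + k)).1/30 <-> Node${next} Port1 192.168.$((100 + k)).2/30"
  done
}

plan_ips 3
```

Giving each link its own subnet keeps the two ends of every cable unambiguous, which is the kind of deconfliction that is tedious to do by hand.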

When it finishes, the three nodes have a configured RoCE network and node-to-node SSH, ready for your workload.

A recurring footgun is interface-name consistency. The ring example pins the fast interface to a different name than the two-Spark setup (enP7s7 vs enp1s0f1np1), and the following three variables must be set to the same interface name on every node, or the collective test fails:

Terminal window
export NCCL_SOCKET_IFNAME=<your-cx7-iface>
export UCX_NET_DEVICES=<your-cx7-iface>
export OMPI_MCA_btl_tcp_if_include=<your-cx7-iface>

Use ibdev2netdev on each node and confirm the name matches before running anything.
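
One way to make that check mechanical is sketched below. It assumes you have already noted the interface name reported on each node and pass the names in by hand; `check_iface_names` is a hypothetical helper, not an existing tool. It prints the three export lines only when every name matches.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pass the CX-7 interface name seen on each node.
# Prints the three export lines only if every node reported the same name.
check_iface_names() {
  local first="$1" name
  if [ -z "$first" ]; then
    echo "no interface names given" >&2
    return 2
  fi
  for name in "$@"; do
    if [ "$name" != "$first" ]; then
      echo "mismatch: $name != $first" >&2
      return 1
    fi
  done
  echo "export NCCL_SOCKET_IFNAME=$first"
  echo "export UCX_NET_DEVICES=$first"
  echo "export OMPI_MCA_btl_tcp_if_include=$first"
}

# Example: all three nodes report the ring example's interface name.
check_iface_names enP7s7 enP7s7 enP7s7
```

If one node reports a different name, the function prints the mismatch to stderr and returns non-zero instead of emitting exports, so the problem surfaces before the collective test does.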

Run an NCCL all_reduce across the three nodes. A healthy ring lands in the ~190 Gb/s class once you are driving both logical halves of each port (more on that in multi-Spark networking). From here the cluster is ready for distributed inference (vLLM/Ray, TensorRT-LLM) or distributed training.
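
When reading the result, keep the units straight: nccl-tests reports bandwidth in GB/s (gigabytes), while the figure above is in Gb/s (gigabits). The sketch below does the conversion and compares against a floor. The helper names and the 180 Gb/s floor used in the example are assumptions for illustration, not thresholds from NVIDIA.

```shell
#!/usr/bin/env bash
# Convert a bandwidth figure from GB/s (gigabytes) to Gb/s (gigabits).
to_gbits() {
  awk -v gbytes="$1" 'BEGIN { printf "%.1f\n", gbytes * 8 }'
}

# Succeed if a GB/s figure reaches a floor given in Gb/s.
# The floor is whatever you choose; 180 below is only an example value.
meets_floor() {
  awk -v gbytes="$1" -v floor="$2" 'BEGIN { exit !(gbytes * 8 >= floor) }'
}

to_gbits 23.75          # prints 190.0
if meets_floor 23.75 180; then
  echo "within the expected class"
fi
```

A reading far below the floor is a hint to revisit the interface variables and the per-port tuning before blaming the workload.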

Four nodes generally need a managed 200GbE RoCE switch, because you run out of ports before you can build a full mesh. That is a different setup with its own tradeoffs, covered in multi-Spark networking.