Multi-Spark networking

A single Spark holds models up to about 200B parameters. Two of them can reach ~405B, and four can reach the largest open-weight models, because of the ConnectX-7 networking built into every unit. This page explains how that networking is put together, so the connect two Sparks how-to reads as a recipe rather than a mystery.

Two DGX Sparks stacked. Each unit contributes its 128 GB of unified memory to the pair. Image: ServeTheHome.

The link is ConnectX-7 ethernet running RoCE

Each Spark has a dual-port NVIDIA ConnectX-7 adapter capable of 200GbE. On the Spark these ports are ethernet-only; there is no InfiniBand mode. To get the low latency that distributed GPU work needs, you run RoCE (RDMA over Converged Ethernet) over them. RDMA lets one node read and write another node’s memory without going through the CPU and kernel network stack on every transfer, which is what keeps GPU-to-GPU communication fast enough to be worth doing.

For two units you connect the ports directly with an approved QSFP cable: a point-to-point link, no switch required. Three units can form a ring, and larger counts go through a switch.

Why the fast interface needs babysitting

A Spark has two very different networks: the 1GbE management interface you used for setup, and the 200GbE CX-7 link. Left to its defaults, software has no way to know that heavy GPU traffic belongs on the fast one. So multi-node setups explicitly pin the traffic: NCCL environment variables name the CX-7 interface, and the netplan configuration assigns it a stable address. Get this wrong and your “200GbE cluster” quietly runs collectives over the 1GbE management port at a fraction of the speed, with no error to tell you.
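As a minimal sketch of what pinning looks like from a Python launcher: NCCL_SOCKET_IFNAME, NCCL_IB_HCA and NCCL_DEBUG are standard NCCL environment variables, but the interface and RDMA device names below are hypothetical placeholders, and the helper nccl_env is made up for illustration. Substitute whatever your own units report.

```python
import os


def nccl_env(cx7_ifname: str, rdma_devices: list[str]) -> dict[str, str]:
    """Return NCCL environment variables that pin traffic to the CX-7 link."""
    if not cx7_ifname or not rdma_devices:
        raise ValueError("need an interface name and at least one RDMA device")
    return {
        # Network interface NCCL uses for bootstrap and socket traffic.
        "NCCL_SOCKET_IFNAME": cx7_ifname,
        # RDMA devices NCCL is allowed to use for the RoCE transport.
        "NCCL_IB_HCA": ",".join(rdma_devices),
        # Make NCCL log which interface and transport it actually picked.
        "NCCL_DEBUG": "INFO",
    }


# Hypothetical names, not taken from a real Spark.
env = nccl_env("enp1s0f0np0", ["rocep1s0f0"])
os.environ.update(env)  # do this before the first NCCL communicator is created
```

With NCCL_DEBUG set to INFO, the startup log shows the interface and transport NCCL selected, which is the quickest way to catch a silent fallback to the management port.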

This is also why interface identification matters so much in the how-to. Each physical QSFP port surfaces under more than one Linux interface name, and ibdev2netdev is the tool that maps RoCE devices to the right ethernet interface.
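To make that mapping concrete, here is a small sketch that parses ibdev2netdev output of the usual form "device port N ==> netdev (State)". The sample text is invented for illustration, not captured from a real Spark, and the device and interface names in it are assumptions.

```python
import re

# Matches lines such as: "rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)"
LINE_RE = re.compile(r"^(\S+)\s+port\s+(\d+)\s+==>\s+(\S+)\s+\((\w+)\)")


def parse_ibdev2netdev(text: str) -> list[dict]:
    """Turn ibdev2netdev output into a list of RDMA-device/netdev records."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            dev, port, netdev, state = m.groups()
            rows.append({
                "rdma_dev": dev,
                "port": int(port),
                "netdev": netdev,
                "up": state == "Up",
            })
    return rows


# Illustrative sample only; real names will differ.
SAMPLE = """\
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
"""

links = parse_ibdev2netdev(SAMPLE)
up = [r["netdev"] for r in links if r["up"]]
print(up)
```

On a real unit you would feed it the live output instead, for example from subprocess.run(["ibdev2netdev"], capture_output=True, text=True).stdout, and use the interfaces reported as Up when pinning traffic.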

NCCL has to match the silicon

GPU collective operations (all-reduce, all-gather, and friends) run through NCCL, the NVIDIA Collective Communications Library. On the Spark, NCCL must be built for the GPU's Blackwell architecture target, sm_121 (compute capability 12.1). A generic NCCL build that does not target this architecture will not give you working, performant collectives. This is a recurring theme on new silicon: the collective library has to be compiled for the exact GPU generation.
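For illustration, the architecture target is normally passed to the NCCL source build through the NVCC_GENCODE make variable. The sketch below only assembles that string; the helper nvcc_gencode is hypothetical, and whether a given NCCL release and CUDA toolkit accept the sm_121 target is something to verify on your own system.

```python
def nvcc_gencode(major: int, minor: int) -> str:
    """Build an nvcc -gencode flag for one compute capability."""
    cc = f"{major}{minor}"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


# Compute capability 12.1 corresponds to sm_121.
flag = nvcc_gencode(12, 1)

# Command line in the style of the NCCL source build; printed, not executed.
print(f'make -j src.build NVCC_GENCODE="{flag}"')
```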

Topology by node count

The whole topology question is dictated by one fact: each Spark has exactly two CX-7 ports. That is enough to cable directly to two neighbors, no more.

Nodes	Topology	Switch?	Pooled memory
2	direct cable	no	256 GB
3	ring (3 cables, both ports)	no	384 GB
4	star through a 200GbE RoCE switch	yes	512 GB

Two nodes is a single cable. Three nodes is the sweet spot for switchless clustering: wire them in a ring and, because three nodes in a ring are all mutually adjacent, it behaves like a full mesh with no multi-hop routing. The connect three Sparks how-to has the cabling.

Four is where it changes character. A full mesh of four nodes would need three ports per node, but each Spark has only two, so you need a managed 200GbE RoCE switch. The switch buys you the 512 GB pool for 400B-plus models, but it adds a hop of latency that eats into tensor-parallel performance. Some people chase switchless four-node meshes with 200GBASE-SR4 optics and MPO breakout cables to dodge that latency, but the switch is the supported, documented path.
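The port arithmetic behind the table can be written down in a few lines. This is a sketch of the reasoning only: a full mesh needs one port per neighbor, so it works without a switch exactly when nodes minus one fits in two ports. The function name plan and its return fields are made up for illustration.

```python
PORTS_PER_NODE = 2        # each Spark has exactly two CX-7 ports
MEMORY_GB_PER_NODE = 128  # unified memory per Spark


def plan(nodes: int) -> dict:
    """Work out whether a cluster of this size can be cabled without a switch."""
    if nodes < 2:
        raise ValueError("clustering needs at least two nodes")
    ports_for_full_mesh = nodes - 1  # one port per neighbor
    switchless = ports_for_full_mesh <= PORTS_PER_NODE
    return {
        "nodes": nodes,
        "ports_for_full_mesh": ports_for_full_mesh,
        "switch_required": not switchless,
        # One direct cable per node pair; not applicable once a switch is involved.
        "direct_cables": nodes * (nodes - 1) // 2 if switchless else None,
        "pooled_memory_gb": nodes * MEMORY_GB_PER_NODE,
    }


for n in (2, 3, 4):
    print(plan(n))
```

Two nodes need one cable, three need three, and four already ask for three ports per node, which is where the switch comes in.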

The catch: 200G is really two 100G halves

Here is the most important tuning detail, and the one that surprises people. The GB10 wires each CX-7 port not as a single PCIe Gen5 x8 link but as two x4 links. So each 200G port surfaces as two logical ~100G interfaces, and the whole NIC is PCIe-capped at 200G usable no matter how you cable it.
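A quick back-of-envelope check shows why an x4 link cannot carry a full 200G port on its own. The sketch uses the PCIe 5.0 figures of 32 GT/s per lane and 128b/130b encoding. These are raw link rates before packet overhead, so usable throughput is lower, which fits the ~100G per half described above. How much lower is not modeled here.

```python
GEN5_GT_PER_LANE = 32.0  # PCIe 5.0 signalling rate per lane, in GT/s
ENCODING = 128 / 130     # 128b/130b line encoding efficiency


def pcie_gen5_gbps(lanes: int) -> float:
    """Raw PCIe Gen5 link rate in Gb/s, before packet (TLP) overhead."""
    return GEN5_GT_PER_LANE * lanes * ENCODING


x4 = pcie_gen5_gbps(4)
x8 = pcie_gen5_gbps(8)
print(f"x4: {x4:.1f} Gb/s, x8: {x8:.1f} Gb/s")
print("x4 alone can feed a 200G port:", x4 >= 200)
```

An x4 link comes out at roughly 126 Gb/s raw, well short of 200G, while an x8 link would clear it.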

The consequence: a single flow only ever gets ~100G. iperf with one stream tops out around 100G and people conclude their cluster is broken. It is not. You reach the full ~196-198G per node-pair only by driving both logical halves concurrently (parallel sessions, one per half). With that plus MTU 9000 end to end and IPv6 disabled on the CX-7 interfaces (which keeps the RoCE GID indices stable), NCCL all-reduce reaches the ~190G class.
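One way to drive both halves at once is to run one iperf3 client per logical half, each bound to that half's local address. The sketch below only builds the command lines. The flags used (-c, -B, -p, -t) are standard iperf3 options, but the addresses and the helper iperf3_client_cmds are hypothetical, and it assumes each half has its own address on both nodes.

```python
import shlex


def iperf3_client_cmds(pairs, seconds=10, base_port=5201):
    """Build one iperf3 client command per (local_ip, remote_ip) half."""
    cmds = []
    for i, (local_ip, remote_ip) in enumerate(pairs):
        cmds.append([
            "iperf3",
            "-c", remote_ip,           # peer address on this half
            "-B", local_ip,            # bind to this half's local address
            "-p", str(base_port + i),  # separate server port per session
            "-t", str(seconds),        # test duration in seconds
        ])
    return cmds


# Hypothetical addresses: one subnet per logical half.
halves = [
    ("192.168.100.1", "192.168.100.2"),
    ("192.168.101.1", "192.168.101.2"),
]

for cmd in iperf3_client_cmds(halves):
    print(shlex.join(cmd))
```

The peer needs a matching iperf3 -s -p <port> server for each session, and the clients have to run concurrently (for example via subprocess.Popen) for the combined figure to mean anything.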

What it unlocks, and the honest limits

With the link up, the pair behaves as one machine with double the unified memory, addressable through tensor parallelism or model partitioning. That is what lets a 405B model run across two Sparks that neither could hold alone. Distributed inference stacks (vLLM with Ray, TensorRT-LLM multi-node) and distributed fine-tuning both sit on top of this foundation.
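The capacity claim is easy to sanity-check with rough arithmetic. The sketch below counts weights only and assumes 4-bit quantization, which is an assumption of this example rather than something stated above. KV cache, activations and the operating system all need room on top, so treat the result as a lower bound.

```python
import math


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def min_sparks(params_billion: float, bits_per_weight: float,
               gb_per_spark: int = 128) -> int:
    """Smallest number of Sparks whose pooled memory fits the weights alone."""
    return math.ceil(weights_gb(params_billion, bits_per_weight) / gb_per_spark)


print(weights_gb(405, 4))  # weights of a 405B model at 4 bits per weight
print(min_sparks(405, 4))  # units needed for those weights
print(min_sparks(200, 4))  # a ~200B model at the same precision
```

At 4 bits per weight, a 405B model needs about 202.5 GB for weights, more than one unit's 128 GB but within the 256 GB of a pair.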

The limit to keep in mind: 200GbE is fast for a desktop interconnect but slow compared to the NVLink fabric inside a real multi-GPU server. Splitting a model across the link adds communication overhead on every token. Two Sparks let you run something you otherwise could not run at all, which is the point, but they are not the same as a single box with twice the memory bandwidth. As always with the Spark, capacity is the win and bandwidth is the constraint.