Serve a local LLM

You want a running inference server with an OpenAI-compatible API, so existing tools and your own code can talk to models hosted on the Spark.

There are several good options. Pick by how much control you want.

Ollama has native ARM64 and CUDA support and exposes an OpenAI-compatible API on port 11434.

```bash
# pull a model
ollama pull qwen3:32b

# the Ollama server is already running; test the OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:32b","messages":[{"role":"user","content":"Hello"}]}'
```

Good for: getting running in minutes, swapping models freely, desktop chat front-ends.
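The same request can be made from your own code. The following is a minimal sketch using only the Python standard library, assuming the Ollama endpoint and the `qwen3:32b` model from the curl example above. The helper names `build_chat_request` and `chat` are hypothetical, chosen for illustration, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

# Assumed defaults, taken from the curl example above.
BASE_URL = "http://localhost:11434/v1"
MODEL = "qwen3:32b"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build (but do not send) a POST request to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120, **kwargs):
    """Send the request and return the assistant's reply text.

    Requires the server to be running; raises urllib.error.URLError otherwise.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Passing a different `base_url` points the same code at any other OpenAI-compatible server.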

Build llama.cpp with CUDA and serve a GGUF model with its bundled OpenAI-compatible server, `llama-server`. NVIDIA has a dedicated playbook.

Good for: maximum control over quantization and sampling, running specific GGUF builds, squeezing the most out of the memory budget.

Sizing the model to the memory

With 128 GB of unified memory you have far more headroom than a typical workstation, but the model weights plus KV cache still have to fit. A rough rule for the weights alone:

| Precision | Bytes per parameter | 70B model weights |
|---|---|---|
| FP16 | 2 | ~140 GB (too big) |
| 8-bit | 1 | ~70 GB |
| 4-bit | 0.5 | ~35 GB |

A 70B model at 4-bit fits comfortably with room for context. The same model at FP16 does not. See unified memory for why the whole 128 GB is genuinely usable for this.
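The rule of thumb is simple enough to script for other model sizes. This is a rough sketch that covers the weights only, in decimal gigabytes. The function names and the `reserve_gb` parameter (a placeholder for whatever you choose to set aside for KV cache and the rest of the system) are assumptions for illustration, not measured values.

```python
# Bytes per parameter at each precision, as in the table above.
BYTES_PER_PARAM = {"FP16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weights_gb(params_billion, precision):
    """Approximate weight size in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion, precision, total_gb=128.0, reserve_gb=0.0):
    """True if the weights plus a caller-chosen reserve fit in total_gb.

    reserve_gb is a hypothetical allowance for KV cache and other memory use;
    the default of 0 checks the weights alone.
    """
    return weights_gb(params_billion, precision) + reserve_gb <= total_gb


for precision in BYTES_PER_PARAM:
    size = weights_gb(70, precision)
    verdict = "fits" if fits(70, precision) else "too big"
    print(f"70B @ {precision}: ~{size:.0f} GB ({verdict})")
```

Raising `reserve_gb` gives a more conservative answer for long contexts, where the KV cache takes a larger share of memory.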