Serve a local LLM
You want a running inference server with an OpenAI-compatible API, so existing tools and your own code can talk to models hosted on the Spark.
There are several good options. Pick by how much control you want.
Ollama has native ARM64 and CUDA support and exposes an OpenAI-compatible API on port 11434.

    # pull a model
    ollama pull qwen3:32b

    # it is already serving; test the OpenAI-compatible endpoint
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"qwen3:32b","messages":[{"role":"user","content":"Hello"}]}'

Good for: getting running in minutes, swapping models freely, desktop chat front-ends.
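
The curl test above can also be issued from your own code. The following is a minimal sketch using only the Python standard library, assuming the Ollama endpoint and the `qwen3:32b` model from the example; the helper names `build_chat_request` and `chat` are hypothetical, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request

# Assumed default: the Ollama port and /v1 prefix used in the curl example.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(prompt, model="qwen3:32b", base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Because the request shape is the OpenAI one, pointing `base_url` at another OpenAI-compatible server should be the only change needed to switch backends.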
Build llama.cpp with CUDA and serve a GGUF model with an OpenAI-compatible server. NVIDIA has a dedicated playbook.
Good for: maximum control over quantization and sampling, running specific GGUF builds, squeezing the most out of the memory budget.
For higher-throughput serving and production-shaped deployments, use vLLM, SGLang, or TensorRT-LLM. These are heavier to set up but give you batching, paged attention, and better concurrency.
Good for: serving multiple concurrent users, benchmarking, multi-node setups across two Sparks.
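
To see whether a server actually benefits from batching, it helps to send several requests at once and time the whole batch. This is a rough client-side sketch, not a proper benchmark: `run_concurrent` and `fake_send` are hypothetical names, and `fake_send` is a stand-in that sleeps instead of calling a server, so the example runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(send, prompts, max_workers=8):
    """Call send(prompt) for every prompt in parallel threads.

    `send` is any callable that takes a prompt and returns the reply,
    for example a function that POSTs to /v1/chat/completions.
    Returns (replies in input order, elapsed wall-clock seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        replies = list(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return replies, elapsed


def fake_send(prompt):
    """Stand-in for a real HTTP call so the sketch runs without a server."""
    time.sleep(0.05)
    return prompt.upper()


replies, elapsed = run_concurrent(fake_send, ["a", "b", "c", "d"], max_workers=4)
print(replies, f"{elapsed:.2f}s")
```

Swap `fake_send` for a function that posts to your server, then compare the elapsed time at `max_workers=1` against higher values. A server that batches well should show total time growing much more slowly than the number of concurrent requests.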
Sizing the model to the memory
With 128 GB of unified memory you have far more headroom than a typical workstation, but the model weights plus KV cache still have to fit. A rough rule for the weights alone:
| Precision | Bytes per parameter | 70B model weights |
|---|---|---|
| FP16 | 2 | ~140 GB (too big) |
| 8-bit | 1 | ~70 GB |
| 4-bit | 0.5 | ~35 GB |
A 70B model at 4-bit fits comfortably with room for context. The same model at FP16 does not. See unified memory for why the whole 128 GB is genuinely usable for this.
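
The arithmetic behind the table, plus a rough KV cache estimate, can be sketched as follows. The function names are hypothetical, and the model shape (80 layers, 8 KV heads, head dimension 128) is an assumption typical of Llama-style 70B models with grouped-query attention; check your model's config for the real values. The estimate ignores activations and runtime overhead.

```python
def weights_gb(params_billion, bytes_per_param):
    """Weight memory in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billion * bytes_per_param


def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache in GB: one key and one value vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


for name, bpp in [("FP16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{weights_gb(70, bpp):.0f} GB")

# Assumed Llama-style 70B shape; the cache is stored in FP16 (2 bytes per value).
cache = kv_cache_gb(tokens=32_768, layers=80, kv_heads=8, head_dim=128)
total = weights_gb(70, 0.5) + cache
print(f"4-bit weights + 32k-token KV cache: ~{total:.1f} GB")
```

Under these assumptions a 32k-token context adds roughly 11 GB on top of the 35 GB of 4-bit weights, which is why the 4-bit model leaves plenty of room within 128 GB.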