Running Gemma-4-31B on DGX Spark with MAX Engine — What Broke and How We Fixed It
Getting Modular's MAX Engine to serve a 31B parameter model on NVIDIA's smallest Grace Blackwell system.
By Adam Kruger
If you have a DGX Spark and tried to run MAX Engine, you probably hit a wall. We did too. Here's what we found, what we fixed, and what's still broken — so you don't have to spend a weekend debugging it.
The Setup
- NVIDIA DGX Spark (GB10, ARM64 Grace CPU, 128GB unified memory)
- MAX Engine 26.4.0 nightly
- Gemma-4-31B-IT
We wanted to serve Gemma-4 locally for research. MAX Engine claims day-zero Gemma-4 support and DGX Spark compatibility. The reality was more nuanced.
Issue 1: Memory Estimation on Unified Memory
MAX's memory estimator reads CUDA-reported GPU memory (~34GB) and rejects any model larger than that. On the DGX Spark, the GPU and CPU share a 128GB unified memory pool — but CUDA only reports the GPU's nominal allocation as "GPU memory."
The fix: Detect unified-memory systems (check `/sys/class/dmi/id/product_name` for "DGX_Spark" or "AI TOP ATOM") and read available system memory from `/proc/meminfo` instead of CUDA's `free_memory` stat.
This is a one-function patch to `MemoryEstimator.free_memory()` in MAX's pipeline configuration.
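A minimal sketch of the detection side, using the same DMI strings we matched; the helper names here are ours, and only `MemoryEstimator.free_memory()` is the real patch point:

```python
def _is_unified_memory_system() -> bool:
    """Match the DMI product names of known unified-memory boards."""
    try:
        with open("/sys/class/dmi/id/product_name") as f:
            return f.read().strip() in ("DGX_Spark", "AI TOP ATOM")
    except OSError:
        return False


def _available_system_memory() -> int:
    """Read MemAvailable from /proc/meminfo (reported in kB)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemAvailable not found in /proc/meminfo")
```

On a unified-memory box, `free_memory()` returns `_available_system_memory()`; everywhere else it falls through to the existing CUDA path.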
Issue 2: Pydantic Type Resolution
MAX's `PipelineConfig` and `MAXModelConfig` use deferred type annotations that reference PyTorch tensor types. If you don't import `torch` before the config objects are created, Pydantic throws a `PydanticUserError` about models not being "fully defined."
The fix: Import `torch` and call `model_rebuild()` on both config classes before running the CLI.
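In practice that's a few lines run before MAX's CLI entry point. The import path for the config classes is an assumption; adjust it to wherever your MAX version exposes them:

```python
import torch  # noqa: F401  # must be imported before the annotations resolve

# Import path is an assumption; adjust to your MAX version.
from max.pipelines import MAXModelConfig, PipelineConfig

# Re-resolve the deferred torch annotations now that torch is loaded.
PipelineConfig.model_rebuild()
MAXModelConfig.model_rebuild()
```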
Issue 3: CUDA Memory Overallocation
Without constraints, CUDA allocates the entire unified memory pool during model loading and graph compilation. On a desktop GPU with separate VRAM, this is fine — the OS has its own memory. On unified memory, CUDA eating everything kills the OS.
The fix: We use a CUDA memory limiter shim, a shared library loaded via `LD_PRELOAD` that intercepts `cuMemAlloc` at the driver level and enforces a hard ceiling. Set the ceiling to 90-96GB to leave headroom for the OS. (We've open-sourced this tool separately; link at bottom.)
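Before pointing the shim at MAX, it's worth sanity-checking it with a throwaway allocation. This sketch assumes the shim reads its ceiling from an environment variable and also catches allocations routed through the CUDA runtime; the library path and variable name are placeholders for your build:

```python
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/opt/cuda-mem-limit/libcumemlimit.so"  # placeholder path
env["CUDA_MEM_LIMIT_BYTES"] = str(8 * 1024**3)  # deliberately tiny 8GB ceiling

# Try to allocate 16GB of device memory under the shim; it should fail.
probe = "import torch; torch.empty(16 * 1024**3, dtype=torch.uint8, device='cuda')"
result = subprocess.run(["python3", "-c", probe], env=env)
print("ceiling enforced" if result.returncode != 0 else "ceiling NOT enforced")
```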
Issue 4: The Sampling Kernel (Still Open)
With all three patches applied, MAX Engine successfully:
- Compiles the Gemma-4 language model graph (285 seconds)
- Compiles the vision model (5 seconds)
- Loads weights into unified memory
- Captures device graphs
- Starts the OpenAI-compatible API server
It even responds to inference requests — with greedy sampling (temperature=0).
But the top-K/top-P sampling kernel crashes with `CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES`. The kernel at `topk_fi.mojo:1419` requests more registers or threads per block than the GB10's SM 12.1 architecture supports.
This is a kernel tuning issue inside MAX's compiled core — we can't patch it ourselves. We've filed a detailed bug report with full reproduction steps (modular/modular#6488).
Workaround: Use `temperature: 0` for greedy decoding. It bypasses the broken sampling kernel entirely and produces correct output.
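For example, a minimal request against the OpenAI-compatible endpoint; the host, port, and model id below are whatever your server reports at startup:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # adjust to your server
    json={
        "model": "gemma-4-31b-it",  # use the model id your server registers
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "temperature": 0,  # greedy decoding; skips the broken top-K/top-P kernel
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```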
Results
With all patches applied:
| What | Status |
|---|---|
| Model compilation | Works (290s total) |
| Weight loading | Works |
| Server startup | Works |
| Greedy sampling | Works |
| Top-K/Top-P sampling | Broken (kernel resource limit) |
Gemma-4-31B serves inference on a DGX Spark through MAX Engine. Not perfectly — but it works.
For DGX Spark Owners
If you want to try this yourself:
- Install the CUDA memory limiter shim before anything else
- Apply the unified memory patch to MAX's memory estimator
- Use `--max-batch-size 4` to keep graph capture memory reasonable
- Set `temperature: 0` until Modular fixes the sampling kernel
We're happy to share our wrapper script and patches. Reach out or check the GitHub issue for details.
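In the meantime, here's the shape of that wrapper as a sketch. The shim path, the limit variable name, and both MAX import paths are assumptions; the sequence of steps is the point:

```python
#!/usr/bin/env python3
"""Launch MAX Engine on DGX Spark with our workarounds applied (sketch)."""
import os
import sys

SHIM = "/opt/cuda-mem-limit/libcumemlimit.so"  # placeholder path (Issue 3)

if os.environ.get("LD_PRELOAD") != SHIM:
    # LD_PRELOAD only takes effect at process start, so re-exec ourselves.
    env = dict(os.environ, LD_PRELOAD=SHIM, CUDA_MEM_LIMIT_BYTES=str(96 * 1024**3))
    os.execve(sys.executable, [sys.executable, *sys.argv], env)

import torch  # noqa: F401  # Issue 2: import before the config classes

from max.pipelines import MAXModelConfig, PipelineConfig  # path is an assumption

PipelineConfig.model_rebuild()  # Issue 2: resolve deferred torch annotations
MAXModelConfig.model_rebuild()

# Issue 1 is a source patch to MemoryEstimator.free_memory(), applied separately.
from max.entrypoints.pipelines import main  # entry point is an assumption

sys.argv += ["--max-batch-size", "4"]  # keep graph capture memory reasonable
main()
```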
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.