If you have a DGX Spark and tried to run MAX Engine, you probably hit a wall. We did too. Here's what we found, what we fixed, and what's still broken — so you don't have to spend a weekend debugging it.
The Setup
- NVIDIA DGX Spark (GB10, ARM64 Grace CPU, 128GB unified memory)
- MAX Engine 26.4.0 nightly
- Gemma-4-31B-IT
We wanted to serve Gemma-4 locally for research. MAX Engine claims day-zero Gemma-4 support and DGX Spark compatibility. The reality was more nuanced.
Issue 1: Memory Estimation on Unified Memory
MAX's memory estimator reads CUDA-reported GPU memory (~34GB) and rejects any model larger than that. On the DGX Spark, the GPU and CPU share a 128GB unified memory pool — but CUDA only reports the GPU's nominal allocation as "GPU memory."
The fix: Detect unified memory systems (check /sys/class/dmi/id/product_name for "DGX_Spark" or "AI TOP ATOM") and read system available memory from /proc/meminfo instead of CUDA's free_memory stat.
This is a one-function patch to MemoryEstimator.free_memory() in MAX's pipeline configuration.
Issue 2: Pydantic Type Resolution
MAX's PipelineConfig and MAXModelConfig use deferred type annotations that reference PyTorch tensor types. If you don't import torch before the config objects are created, Pydantic throws a PydanticUserError about models not being "fully defined."
The fix: Import torch and call model_rebuild() on both config classes before running the CLI.
Issue 3: CUDA Memory Overallocation
Without constraints, CUDA allocates the entire unified memory pool during model loading and graph compilation. On a desktop GPU with separate VRAM, this is fine — the OS has its own memory. On unified memory, CUDA eating everything kills the OS.
The fix: We use a CUDA memory limiter shim — a shared library loaded via LD_PRELOAD that intercepts cuMemAlloc at the driver level and enforces a hard ceiling. Set it to 90-96GB to leave headroom for the OS. (We've open-sourced this tool separately — link at bottom.)
Issue 4: The Sampling Kernel (Still Open)
With all three patches applied, MAX Engine successfully:
- Compiles the Gemma-4 language model graph (285 seconds)
- Compiles the vision model (5 seconds)
- Loads weights into unified memory
- Captures device graphs
- Starts the OpenAI-compatible API server
It even responds to inference requests — with greedy sampling (temperature=0).
But the top-K/top-P sampling kernel crashes with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES. The kernel at topk_fi.mojo:1419 requests more registers or threads per block than the GB10's SM 12.1 architecture supports.
This is a kernel tuning issue inside MAX's compiled core — we can't patch it ourselves. We've filed a detailed bug report with full reproduction steps (modular/modular#6488).
Workaround: Use temperature: 0 for greedy decoding. It bypasses the broken sampling kernel entirely and produces correct output.
Results
With all patches applied:
| What | Status |
|---|---|
| Model compilation | Works (290s total) |
| Weight loading | Works |
| Server startup | Works |
| Greedy sampling | Works |
| Top-K/Top-P sampling | Broken (kernel resource limit) |
Gemma-4-31B serves inference on a DGX Spark through MAX Engine. Not perfectly — but it works.
For DGX Spark Owners
If you want to try this yourself:
- Install the CUDA memory limiter shim before anything else
- Apply the unified memory patch to MAX's memory estimator
- Use
--max-batch-size 4to keep graph capture memory reasonable - Set
temperature: 0until Modular fixes the sampling kernel
We're happy to share our wrapper script and patches. Reach out or check the GitHub issue for details.