Running Gemma-4-31B on DGX Spark with MAX Engine — What Broke and How We Fixed It
Getting Modular's MAX Engine to serve a 31B parameter model on NVIDIA's smallest Grace Blackwell system.
By Adam Kruger
If you have a DGX Spark and tried to run MAX Engine, you probably hit a wall. We did too. Here's what we found, what we fixed, and what's still broken — so you don't have to spend a weekend debugging it.
The Setup
- NVIDIA DGX Spark (GB10, ARM64 Grace CPU, 128GB unified memory)
- MAX Engine 26.4.0 nightly
- Gemma-4-31B-IT
We wanted to serve Gemma-4 locally for research. MAX Engine claims day-zero Gemma-4 support and DGX Spark compatibility. The reality was more nuanced.
Issue 1: Memory Estimation on Unified Memory
MAX's memory estimator reads CUDA-reported GPU memory (~34GB) and rejects any model larger than that. On the DGX Spark, the GPU and CPU share a 128GB unified memory pool — but CUDA only reports the GPU's nominal allocation as "GPU memory."
The fix: Detect unified-memory systems (check `/sys/class/dmi/id/product_name` for "DGX_Spark" or "AI TOP ATOM") and read available system memory from `/proc/meminfo` instead of CUDA's `free_memory` stat.
This is a one-function patch to `MemoryEstimator.free_memory()` in MAX's pipeline configuration.
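A minimal sketch of the detection side, using the same DMI strings we matched; the helper names here are ours, and only `MemoryEstimator.free_memory()` is the real patch point:

```python
def _is_unified_memory_system() -> bool:
    """Match the DMI product names of known unified-memory boards."""
    try:
        with open("/sys/class/dmi/id/product_name") as f:
            return f.read().strip() in ("DGX_Spark", "AI TOP ATOM")
    except OSError:
        return False


def _available_system_memory() -> int:
    """Read MemAvailable from /proc/meminfo (reported in kB)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemAvailable not found in /proc/meminfo")
```

On a unified-memory box, `free_memory()` returns `_available_system_memory()`; everywhere else it falls through to the existing CUDA path.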
Issue 2: Pydantic Type Resolution
MAX's `PipelineConfig` and `MAXModelConfig` use deferred type annotations that reference PyTorch tensor types. If you don't import `torch` before the config objects are created, Pydantic throws a `PydanticUserError` about models not being "fully defined."
The fix: Import `torch` and call `model_rebuild()` on both config classes before running the CLI.
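In practice that's a few lines run before MAX's CLI entry point. The import path for the config classes is an assumption; adjust it to wherever your MAX version exposes them:

```python
import torch  # noqa: F401  # must be imported before the annotations resolve

# Import path is an assumption; adjust to your MAX version.
from max.pipelines import MAXModelConfig, PipelineConfig

# Re-resolve the deferred torch annotations now that torch is loaded.
PipelineConfig.model_rebuild()
MAXModelConfig.model_rebuild()
```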
Issue 3: CUDA Memory Overallocation
Without constraints, CUDA allocates the entire unified memory pool during model loading and graph compilation. On a desktop GPU with separate VRAM, this is fine — the OS has its own memory. On unified memory, CUDA eating everything kills the OS.
The fix: We use a CUDA memory limiter shim, a shared library loaded via `LD_PRELOAD` that intercepts `cuMemAlloc` at the driver level and enforces a hard ceiling. Set the ceiling to 90-96GB to leave headroom for the OS. (We've open-sourced this tool separately; link at bottom.)
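Before pointing the shim at MAX, it's worth sanity-checking it with a throwaway allocation. This sketch assumes the shim reads its ceiling from an environment variable and also catches allocations routed through the CUDA runtime; the library path and variable name are placeholders for your build:

```python
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/opt/cuda-mem-limit/libcumemlimit.so"  # placeholder path
env["CUDA_MEM_LIMIT_BYTES"] = str(8 * 1024**3)  # deliberately tiny 8GB ceiling

# Try to allocate 16GB of device memory under the shim; it should fail.
probe = "import torch; torch.empty(16 * 1024**3, dtype=torch.uint8, device='cuda')"
result = subprocess.run(["python3", "-c", probe], env=env)
print("ceiling enforced" if result.returncode != 0 else "ceiling NOT enforced")
```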
Issue 4: The Sampling Kernel (Still Open)
With all three patches applied, MAX Engine successfully:
- Compiles the Gemma-4 language model graph (285 seconds)
- Compiles the vision model (5 seconds)
- Loads weights into unified memory
- Captures device graphs
- Starts the OpenAI-compatible API server
It even responds to inference requests — with greedy sampling (temperature=0).
But the top-K/top-P sampling kernel crashes with `CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES`. The kernel at `topk_fi.mojo:1419` requests more registers or threads per block than the GB10's SM 12.1 architecture supports.
This is a kernel tuning issue inside MAX's compiled core — we can't patch it ourselves. We've filed a detailed bug report with full reproduction steps (modular/modular#6488).
Workaround: Use `temperature: 0` for greedy decoding. It bypasses the broken sampling kernel entirely and produces correct output.
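For example, a minimal request against the OpenAI-compatible endpoint; the host, port, and model id below are whatever your server reports at startup:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # adjust to your server
    json={
        "model": "gemma-4-31b-it",  # use the model id your server registers
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "temperature": 0,  # greedy decoding; skips the broken top-K/top-P kernel
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```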
Results
With all patches applied:
| What | Status |
|---|---|
| Model compilation | Works (290s total) |
| Weight loading | Works |
| Server startup | Works |
| Greedy sampling | Works |
| Top-K/Top-P sampling | Broken (kernel resource limit) |
Gemma-4-31B serves inference on a DGX Spark through MAX Engine. Not perfectly — but it works.
For DGX Spark Owners
If you want to try this yourself:
- Install the CUDA memory limiter shim before anything else
- Apply the unified memory patch to MAX's memory estimator
- Use `--max-batch-size 4` to keep graph capture memory reasonable
- Set `temperature: 0` until Modular fixes the sampling kernel
We're happy to share our wrapper script and patches. Reach out or check the GitHub issue for details.
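In the meantime, here's the shape of that wrapper as a sketch. The shim path, the limit variable name, and both MAX import paths are assumptions; the sequence of steps is the point:

```python
#!/usr/bin/env python3
"""Launch MAX Engine on DGX Spark with our workarounds applied (sketch)."""
import os
import sys

SHIM = "/opt/cuda-mem-limit/libcumemlimit.so"  # placeholder path (Issue 3)

if os.environ.get("LD_PRELOAD") != SHIM:
    # LD_PRELOAD only takes effect at process start, so re-exec ourselves.
    env = dict(os.environ, LD_PRELOAD=SHIM, CUDA_MEM_LIMIT_BYTES=str(96 * 1024**3))
    os.execve(sys.executable, [sys.executable, *sys.argv], env)

import torch  # noqa: F401  # Issue 2: import before the config classes

from max.pipelines import MAXModelConfig, PipelineConfig  # path is an assumption

PipelineConfig.model_rebuild()  # Issue 2: resolve deferred torch annotations
MAXModelConfig.model_rebuild()

# Issue 1 is a source patch to MemoryEstimator.free_memory(), applied separately.
from max.entrypoints.pipelines import main  # entry point is an assumption

sys.argv += ["--max-batch-size", "4"]  # keep graph capture memory reasonable
main()
```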
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.