Introduction

Following my previous guides on vLLM setup and GPU passthrough in Proxmox, this guide covers SGLang, a high-performance LLM serving framework that has emerged as a leading alternative to vLLM.

In this guide, I’ll walk through setting up SGLang on dual RTX PRO 6000 Blackwell GPUs (192GB total VRAM) and running MiniMax M2 with NVFP4 quantization - a powerful combination that delivers excellent performance for local AI inference.

Why SGLang?

SGLang offers several advantages over vLLM:

  • RadixAttention: Superior KV cache management for multi-turn conversations
  • DeepSeek optimizations: Day-1 support with MLA throughput optimizations
  • Python-first frontend: Flexible chained generation calls
  • Industry adoption: Powers 400,000+ GPUs worldwide (xAI, AMD, NVIDIA, Cursor)

Why MiniMax M2 with NVFP4?

MiniMax M2 is a 230B parameter MoE model with 10B active parameters - excellent for coding and agentic tasks. The NVFP4 quantization:

  • Only quantizes MoE expert MLP layers (attention stays in BF16)
  • Significant memory savings (~50% reduction)
  • Runs on 2x 96GB GPUs instead of requiring 4x H100s
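A rough back-of-envelope check of the weight footprint (illustrative assumptions only: ~220B of the 230B parameters sit in the quantized expert layers, FP4 weights cost 0.5 bytes each plus roughly 1 byte of block-scale overhead per 16 weights, and the remaining ~10B attention/embedding parameters stay in BF16 at 2 bytes each):

```shell
# Rough NVFP4 weight-size estimate - illustrative numbers, not exact file sizes
awk 'BEGIN {
  fp4_gb  = 220e9 * 0.5625 / 1e9   # 4-bit weights + ~1/16 scale overhead
  bf16_gb = 10e9  * 2      / 1e9   # attention/embeddings kept in BF16
  printf "experts: %.0f GB, rest: %.0f GB, total: ~%.0f GB\n", fp4_gb, bf16_gb, fp4_gb + bf16_gb
}'
```

Under these assumptions the weights land in the 140-150 GB range, which fits in the 192GB of combined VRAM with room left over for the KV cache.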

Prerequisites

Hardware

  • 2x NVIDIA RTX PRO 6000 Blackwell (96GB each)
  • 192GB total VRAM required for MiniMax M2 NVFP4
  • CUDA driver 575+ (580+ recommended)

Software

  • CUDA 12.8 or 13.0
  • Python 3.12
  • UV package manager

Verify your GPU:

nvidia-smi
# Expected: Driver Version 580+, CUDA 13.0
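If you want a setup script to enforce the driver requirement, a small version-sort check works. The `driver_ok` helper name and the sample version strings below are illustrative:

```shell
# Gate on the 575+ driver requirement using version sort
driver_ok() {
  local min="575" have="$1"
  # sort -V orders by version; the driver is OK if the minimum sorts first
  [ "$(printf '%s\n' "$min" "$have" | sort -V | head -n1)" = "$min" ]
}

driver_ok "580.65.06" && echo "driver OK" || echo "driver too old (need 575+)"
# On a live system, feed in the real version:
#   driver_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
```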

Step 1: Install SGLang

Create a fresh Python environment and install SGLang with CUDA 13.0 support:

# Create directory
mkdir sglang-minimax && cd sglang-minimax

# Create venv with Python 3.12
uv venv --python 3.12
source .venv/bin/activate

Installation Script

Here’s the working installation and startup script for CUDA 13.x with Blackwell GPUs:

# Step 1: Install nightly sglang
uv pip install -U sglang --pre \
  --index-url https://sgl-project.github.io/whl/cu129/ \
  --extra-index-url https://pypi.org/simple \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match

# Step 2: Install CUDA 13.0 kernel
uv pip install -U sgl-kernel \
  --extra-index-url https://sgl-project.github.io/whl/cu130/ \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match

# Step 3: Launch the server
source .venv/bin/activate

uv run python3 -m sglang.launch_server \
  --model-path /home/johannes/models/lukealonso/MiniMax-M2.5-NVFP4/ \
  --fp4-gemm-backend flashinfer_cutlass \
  --attention-backend triton \
  --served-model-name vllm \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --host 0.0.0.0 \
  --port 8000

Installation Parameters Explained

| Parameter | Description |
| --- | --- |
| `uv pip install -U sglang --pre` | Install the latest pre-release (nightly) version |
| `--index-url https://sgl-project.github.io/whl/cu129/` | Primary index for CUDA 12.9 builds |
| `--extra-index-url https://pypi.org/simple` | PyPI for non-CUDA dependencies |
| `--extra-index-url https://download.pytorch.org/whl/cu130` | CUDA 13.0 PyTorch wheels |
| `--index-strategy unsafe-best-match` | Use the highest version across all indexes |

Why these indexes?

  • cu129: SGLang provides CUDA 12.9 wheels with Blackwell-optimized kernels
  • cu130: PyTorch CUDA 13.0 wheels for runtime
  • unsafe-best-match: Ensures best version resolution across all indexes

Step 2: Download MiniMax M2 NVFP4

Download the NVFP4-quantized model using rust-hf-downloader - a fast Rust-based alternative to huggingface-cli:

Install rust-hf-downloader

# Install Rust if not already installed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Install rust-hf-downloader
cargo install rust-hf-downloader

Download the Model

# Set HuggingFace token (required for gated models)
export HF_TOKEN=your_hf_token_here

# Download MiniMax-M2.5-NVFP4
rust-hf-downloader --headless download lukealonso/MiniMax-M2.5-NVFP4 \
    --output ~/models

Key Commands

| Command / Flag | Description |
| --- | --- |
| `--headless` | Run in CLI mode (required for CLI commands) |
| `--output` | Set output directory (default: `~/rust-hf-downloader`) |
| `--all` | Download all files (not just recommended) |
| `--dry-run` | Preview what would be downloaded |
| `--json` | JSON output for scripting |
| `resume` | Resume interrupted downloads |

For this NVFP4 model, the default (recommended) file selection already includes everything the server needs, so no extra flags are required.


Note: Expert parallelism (--ep) can cause MiniMax M2 NVFP4 to fail to load. Use only --tp for this model.

Step 3: Run SGLang with Your Configuration

Complete Working Script

Save this as run.sh:

#!/bin/bash
source .venv/bin/activate

uv run sglang serve \
  --model-path /home/johannes/models/lukealonso/MiniMax-M2.5-NVFP4/ \
  --fp4-gemm-backend flashinfer_cutlass \
  --attention-backend triton \
  --served-model-name vllm \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --host 0.0.0.0 \
  --port 8000

Server Parameters Explained

| Parameter | Value | Description |
| --- | --- | --- |
| `--model-path` | `/home/johannes/.../MiniMax-M2.5-NVFP4/` | Path to the NVFP4-quantized model |
| `--fp4-gemm-backend` | `flashinfer_cutlass` | FP4/FP8 GEMM backend: uses FlashInfer CUTLASS kernels for matrix multiplications with quantized weights; generally more stable on Blackwell than `triton` |
| `--attention-backend` | `triton` | Attention backend: Triton is more stable than FlashInfer on some Blackwell configurations, where FlashInfer can crash on the first prompt |
| `--served-model-name` | `vllm` | Model name exposed in the OpenAI-compatible API; set to "vllm" for compatibility with existing clients |
| `--trust-remote-code` | (flag) | Required for MiniMax models, which ship custom code not in the standard Transformers library |
| `--tensor-parallel-size` | `2` | Tensor parallelism: distributes the model across both RTX PRO 6000 cards |
| `--tool-call-parser` | `minimax-m2` | Enables structured tool calls by parsing MiniMax M2's tool-call syntax in model output |
| `--reasoning-parser` | `minimax-append-think` | Handles MiniMax's thinking/reasoning output format; the "append" variant keeps the thinking in the response |
| `--host` | `0.0.0.0` | Listen on all network interfaces |
| `--port` | `8000` | Standard port for the OpenAI-compatible API |

Why These Specific Choices?

--fp4-gemm-backend flashinfer_cutlass:

  • NVFP4 quantized models require specialized GEMM kernels
  • FlashInfer CUTLASS provides the most stable Blackwell support
  • Alternative: triton if you encounter issues

--attention-backend triton:

  • Triton backend avoids FlashInfer crashes on Blackwell
  • Some users report needing --attention-backend triton --sampling-backend pytorch to prevent crashes

--tensor-parallel-size 2:

  • Matches your 2x RTX PRO 6000 Blackwell setup
  • Each GPU handles half the model layers

--tool-call-parser minimax-m2 and --reasoning-parser minimax-append-think:

  • MiniMax M2 has custom output formats for tool calling and reasoning
  • These parsers handle the special tokens and formatting
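The tool-call parser can be exercised with an OpenAI-style request that carries a `tools` array. A minimal illustrative request body (the `get_weather` function is a made-up example, not part of the model):

```shell
# Build a tool-calling request body and sanity-check that it is valid JSON
body='{
  "model": "vllm",
  "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
printf '%s' "$body" | python3 -m json.tool >/dev/null && echo "valid JSON"
# Against a running server:
#   curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$body"
```

With the parser enabled, the model's tool invocation comes back as a structured `tool_calls` entry rather than raw text.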

Run the Server

chmod +x run.sh
./run.sh

Expected Startup Output

[2026-03-30 14:21:44 TP1] Fixing v5 tokenizer component mismatch for /home/johannes/models/lukealonso/MiniMax-M2.5-NVFP4/: pre_tokenizer ByteLevel -> Sequence, decoder ByteLevel -> ByteLevel
[2026-03-30 14:21:44 TP0] Fixing v5 tokenizer component mismatch for /home/johannes/models/lukealonso/MiniMax-M2.5-NVFP4/: pre_tokenizer ByteLevel -> Sequence, decoder ByteLevel -> ByteLevel
[2026-03-30 14:21:44 TP0] max_total_num_tokens=181721, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=196608, available_gpu_mem=7.74 GB
[2026-03-30 14:21:44] INFO:     Started server process [171156]
[2026-03-30 14:21:44] INFO:     Waiting for application startup.
[2026-03-30 14:21:44] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}
[2026-03-30 14:21:44] INFO:     Application startup complete.
[2026-03-30 14:21:44] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[2026-03-30 14:21:45] INFO:     127.0.0.1:52234 - "GET /model_info HTTP/1.1" 200 OK
[2026-03-30 14:21:46 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: True, input throughput (token/s): 0.00
[2026-03-30 14:21:46] INFO:     127.0.0.1:52246 - "POST /generate HTTP/1.1" 200 OK
[2026-03-30 14:21:46] The server is fired up and ready to roll!

Output Explanation:

| Line | Meaning |
| --- | --- |
| `Fixing v5 tokenizer component mismatch` | SGLang auto-fixes the tokenizer config for HuggingFace v5 compatibility; normal |
| `max_total_num_tokens=181721` | Total KV-cache token capacity across all concurrent requests (`context_len=196608`, a ~196K context window) |
| `available_gpu_mem=7.74 GB` | Remaining VRAM for dynamic allocations (~88GB per GPU used by model weights + KV cache pool) |
| `chunked_prefill_size=8192` | Prefill is chunked for memory efficiency |
| `cuda graph: True` | CUDA graphs enabled for optimization |
| `Using default chat sampling params` | Model's built-in generation config (temperature=1.0, top_p=0.95) |
| `The server is fired up and ready to roll!` | Server is ready for requests |

Note on Memory Usage: With 96GB GPUs, ~88GB per card is used for model weights and the KV cache pool. The remaining ~7.74GB is left for activations and other dynamic allocations during inference. The --mem-fraction-static flag controls how much memory is reserved for this static pool (weights plus KV cache, default ~0.9).
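The arithmetic behind that split, assuming the default static fraction of roughly 0.9 on a 96GB card:

```shell
# Static vs. dynamic memory split per GPU (assumed defaults, not measured values)
awk -v total=96 -v frac=0.9 'BEGIN {
  printf "static pool: %.1f GB, dynamic headroom: %.1f GB\n", total * frac, total * (1 - frac)
}'
```

The observed 7.74GB is a little under the nominal 9.6GB headroom because CUDA context and framework overhead also come out of the dynamic share.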


Step 4: Performance Tuning

Optimization Flags

For Blackwell GPUs with CUDA 13.x, these flags improve stability and performance:

# Environment variables for better Blackwell support
export PYTORCH_ALLOC_CONF=expandable_segments:True
export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=PHB

Attention Backend Selection

This setup uses --attention-backend triton. Here’s why:

| Backend | Pros | Cons |
| --- | --- | --- |
| `triton` | More stable on Blackwell, less crash-prone | Slightly slower than FlashInfer |
| `flashinfer` | Faster attention computation | Can crash on first prompt with Blackwell |
| `flashinfer_workspace` | Optimized workspace allocation | Requires more tuning |

FP4 GEMM Backend Selection

| Backend | Pros | Cons |
| --- | --- | --- |
| `flashinfer_cutlass` | Most stable for NVFP4 | (none observed) |
| `triton` | Alternative if issues arise | May emit warnings |
| `cutlass` | Fallback option | Less optimized |

Troubleshooting Common Issues

Issue: CUDA Graph Capture Fails

If you see errors during batch capture:

# Disable CUDA graphs if unstable
--disable-cuda-graph

Issue: High Idle GPU Usage

If GPUs show 100% utilization when idle:

# Enable sleep when idle
--enable-sleep-on-idle
# Or
--sleep-when-idle

Issue: DeepGEMM Warnings

These are typically cosmetic. For reduced warnings:

# Try alternative MoE backend
--moe-runner-backend triton

Benchmarking

Test with a simple request:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm",
    "messages": [
      {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers."}
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'

Note: Use vllm as the model name since that’s what was set with --served-model-name vllm.
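To turn a response into a throughput number, divide `completion_tokens` from the response's `usage` block by the wall-clock time of the request. A toy calculation with canned values (on a live server, extract `usage` from the JSON response and time the curl call yourself):

```shell
# Derive generation throughput from a usage block and elapsed seconds (canned values)
usage='{"prompt_tokens":12,"completion_tokens":480}'
elapsed=4.8
comp=$(printf '%s' "$usage" | grep -o '"completion_tokens":[0-9]*' | grep -o '[0-9]*$')
awk -v t="$comp" -v s="$elapsed" 'BEGIN { printf "%.0f tokens/s\n", t / s }'
```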

Expected Performance

With 2x RTX PRO 6000 Blackwell and MiniMax M2 NVFP4:

| Metric | Expected Value |
| --- | --- |
| Prompt throughput | ~8,000-12,000 tokens/s |
| Generation throughput | ~80-120 tokens/s |
| Latency (first token) | ~100-200ms |
| VRAM usage | ~87-88GB per GPU |

Monitoring

The /metrics endpoint is only exposed when the server is launched with the --enable-metrics flag:

# Add --enable-metrics to your server launch
uv run sglang serve \
  --model-path ~/models/lukealonso/MiniMax-M2.5-NVFP4/ \
  --fp4-gemm-backend flashinfer_cutlass \
  --attention-backend triton \
  --served-model-name vllm \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --enable-metrics \
  --host 0.0.0.0 \
  --port 8000

Access Metrics

# View metrics
curl http://localhost:8000/metrics

# In browser
# Navigate to http://localhost:8000/metrics

Key Metrics

| Metric | Description |
| --- | --- |
| `sglang_prefill_throughput` | Input processing speed (tokens/s) |
| `sglang_decode_throughput` | Output generation speed (tokens/s) |
| `gpu_kv_cache_usage` | KV cache utilization |
| `radix_cache_hit_rate` | Prefix cache efficiency |
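The endpoint serves Prometheus text format, which is easy to slice with awk. The metric lines below are canned placeholders for illustration; check your server's actual /metrics output for the exact names and labels:

```shell
# Extract one gauge from Prometheus-style text output (sample data, not live metrics)
metrics='gpu_kv_cache_usage 0.31
sglang_decode_throughput 92.4'
printf '%s\n' "$metrics" | awk '$1 == "gpu_kv_cache_usage" { printf "KV cache usage: %.0f%%\n", $2 * 100 }'
```

Against a live server, replace the canned variable with `curl -s http://localhost:8000/metrics`.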

Comparing with vLLM

Based on community benchmarks:

| Feature | SGLang | vLLM |
| --- | --- | --- |
| Multi-GPU (2x) performance | Up to ~150% better with DP | Baseline |
| RadixAttention | Native | Limited |
| DeepSeek optimizations | Day-1 | Delayed |
| Blackwell support | Good | Good |
| NVFP4 support | Good | Good |

For multi-GPU setups where the model fits in total VRAM, SGLang’s data parallelism typically outperforms vLLM’s tensor parallelism.
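As a sketch, a data-parallel launch for a hypothetical model small enough to fit on a single 96GB card might look like this (`--dp-size` replicates the full model per GPU, so it does not apply to MiniMax M2 NVFP4, which needs both cards via --tensor-parallel-size 2 as shown above; the model path is a placeholder):

```shell
# Illustrative DP launch - only for models that fit on ONE card
uv run python3 -m sglang.launch_server \
  --model-path /path/to/smaller-model \
  --dp-size 2 \
  --host 0.0.0.0 \
  --port 8000
```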


Summary

Here’s the working configuration for CUDA 13.x with dual RTX PRO 6000 Blackwell:

# 1. Setup environment
mkdir sglang-minimax && cd sglang-minimax
uv venv --python 3.12
source .venv/bin/activate

# 2. Install SGLang (nightly with CUDA 13.x support)
uv pip install -U sglang --pre \
  --index-url https://sgl-project.github.io/whl/cu129/ \
  --extra-index-url https://pypi.org/simple \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match

# 3. Install CUDA 13.0 kernel
uv pip install -U sgl-kernel \
  --extra-index-url https://sgl-project.github.io/whl/cu130/ \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match

# 4. Run server
uv run sglang serve \
  --model-path /home/johannes/models/lukealonso/MiniMax-M2.5-NVFP4/ \
  --fp4-gemm-backend flashinfer_cutlass \
  --attention-backend triton \
  --served-model-name vllm \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --host 0.0.0.0 \
  --port 8000

With dual RTX PRO 6000 Blackwell GPUs and MiniMax M2 NVFP4, you have a powerful local inference setup capable of handling demanding coding and agentic workloads.

Key takeaways:

  • Use cu129 index for SGLang wheels with Blackwell support
  • Use triton attention backend for stability on Blackwell
  • Use flashinfer_cutlass for FP4 GEMM operations
  • Enable tool-call-parser and reasoning-parser for MiniMax models

Happy serving!
