Why AIBrix Over Running vLLM Directly?
You’ve got a fresh Fedora 42 box with NVIDIA GPUs and you want to serve LLMs. The obvious path is running pip install vllm or spinning up a Docker container. Both work well: vLLM supports multi-GPU inference via --tensor-parallel-size regardless of how you run it, and for a single model on a single server, either approach is perfectly fine.
Where AIBrix pulls ahead is when you move beyond “one model, one server.” It gives you intelligent request routing across multiple model replicas, automatic autoscaling based on load, GPU sharing policies, multi-model management, and observability — all through standard Kubernetes resources with an OpenAI-compatible API. You’re not just running a model; you’re running an inference platform that can grow with your needs.
The KV Cache Advantage
One of AIBrix’s most compelling features is its distributed KV cache. In a standard vLLM setup, each engine instance maintains its own isolated KV cache — meaning shared prompt prefixes get recomputed independently across replicas, wasting GPU cycles and HBM bandwidth. AIBrix solves this with a tiered caching architecture:
- L1 DRAM-based caching — offloads GPU memory pressure to CPU RAM with minimal latency overhead, significantly expanding effective cache capacity.
- L2 distributed caching — a shared KV cache layer that spans multiple nodes, enabling cross-engine KV reuse so that common prompt prefixes computed by one engine are available to all others.
- KV event synchronization — real-time coordination of cache states across distributed nodes to maximize prefix cache hit rates.
In practice, combining AIBrix’s distributed KV cache with vLLM’s built-in prefix caching has been shown to improve peak throughput by ~50%, reduce average time-to-first-token (TTFT) by ~60% and P99 TTFT by ~70%, and lower inter-token latency by up to 70%. This is something a standalone vLLM deployment, whether installed via pip or run in Docker, simply cannot match.
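For context, the per-engine baseline that AIBrix extends is vLLM’s own automatic prefix caching, which you can enable explicitly on a standalone engine (flag name per recent vLLM releases; check your version’s CLI help):

```shell
# Baseline: automatic prefix caching within a single standalone vLLM engine.
# Shared prompt prefixes are reused only inside this one instance -- the
# cross-replica reuse described above is what AIBrix's L2 cache layer adds.
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-prefix-caching \
  --port 8000
```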
Why the Minikube None Driver?
Minikube’s default drivers run Kubernetes inside a container (Docker driver) or a nested VM (KVM2 driver). That extra layer of abstraction between your pods and your GPUs costs performance and memory, and complicates GPU passthrough. The --driver=none flag tells Minikube to run Kubernetes directly on the host, with no VM or container in between. Combined with --container-runtime=containerd, this means your vLLM pods talk to the NVIDIA GPUs on bare metal with no virtualization overhead. This is the setup you want on a dedicated server where you control the host.
Install Prerequisites
# Kubernetes tools
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Helm
sudo dnf install -y helm
# Kubernetes dependencies
sudo dnf install -y conntrack containernetworking-plugins
# NVIDIA container toolkit
sudo dnf install -y nvidia-container-toolkit
Disable Swap & SELinux
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
sudo setenforce 0
# setenforce only lasts until reboot; make permissive mode persistent:
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
Start Minikube (None Driver)
sudo -E minikube start --driver=none --container-runtime=containerd
Configure kubectl
mkdir -p ~/.kube
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config
chmod 600 ~/.kube/config
Configure Containerd for NVIDIA
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd
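After restarting containerd, it’s worth a quick sanity check that the NVIDIA runtime actually landed in the config and that the host driver is healthy (the path assumes containerd’s default config location):

```shell
# nvidia-ctk edits containerd's TOML config; confirm the runtime entry exists
grep -n "nvidia" /etc/containerd/config.toml
# and confirm the host driver sees the GPUs before scheduling GPU pods
nvidia-smi
```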
Install NVIDIA Device Plugin
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install nvidia-device-plugin nvidia/nvidia-device-plugin \
--namespace nvidia-device-plugin --create-namespace \
--set gpuSharingStrategy=none
Verify GPUs
kubectl describe node | grep -A 15 "Capacity"
# Should show: nvidia.com/gpu: 2 (or however many GPUs your box has)
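If you prefer a machine-readable check over grepping describe output, a jsonpath query works too (the backslash-escaped dot in the resource name is required by kubectl’s jsonpath syntax):

```shell
# Print the GPU capacity advertised by each node; empty output means the
# device plugin has not registered the nvidia.com/gpu resource yet.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
```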
Install AIBrix
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.6.0/aibrix-dependency-v0.6.0.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.6.0/aibrix-core-v0.6.0.yaml
# Wait for pods
kubectl get pods -n aibrix-system -w
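Instead of watching interactively, you can block until every AIBrix pod reports Ready (the timeout value is an arbitrary choice; model-free control-plane pods usually come up much faster):

```shell
# Returns once all pods in aibrix-system are Ready, or fails after the timeout
kubectl wait pods --all -n aibrix-system \
  --for=condition=Ready --timeout=600s
```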
Deploy a vLLM Model
Save as model.yaml. Note that meta-llama/Llama-3.2-3B-Instruct is a gated model on Hugging Face, so the container needs valid credentials (for example an HF_TOKEN environment variable):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: my-model
    model.aibrix.ai/port: "8000"
  name: my-model
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: my-model
  template:
    metadata:
      labels:
        model.aibrix.ai/name: my-model
        model.aibrix.ai/port: "8000"
    spec:
      containers:
        - name: vllm-openai
          image: vllm/vllm-openai:latest
          command:
            - vllm
            - serve
            # vllm serve takes the model as a positional argument
            - "meta-llama/Llama-3.2-3B-Instruct"
            - --served-model-name
            - my-model
            - --port
            - "8000"
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: my-model
  namespace: default
spec:
  ports:
    - port: 8000
      targetPort: 8000
  selector:
    model.aibrix.ai/name: my-model
kubectl apply -f model.yaml
That’s it! Your AIBrix cluster with GPU support is ready.
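To confirm the deployment end to end, you can port-forward the Service and hit vLLM’s OpenAI-compatible endpoint directly (in a full AIBrix setup you would normally route through its gateway instead; the prompt here is arbitrary):

```shell
# Forward local port 8000 to the my-model Service in the background...
kubectl port-forward svc/my-model 8000:8000 &
# ...then issue an OpenAI-style chat completion against the served model name
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```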