llm-d v0.7: From Feature Introduction to Production Hardening
If v0.6 was about proving what llm-d could do—OTel integration, prefill/decode disaggregation, initial multi-accelerator images—then v0.7 is about making sure you can actually deploy it. The theme across every category is the same: remove friction, broaden hardware reach, and give operators the documentation and CI coverage to trust the system in production.
Recent external validations demonstrated llm-d's performance gains. Those capabilities remain and continue to improve, but v0.7's investment is making them accessible: onboarding guides, tested installation paths, and confidence the guides work on your target platform.
Operator Experience: Standalone Mode and Kustomize-First​
A recurring feedback from platform teams evaluating llm-d was that initial deployments involved too many moving parts. In v0.7, we have fundamentally simplified the day-one experience.
Standalone Mode Default: The default deployment now runs in standalone mode using a generic proxy (Envoy), rather than requiring the full Gateway API Inference Extension (GAIE) stack. You can go from clone to serving in minutes, and layer on gateway features as your deployment matures. While the full GAIE power remains available—GAIE has been updated to v1.5.0, Istio to 1.29.2, and AgentGateway to v1.1.0—it is no longer the first thing you encounter. This responds directly to feedback that initial setup had too many moving parts.
Kustomize Migration: The entire installation surface has been migrated from Helm and Helmfile to Kustomize. Every deployment guide—optimized baseline, prefix-cache-aware routing, prefill/decode disaggregation, workload-variant autoscaling, Intel XPU deployment—now uses Kustomize bases with per-platform overlays for GKE, OpenShift, and other Kubernetes distributions. The previous llm-d-infra and llm-d-modelservice Helm charts are deprecated, and the monolithic Helmfile CI workflow has been replaced with ci-kustomize-dry-run.yaml.
The technical reasoning is pragmatic: Kustomize overlays compose more naturally than Helm value files when patching platform-specific details like node selectors, storage classes, or NCCL tuner configurations without maintaining separate charts. For teams already managing Kubernetes resources with GitOps tools like ArgoCD or Flux, this eliminates a tooling boundary and makes llm-d configurations first-class citizens in existing deployment pipelines.
Documentation Built from Scratch​
The documentation overhaul is the largest single category in this release by volume, with v0.7 introducing over 10,000 lines of new documentation content. To ensure teams can onboard without tribal knowledge, the entire guide taxonomy was restructured for clarity: what was called "inference-scheduling" became "optimized-baseline" to better reflect what it delivers. New guides for batch gateway, flow control, and predicted latency routing fill gaps that previously required reading source code.
Architecture Reference: Full coverage of EPP design (scheduling, flow control, request handling), KV management subsystems (KV Indexer, KV Offloader, routing plugin integration), disaggregation operations (KV cache transfer, NIXL library, routing sidecar coordination), autoscaling mechanics, and the latency predictor. A new API reference section documents EPP HTTP headers, EndpointPickerConfig, InferenceModelRewrite, InferenceObjective, and InferencePool resources with concrete examples.
Visual Topology: New diagrams make system architecture legible at a glance. Each major subsystem—KV cache indexing, prefill/decode coordination, multi-tier offloading—has visual representations showing component interaction and data flow, providing the conceptual grounding operators need to reason about performance and troubleshoot issues.
Operational Tooling: Interactive tuning scripts for flow control let operators explore parameter spaces without editing YAML by hand. A new health check script validates deployments end-to-end, catching configuration errors before they surface in production. The autoscaling documentation received a five-part series covering fundamentals, HPA/KEDA design choices, Workload Variant Autoscaler deep-dive, native HPA patterns, and guide alignment—clarifying when WVA's variant-aware logic provides value over simpler approaches.
Broadening Hardware and Disaggregation Reach​
v0.7 expands our footprint as a hardware-agnostic control plane, readying the stack for the next generation of accelerators.
CUDA 13 and Blackwell Support: We have updated from CUDA 12.9.1 to 13.0.2—a breaking change requiring NVIDIA driver 580 or later. This unlocks first-class support for Blackwell-generation hardware, including a dedicated llm-d-cuda-gb200 Docker image with GB200-specific DeepEP wheels and NVSHMEM support. The TORCH_CUDA_ARCH_LIST now natively includes sm100 targets. The core inference stack has also been updated, bringing vLLM to 0.19.1 (now tracking the neuralmagic fork with targeted NaN-fix cherry-picks), LMCache to v0.4.4, and FlashInfer to v0.6.6. Build infrastructure improvements include sccache in the runtime stage and MAX_JOBS=3 as default to prevent out-of-memory failures during source builds.
Expanded Ecosystem: We have added Rebellions as a supported accelerator vendor, shipped an AMD ROCm deployment guide, and enabled Tensor Parallelism (TP=2) for Qwen3-32B on Intel XPUs. TPU support now spans v6e and v7 with consistent configurations across generations, including a new RunAI model streamer integration. HPU and CPU images have been updated.
If you are evaluating llm-d against a mixed-accelerator fleet—combining NVIDIA GPUs with AMD MI-series, Intel XPUs, Google TPUs, or Rebellions ATOM chips—v0.7 is the first release where every supported platform ships with tested images, validated guides, and production-grade documentation.
Prefill/Decode (P/D) Disaggregation: Prefill/decode disaggregation—where prefill computation (token-in, compute-bound) runs on separate pods from decode (token-out, memory-bandwidth-bound)—enables higher throughput by allowing independent scaling of each phase and eliminating prefill interference with decode latency. Version 0.7 extends P/D support to Google TPU v7 with a complete guide for Qwen 3.5, adds an Oracle Cloud Infrastructure deployment path, and ships a comprehensive three-part disaggregation architecture series explaining how KV cache transfer works, what NIXL (NVIDIA Inference Xfer Library) provides, and how the routing sidecar coordinates transfer between prefill and decode instances. The guide has been rewritten with Kustomize and now includes per-platform overlays for NVIDIA GPUs (vLLM and SGLang), Google TPUs, and Oracle Cloud Infrastructure with platform-specific tuning isolated in overlay patches.
Workload-Aware Orchestration​
Intelligence is increasingly moving from the infrastructure layer into workload characteristics. v0.7 introduces two major experimental features to handle complex multi-tenant environments:
Flow Control: We have introduced centralized request queuing and admission control at the Router level. Traditional load balancing treats all requests identically and commits them immediately to backend queues. This breaks down for LLM inference where resource consumption varies wildly per request—a 32K-token prefill followed by minimal decode consumes radically different resources than a balanced workload—and where multi-tenant environments require fairness guarantees.
Flow Control moves request queuing to the llm-d Router, enabling intelligent admission control before requests reach model servers. Incoming requests are classified by a FlowKey combining Fairness ID and Priority. The Endpoint Picker (EPP) maintains separate in-memory queues for each flow and dispatches based on three criteria: priority (service highest-priority bands first), fairness (cycle through tenants within a band), and ordering (maintain request sequence within a flow). This architecture prevents noisy neighbors from starving other tenants, enables "no-regret scheduling" that holds requests during saturation rather than committing them to stuck local queues, and integrates with InferenceObjective resources for dynamic SLO enforcement. The feature ships with defaults explicitly designed to mimic legacy first-come-first-served behavior, ensuring seamless transition for existing workloads.
Batch Gateway: An OpenAI-compatible Batch API designed for large-scale batch inference. Production LLM deployments rarely serve only interactive traffic. Batch workloads—large-scale embedding generation, synthetic data creation, offline document processing—represent significant compute demand but have fundamentally different latency tolerance and scheduling requirements than chat or completion endpoints.
The new Batch Gateway provides an OpenAI-compatible /v1/batches and /v1/files API for submitting and tracking batch jobs containing up to 50,000 requests. The architecture separates batch from interactive traffic while enabling both to coexist on shared infrastructure through integration with flow control backpressure mechanisms. Three components handle the batch lifecycle: API Server (REST endpoints for job submission, management, file handling), Batch Processor (pulls jobs from a priority queue, builds per-model execution plans, dispatches to llm-d Router, writes results to output files), and Garbage Collector (cleans up expired jobs and files based on configurable retention policies). Storage is pluggable across layers: PostgreSQL or Redis/Valkey for metadata, Redis/Valkey for queue and event streams, and S3 or filesystem for input and output files.
Both Flow Control and Batch Gateway ship as experimental features, reflecting their operational complexity and the need for real-world validation before stabilization.
Multi-Tier KV Caching and Intelligent Routing​
Managing KV-cache pressure effectively dictates whether a system can sustain low tail-latency under load. Version 0.7 significantly expands KV cache management capabilities across three dimensions: more intelligent routing, multi-tier storage, and comprehensive architectural documentation.
Tiered Prefix Cache Storage: Building on our storage architectures, v0.7 introduces an AWS EFS backend, formalizing multi-tier KV cache offloading from GPU HBM to CPU DRAM to persistent storage (local NVMe or network-attached storage like AWS EFS). This enables handling longer contexts than would fit in GPU memory alone, trading storage I/O latency for the ability to preserve and reuse expensive prefill computation. The tiered storage architecture integrates with vLLM's OffloadingConnector, making the optimization transparent to model servers while the router manages tier placement and retrieval.
Precise Prefix Cache-Aware Routing: Real-time prefix cache tracking is now enabled by default, replacing hashSeed-based workarounds with precise tracking. The precise-prefix-cache-aware guide now enables speculative indexing by default, removing hashSeed and PYTHONHASHSEED workarounds that previously limited cache effectiveness. The router tracks cache state in real-time across the fleet, scoring candidate endpoints based on prefix overlap and routing to pods most likely to have relevant blocks already resident in GPU memory. This routing intelligence has been validated across platforms—passing nightly tests on both Container Kubernetes Service and OpenShift—and now extends to Intel XPU deployments where TP=2 configurations for Qwen3-32B are fully supported and tested.
Predicted Latency-Based Scheduling: We have completely rewritten the predicted latency architecture with a new production-ready guide. The architecture documentation explains how the latency predictor provides routing intelligence, allowing the system to natively route requests based on latency predictions rather than basic heuristics.
New documentation sections explain the KV cache management subsystem in depth: how the KV Indexer maintains a distributed index of block locations across pods, how the KV Offloader coordinates tier transitions, and how routing plugins consume this state to make cache-aware placement decisions.
CI That Tests What You Actually Deploy​
The operational rigor of a system is only as good as its CI pipeline. We have entirely rewritten our testing infrastructure, moving away from monolithic accelerator scripts (550-600 lines each) to focused, per-guide nightly test matrices.
Multi-Platform Validation: New nightly jobs cover optimized baseline, precise prefix cache, predicted latency, and tiered prefix cache with CPU offloading, each parameterized across CKS, GKE, and OpenShift. This ensures guides work on the platforms they claim to support.
Real Optimization Assertions: The predicted latency validation workflow exemplifies the new rigor. Our predicted latency validation scrapes EPP /metrics endpoints to verify the predictor served real predictions instead of falling back to heuristics. This is not smoke testing; it asserts that the optimization path you configured is actually running.
Additional CI improvements include an image verification workflow, a /test-nightly slash command with glob patterns enabling fork contributors to trigger targeted nightly runs, and a badge matrix generator keeping the README status table synchronized with actual test results. Nightly results now post status comments directly on pull requests, surfacing failures immediately rather than requiring manual log inspection.
Across 63 changed workflow files, the net delta is actually 2,785 fewer lines of YAML—consolidation rather than expansion—while covering substantially more of the deployment surface.
What This Means for Your Evaluation​
If you evaluated llm-d previously and concluded it was not ready for production use, v0.7 merits a second look. The deployment path is simpler through standalone mode and Kustomize-native installation. Hardware coverage spans Blackwell GPUs, AMD ROCm, Intel XPUs, Google TPUs, Rebellions ATOM, and CPU deployments. Documentation is comprehensive enough to onboard a team without requiring tribal knowledge or source code archaeology. The CI matrix covers enough deployment surface to provide confidence that guides actually work on the platforms they claim to support.
This release represents a 3.5× increase in pull request volume over v0.6. The project is accelerating, and the v0.7 investment in foundations—documentation, testing, installation simplification—is the kind of work that compounds in value as the project scales.
What Is Next?​
The project is accelerating, and the foundational investments made in v0.7 set the stage for upcoming milestones. The v0.8 release milestone focuses on production readiness, reinforcement learning workloads, and expanded accelerator coverage.
Production graduation: Flow Control and Batch Gateway move from experimental to production-ready. Flow Control receives guide documentation for multi-tenant deployments, while Batch Gateway undergoes hardening with enterprise storage backends and at-scale validation. Multi-modal serving—image+text models like LLaVA and Qwen-VL—gains production guides and end-to-end testing.
Reinforcement learning infrastructure: Initial RL support lands through Python scheduler integration with the Endpoint Picker, enabling gym-style RL training loops to directly influence routing decisions. Non-Kubernetes deployment mode enables llm-d in Slurm clusters and research labs through file-based service discovery. GPU time-slicing platform support allows multiple RL rollout workers to share accelerators during policy evaluation phases.
Advanced disaggregation: NIXL (NVIDIA Inference Xfer Library) receives TTL-based cache eviction and request draining for graceful pod termination. KV transfer connectors expand beyond NVIDIA with Mooncake (Alibaba) and MoRI (Microsoft) integrations. GB200 NVL72 configurations gain dedicated guides. Multi-tier storage offloading extends to object storage backends alongside the existing filesystem tier.
CI and operational rigor: Nightly tests expand to TPU and AMD ROCm platforms. Each guide receives automated lm-eval and performance benchmarking in CI. The release process itself gets documented, and CI jobs migrate to nightly-built images rather than on-demand builds. Monitoring setup becomes streamlined with consolidated dashboards and alerting configurations across guides.
Deployment flexibility: Multi-model serving guides demonstrate running different models on shared hardware with LoRA adapter scheduling. Rollout guides cover blue/green deployments and version management for zero-downtime updates. Gateway installation prerequisites consolidate into a single reference, and request draining configurations become standard across all guides.
We invite the community to explore the new capabilities and deploy the new well-lit paths. The full release notes and changelog are available on GitHub.
Community and Contribution​
The llm-d community continues to expand: 23 new contributors in this release alone, with hardware vendor contributions from Rebellions, Intel, and AMD alongside ecosystem partners and individual developers. The project's velocity reflects both technical momentum and broadening adoption.
To get started with llm-d v0.7:
- Documentation: llm-d.ai
- Guides: Well-lit paths for optimized baseline, prefix-cache routing, P/D disaggregation, and more
- Release notes: v0.7.0 on GitHub
- Slack community: Join the conversation at llm-d.ai/slack
Follow @llm_d on Twitter for updates, or check llm-d.ai/community/events for upcoming community events. Come build with us.
