Kubernetes-native distributed LLM inference project built around vLLM, intelligent scheduling, KV-cache-aware routing, disaggregated serving, and accelerator portability.
llm-d · open-source · 2026-04-21
Score 14
How migrating from a simple vLLM deployment to a robust MLOps platform utilizing KServe, llm-d's intelligent routing, and vLLM solved significant scaling and operational challenges in LLM deployment through deep customization and prefix-ca...
High signal Matched: inference, gpu
llm-d · open-source · 2026-03-13
Score 18
A lightweight ML model trained online from live traffic replaces manually tuned heuristic weights with direct latency predictions, achieving 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic wo...
High signal Matched: latency, ttft, model, weights
llm-d · open-source · 2026-02-10
Score 20
llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.
High signal Matched: inference, throughput, kv cache, ttft, gpu
llm-d · open-source · 2026-02-04
Score 16
llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.
High signal Matched: inference, performance, lora
llm-d · open-source · 2025-12-02
Score 30
llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.
High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota
llm-d · open-source · 2025-10-10
Score 20
llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.
High signal Matched: inference, latency, tokens/sec, gpu, tpu
llm-d · open-source · 2025-09-24
Score 18
See how llm-d's precise KV-cache-aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.
High signal Matched: inference, throughput, distributed, benchmarks
llm-d · open-source · 2025-09-03
Score 16
Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.
High signal Matched: inference, throughput, latency
llm-d · open-source · 2025-07-29
Score 10
llm-d v0.2 introduces well-lit paths for Kubernetes LLM deployment: intelligent scheduling, P/D disaggregation, and MoE support with vLLM optimizations.
High signal Matched: moe
llm-d · open-source · 2025-06-25
Score 10
Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.
High signal Matched: serving
llm-d · open-source · 2025-06-03
Score 12
llm-d hits 1000 GitHub stars! Week 1-2 round-up covers KVTransfer Protocol, InferenceModel API updates, and community resources for LLM inference developers.
High signal Matched: inference, api
llm-d · open-source · 2025-05-20
Score 20
Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.
High signal Matched: inference, serving, distributed, performance, introducing, sota
llm-d · open-source · 2025-05-20
Score 20
Red Hat launches llm-d: an open-source distributed AI inference platform backed by NVIDIA, Google Cloud, and IBM. Scale generative AI with intelligent routing on Kubernetes.
High signal Matched: inference, distributed, release, cloud, open source