llm-d

Kubernetes-native distributed LLM inference project built around vLLM, intelligent scheduling, KV-cache-aware routing, disaggregated serving, and accelerator portability.

Country: Unknown
Category: open-source
Blog: https://llm-d.ai/
Feed: https://llm-d.ai/blog/rss.xml
Feed discovery status: known

llm-d · open-source · 2026-04-21

Production-Grade LLM Inference at Scale with KServe, llm-d, and vLLM

Score 14

How migrating from a simple vLLM deployment to a robust MLOps platform utilizing KServe, llm-d's intelligent routing, and vLLM solved significant scaling and operational challenges in LLM deployment through deep customization and prefix-ca...

inference hardware

High signal Matched: inference, gpu

llm-d · open-source · 2026-03-13

Predicted-Latency Based Scheduling for LLMs

Score 18

A lightweight ML model trained online from live traffic replaces manually tuned heuristic weights with direct latency predictions, achieving 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic wo...

benchmark model-release

High signal Matched: latency, ttft, model, weights

llm-d · open-source · 2026-02-10

Native KV Cache Offloading to Any Filesystem with llm-d

Score 20

llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.

inference serving kv-cache benchmark hardware

High signal Matched: inference, throughput, kv cache, ttft, gpu

llm-d · open-source · 2026-02-04

llm-d 0.5: Sustaining Performance at Scale

Score 16

llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.

inference benchmark fine-tuning

High signal Matched: inference, performance, lora

llm-d · open-source · 2025-12-02

llm-d 0.4: Achieve SOTA Performance Across Accelerators

Score 30

llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.

inference kv-cache speculative-decoding moe benchmark hardware frontier-model

High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota

llm-d · open-source · 2025-10-10

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

Score 20

llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.

inference benchmark hardware

High signal Matched: inference, latency, tokens/sec, gpu, tpu

llm-d · open-source · 2025-09-24

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

Score 18

See how llm-d's precise KV-cache-aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.

inference serving distributed benchmark evals

High signal Matched: inference, throughput, distributed, benchmarks

llm-d · open-source · 2025-09-03

Intelligent Inference Scheduling with llm-d

Score 16

Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.

inference serving benchmark

High signal Matched: inference, throughput, latency

llm-d · open-source · 2025-07-29

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

Score 10

llm-d v0.2 introduces well-lit paths for Kubernetes LLM deployment: intelligent scheduling, P/D disaggregation, and MoE support with vLLM optimizations.

moe

High signal Matched: moe

llm-d · open-source · 2025-06-25

llm-d Community Update - June 2025

Score 10

Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.

inference serving

High signal Matched: serving

llm-d · open-source · 2025-06-03

llm-d Week 1 Project News Round-Up

Score 12

llm-d hits 1000 GitHub stars! Week 1-2 round-up covers KVTransfer Protocol, InferenceModel API updates, and community resources for LLM inference developers.

inference api

High signal Matched: inference, api

llm-d · open-source · 2025-05-20

Announcing the llm-d community!

Score 20

Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.

inference serving distributed benchmark model-release frontier-model

High signal Matched: inference, serving, distributed, performance, introducing, sota

llm-d · open-source · 2025-05-20

llm-d Press Release

Score 20

Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.

inference distributed model-release cloud open-source

High signal Matched: inference, distributed, release, cloud, open source