kv-cache - MLSys Blogs

TL;DR: A key contributor to the LMCache community just secured a major investment. This will greatly accelerate our mission of building the best KV cache library for every developer. Come join us in building the future AI-native data layer...

kv-cache

Open

High signal Matched: kv cache, lmcache

LMCache · open-source · 2026-05-27

When Open Source Meets Open Source: A Joint Effort Between LMCache and Mooncake

Score 11

A collaboration story about LMCache multiprocess mode + MooncakeStore — From 0 to 1, from functional to optimized. 1. Before We Begin Recently, the LMCache community and the Mooncake community carried out a series of valuable open-source c...

kv-cache fine-tuning open-source

Open

High signal Matched: lmcache, adapter, open-source, open source

LMCache · open-source · 2026-05-21

OpenAI API Is the New IPv4

Score 10

A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the...

inference serving kv-cache api

Open

High signal Matched: serving, lmcache, api

vLLM Project · open-source · 2026-05-18

vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache

Score 14

TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...

inference kv-cache

Open

High signal Matched: inference, kv cache

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

inference serving kv-cache moe benchmark

Open

High signal Matched: serving, throughput, kv cache, moe

LMCache · open-source · 2026-05-13

Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X

Score 20

A practitioner’s guide to KV-cache tiering on ROCm — what works, what doesn’t, and the regime where it actually matters. Key Summary We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...

kv-cache moe hardware model-release quantization agents

Open

High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28

.apr-fig { text-align: center; margin: 1.35em 0; line-height: 1.4; } .apr-fig--wide img { display: inline-block; width: 100%; max-width: 100%; height: auto; vertical-align: middle; } .apr-fig--wide-0-8 { max-width: 80%; margin-left: auto;...

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

Open

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

vLLM Project · open-source · 2026-05-06

Serving Agentic Workloads at Scale with vLLM x Mooncake

Score 18

TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...

inference serving distributed kv-cache benchmark agents

Open

High signal Matched: serving, throughput, distributed, kv cache, agentic

LMCache · open-source · 2026-05-05

Deepseek V4 explained, and why it matters to your wallet

Score 12

DeepSeek V4 — an open weight model that gives you the state-of-the-art intelligence, while potentially gives you much cheaper token price than its preceding model, DeepSeek V3.2. But how does DeepSeek v4 does that? Pre-requisite: attention...

kv-cache model-release

Open

High signal Matched: kv cache, lmcache, model

LMCache · open-source · 2026-04-29

Stop Calling It KV Cache: It’s Something Much Bigger

Score 14

For years, we have referred to one of the most critical components of modern LLM inference as a “KV cache.” That name made sense once. Today, it is increasingly misleading. What began as a small, ephemeral optimization inside a...

inference kv-cache

Open

High signal Matched: inference, kv cache, lmcache

LMCache · open-source · 2026-04-23

LMCache on Amazon SageMaker HyperPod: Accelerating LLM Inference with Managed Tiered KV Cache

Score 30

Overview Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context length...

inference kv-cache benchmark hardware model-release cloud

Open

High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker

vLLM Project · open-source · 2026-04-22

The State of FP8 KV-Cache and Attention Quantization in vLLM

Score 18

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

inference serving kv-cache hardware model-release quantization long-context

Open

High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context

LMCache · open-source · 2026-04-18

LMCache: A Journey

Score 12

GTC wrapped up a month ago. Our open-source KV cache management library, LMCache, was shown in Jensen Huang’s keynote, was spotlighted by NVIDIA SVP Kevin Deierling, I was invited to speak at the first-ever industry KV cache tutorial...

kv-cache open-source

Open

High signal Matched: kv cache, lmcache, open-source

LMCache · open-source · 2026-04-16

What is TurboQuant and why it matters for LLM inference, in laymen’s term

Score 16

TL;DR: TurboQuant allows you to put 4x more context in your GPU without blowing up GPU memory or dropping AI’s intelligence. It does so by quantizing the memory of large language models, also known as KV cache, an important bottleneck ment...

inference kv-cache hardware

Open

High signal Matched: inference, kv cache, lmcache, gpu

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

  Jaehoon Lee Technical Content Manager, Nota AI   AI Model Optimization: Why Models Won't Run on HardwareThe Chip Is Ready, but the Model Won't DeployIf you have ever tried deploying an AI model onto your own chip, the following...

inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

Open

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

Open

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

  Jaehoon Lee Technical Content Manager, Nota AI   In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

Open

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

  Jaehoon Lee Technical Content Manager, Nota AI   GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

Open

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Nota AI · korea · 2026-03-20

GenAI Everywhere: The Future of Edge AI Optimization with the New NetsPresso®

Score 26

  NP Product Team, Nota AI   The role of Edge AI is rapidly expanding.Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...

inference kv-cache moe benchmark model-release research korea quantization

Open

High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization

llm-d · open-source · 2026-02-10

Native KV Cache Offloading to Any Filesystem with llm-d

Score 20

llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.

inference serving kv-cache benchmark hardware

Open

High signal Matched: inference, throughput, kv cache, ttft, gpu

vLLM Project · open-source · 2026-01-08

Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput

Score 18

In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...

inference serving kv-cache benchmark

Open

High signal Matched: inference, throughput, kv cache

llm-d · open-source · 2025-12-02

llm-d 0.4: Achieve SOTA Performance Across Accelerators

Score 30

llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.

inference kv-cache speculative-decoding moe benchmark hardware frontier-model

Open

High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota

Hugging Face · open-source · 2025-06-04

KV Cache from scratch in nanoVLM

Score 10

No feed summary available yet.

kv-cache

Open

High signal Matched: kv cache

AIBrix · open-source · 2025-05-22

AIBrix v0.3.0 Release: KVCache Offloading, Prefix Cache, Fairness Routing, and Benchmarking Tools

Score 24

AIBrix is a composable, cloud-native AI infrastructure toolkit designed to power scalable and cost-effective large language model (LLM) inference. As production demands for memory-efficient and latency-aware LLM services continue to grow,...

inference kv-cache benchmark model-release cloud

Open

High signal Matched: inference, prefix cache, latency, cost, release, model, cloud

Nota AI · korea · 2025-05-07

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features</span#x3E;

Score 28

inference kv-cache benchmark model-release research training evals open-source

Open

High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source

AIBrix · open-source · 2025-02-19

AIBrix v0.2.0 Release: Distributed KV Cache, Orchestration and Heterogeneous GPU Support

Score 42

We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...

inference serving distributed kv-cache benchmark hardware model-release agents

Open

High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent

Modular · inference-infra · 2025-02-06

Paged Attention & Prefix Caching Now Available in MAX Serve

Score 14

Paged Attention & Prefix Caching Now Available in MAX Serve

serving kv-cache

Open

High signal Matched: serve, paged attention

SqueezeBits · korea · 2024-11-18

[vLLM vs TensorRT-LLM] #8. KV Cache Quantization

Score 14

This article provides a comparative analysis of the effects of KV cache quantization on vLLM and TensorRT-LLM frameworks.

kv-cache quantization

Open

High signal Matched: kv cache, quantization

AIBrix · open-source · 2024-11-13

Introducing AIBrix v0.1.0: Building the Future of Scalable, Cost-Effective AI Infrastructure for Large Models

Score 32

In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...

inference kv-cache benchmark hardware model-release cloud open-source

Open

High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source