Mooncake · open-source · 2026-06-03
SGLang HiCache with Mooncake Backend Benchmark
No feed summary available yet.
High signal Matched: hicache, benchmark
NVIDIA Dynamo · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: kv cache
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: hicache
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: lmcache
FriendliAI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, kv cache, gpu
LMCache · open-source · 2026-06-03
TL;DR: A key contributor to the LMCache community just secured a major investment. This will greatly accelerate our mission of building the best KV cache library for every developer. Come join us in building the future AI-native data layer...
High signal Matched: kv cache, lmcache
LMCache · open-source · 2026-05-27
A collaboration story about LMCache multiprocess mode + MooncakeStore: from 0 to 1, from functional to optimized. 1. Before We Begin. Recently, the LMCache community and the Mooncake community carried out a series of valuable open-source c...
High signal Matched: lmcache, adapter, open-source, open source
LMCache · open-source · 2026-05-21
A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the...
High signal Matched: serving, lmcache, api
vLLM Project · open-source · 2026-05-18
TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...
High signal Matched: inference, kv cache
vLLM Project · open-source · 2026-05-14
Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...
High signal Matched: serving, throughput, kv cache, moe
LMCache · open-source · 2026-05-13
A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters. Key Summary: We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...
High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic
BAIR · research · 2026-05-08
High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model
vLLM Project · open-source · 2026-05-06
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...
High signal Matched: serving, throughput, distributed, kv cache, agentic
LMCache · open-source · 2026-05-05
DeepSeek V4 is an open-weight model that offers state-of-the-art intelligence at a potentially much lower token price than its predecessor, DeepSeek V3.2. But how does DeepSeek V4 do that? Prerequisite: attention...
High signal Matched: kv cache, lmcache, model
LMCache · open-source · 2026-04-29
For years, we have referred to one of the most critical components of modern LLM inference as a “KV cache.” That name made sense once. Today, it is increasingly misleading. What began as a small, ephemeral optimization inside a...
High signal Matched: inference, kv cache, lmcache
LMCache · open-source · 2026-04-23
Overview: Large language model (LLM) inference performance depends heavily on how efficiently the system manages the key-value (KV) cache, the stored attention states that allow the model to avoid recomputing previous tokens. As context length...
High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker
vLLM Project · open-source · 2026-04-22
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...
High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context
LMCache · open-source · 2026-04-18
GTC wrapped up a month ago. Our open-source KV cache management library, LMCache, was shown in Jensen Huang’s keynote, was spotlighted by NVIDIA SVP Kevin Deierling, I was invited to speak at the first-ever industry KV cache tutorial...
High signal Matched: kv cache, lmcache, open-source
LMCache · open-source · 2026-04-16
TL;DR: TurboQuant allows you to put 4x more context in your GPU without blowing up GPU memory or dropping AI’s intelligence. It does so by quantizing the memory of large language models, also known as KV cache, an important bottleneck ment...
High signal Matched: inference, kv cache, lmcache, gpu
Nota AI · korea · 2026-04-08
Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...
High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate
LMCache · open-source · 2026-04-04
Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...
High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents
Nota AI · korea · 2026-03-31
Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...
High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model
Nota AI · korea · 2026-03-23
Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...
High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source
Nota AI · korea · 2026-03-20
NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...
High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization
llm-d · open-source · 2026-02-10
llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.
High signal Matched: inference, throughput, kv cache, ttft, gpu
vLLM Project · open-source · 2026-01-08
In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...
High signal Matched: inference, throughput, kv cache
llm-d · open-source · 2025-12-02
llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.
High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota
Hugging Face · open-source · 2025-06-04
No feed summary available yet.
High signal Matched: kv cache
AIBrix · open-source · 2025-05-22
AIBrix is a composable, cloud-native AI infrastructure toolkit designed to power scalable and cost-effective large language model (LLM) inference. As production demands for memory-efficient and latency-aware LLM services continue to grow,...
High signal Matched: inference, prefix cache, latency, cost, release, model, cloud
Nota AI · korea · 2025-05-07
Jewon Lee | Ki-Ung Song | Seungmin Yang | Donguk Lim | Jaeyeon Kim | Wooksu Shin | Bo-Kyeong Kim | Tae-Ho Kim, EdgeFM Team, Nota AI. Yong Jae Lee, Ph.D., Associate Professor, UW-Madison. Summary: Our method, Trimmed-Llama, reduces t...
High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source
AIBrix · open-source · 2025-02-19
We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...
High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent
Modular · inference-infra · 2025-02-06
Paged Attention & Prefix Caching Now Available in MAX Serve
High signal Matched: serve, paged attention
SqueezeBits · korea · 2024-11-18
This article provides a comparative analysis of the effects of KV cache quantization on vLLM and TensorRT-LLM frameworks.
High signal Matched: kv cache, quantization
AIBrix · open-source · 2024-11-13
In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...
High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source