Official blog for vLLM, a high-throughput LLM serving engine.
vLLM Project · open-source · 2026-06-02
Score 15
Long-horizon LLM agents create a routing problem that single-turn prompt routers were not designed to solve. A router still needs to know which model is best for the current request, but it also...
High signal Matched: router, model, agents, agentic
vLLM Project · open-source · 2026-06-02
Score 13
We are excited to announce that AutoRound — Intel's state-of-the-art post-training quantization (PTQ) algorithm — is now fully integrated into vLLM-Omni, enabling a streamlined quantize-once,...
High signal Matched: inference, training, post-training, quantization
vLLM Project · open-source · 2026-06-01
Score 17
A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evalua...
High signal Matched: serving, model, evaluation
vLLM Project · open-source · 2026-05-28
Score 19
The v0.5.0 release brings significant architectural improvements to speculative decoding model training, introducing DFlash algorithm support, fully unified online training capabilities, and a...
High signal Matched: decoding, speculative decoding, release, introducing, model, training
vLLM Project · open-source · 2026-05-28
Score 19
Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...
High signal Matched: serving, endpoint, router, model
vLLM Project · open-source · 2026-05-28
Score 15
As organizations increasingly adopt AI-powered development tools, the need for high-performance agentic models that deliver both accuracy and operational efficiency has become critical. Laguna...
High signal Matched: inference, performance, agentic
vLLM Project · open-source · 2026-05-28
Score 11
As post-training workloads continue to scale, we've seen widespread adoption of vLLM as the inference engine of choice. However, two issues repeatedly arise:
High signal Matched: inference, training, post-training
vLLM Project · open-source · 2026-05-26
Score 22
The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and...
High signal Matched: decoding, speculative decoding, eagle, research
vLLM Project · open-source · 2026-05-18
Score 14
TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...
High signal Matched: inference, kv cache
vLLM Project · open-source · 2026-05-14
Score 16
Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...
High signal Matched: serving, throughput, kv cache, moe
vLLM Project · open-source · 2026-05-14
Score 10
We are excited to announce the pre-release of VeRL-Omni, a general reinforcement learning (RL) post-training framework focused on multimodal generative models, built on top of verl and vllm-omni.
High signal Matched: release, training, post-training
vLLM Project · open-source · 2026-05-11
Score 14
TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...
High signal Matched: performance, gpu, quantization
vLLM Project · open-source · 2026-05-06
Score 18
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...
High signal Matched: serving, throughput, distributed, kv cache, agentic
vLLM Project · open-source · 2026-04-28
Score 10
We are excited to support the newly released NVIDIA Nemotron 3 Nano Omni model on vLLM.
High signal Matched: model, agentic
vLLM Project · open-source · 2026-04-22
Score 18
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...
High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context
vLLM Project · open-source · 2026-04-21
Score 12
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...
High signal Matched: serving
vLLM Project · open-source · 2026-04-14
Score 16
Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd.
High signal Matched: korea, seoul, rebellions
vLLM Project · open-source · 2026-04-07
Score 22
TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...
High signal Matched: inference, prefill, itl, gpu, mi300x
vLLM Project · open-source · 2026-04-02
Score 16
With the debut of Gemma 4, vLLM introduces immediate support for Google's most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs,...
High signal Matched: model, open model
vLLM Project · open-source · 2026-03-24
Score 12
We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...
High signal Matched: model, api
vLLM Project · open-source · 2026-03-13
Score 26
EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...
High signal Matched: inference, decoding, speculative decoding, eagle, model
vLLM Project · open-source · 2026-03-11
Score 10
We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.
High signal Matched: model, agent
vLLM Project · open-source · 2026-03-10
Score 18
Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...
High signal Matched: router, release, model, retrieval
vLLM Project · open-source · 2026-03-04
Score 14
This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....
High signal Matched: triton, research
vLLM Project · open-source · 2026-02-27
Score 20
For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.
High signal Matched: inference, performance, rocm
vLLM Project · open-source · 2026-02-26
Score 30
Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...
High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock
vLLM Project · open-source · 2026-02-13
Score 22
DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...
High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization
vLLM Project · open-source · 2026-02-03
Score 24
Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...
High signal Matched: serving, throughput, performance, h200, gb200, blackwell
vLLM Project · open-source · 2026-02-01
Score 18
TL;DR: In collaboration with the open-source community, vLLM and NVIDIA have achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...
High signal Matched: performance, blackwell, model, open-source, oss
vLLM Project · open-source · 2026-01-31
Score 12
Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...
High signal Matched: inference, model, api
vLLM Project · open-source · 2026-01-08
Score 18
In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits for improving overall...
High signal Matched: inference, throughput, kv cache
vLLM Project · open-source · 2026-01-05
Score 16
vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...
High signal Matched: router, release
vLLM Project · open-source · 2026-01-02
Score 12
As a passionate vLLM community member who wants to see vLLM thrive and reach even more developers, I'm excited to announce vLLM Playground – a modern, feature-rich web interface for managing and...
High signal Matched: introducing
vLLM Project · open-source · 2025-12-19
Score 10
We are thrilled to announce a major performance update for vLLM-Omni.
High signal Matched: performance
vLLM Project · open-source · 2025-12-17
Score 16
In v0.11.0, the last code from the vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...
High signal Matched: serving, h200
vLLM Project · open-source · 2025-12-16
Score 14
Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...
High signal Matched: router, performance
vLLM Project · open-source · 2025-12-15
Score 18
Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...
High signal Matched: serving, generation, model
vLLM Project · open-source · 2025-12-15
Score 10
Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...
High signal Matched: model, quantization, agents
vLLM Project · open-source · 2025-12-13
Score 26
Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...
High signal Matched: serving, prefill, router, performance, model
vLLM Project · open-source · 2025-12-13
Score 24
Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...
High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training
vLLM Project · open-source · 2025-12-09
Score 10
Achieve faster, more efficient LLM serving without sacrificing accuracy!
High signal Matched: serving, quantization
vLLM Project · open-source · 2025-12-03
Score 16
Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...
High signal Matched: cuda, gpu, introducing
vLLM Project · open-source · 2025-11-30
Score 20
We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.
High signal Matched: serving, generation, release, model
vLLM Project · open-source · 2025-11-22
Score 18
Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...
High signal Matched: serving, multi-node, launch
vLLM Project · open-source · 2026-05-11
Score 3
How vLLM built the leading deployments of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B.
Watchlist Matched: leaderboard
vLLM Project · open-source · 2026-04-24
Score 3
A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.
Watchlist Matched: long-context
vLLM Project · open-source · 2026-03-30
Score 3
PR #33736 (included in vllm>=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and its...
Watchlist Matched: none
vLLM Project · open-source · 2026-01-23
Score 3
We are working on building the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems.
Watchlist Matched: none
vLLM Project · open-source · 2025-12-27
Score 3
For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.
Watchlist Matched: none
vLLM Project · open-source · 2025-12-14
Score 3
Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right...
Watchlist Matched: none