vLLM Project - MLSys Blogs

vLLM Project · open-source · 2026-06-02

Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon LLM Agents

Score 15

Long-horizon LLM agents create a routing problem that single-turn prompt routers were not designed to solve. A router still needs to know which model is best for the current request, but it also...

moe model-release agents

High signal Matched: router, model, agents, agentic

vLLM Project · open-source · 2026-06-02

Accelerating vLLM-Omni Inference with AutoRound Quantization

Score 13

We are excited to announce that AutoRound — Intel's state-of-the-art post-training quantization (PTQ) algorithm — is now fully integrated into vLLM-Omni, enabling a streamlined quantize-once,...

inference training quantization

High signal Matched: inference, training, post-training, quantization

vLLM Project · open-source · 2026-06-01

vLLM on the DGX Spark: Architecture, Configuration, and Local Evaluation

Score 17

A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evalua...

inference serving model-release research evals

High signal Matched: serving, model, evaluation

vLLM Project · open-source · 2026-05-28

Speculators v0.5.0: DFlash Support and Online Training

Score 19

The v0.5.0 release brings significant architectural improvements to speculative decoding model training, introducing DFlash algorithm support, fully unified online training capabilities, and a...

inference speculative-decoding model-release training

High signal Matched: decoding, speculative decoding, release, introducing, model, training

vLLM Project · open-source · 2026-05-28

From Text to Multimodal Routing: Hardening Vision Signals in vLLM Semantic Router

Score 19

Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...

inference serving moe model-release api

High signal Matched: serving, endpoint, router, model

vLLM Project · open-source · 2026-05-28

Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor

Score 15

As organizations increasingly adopt AI-powered development tools, the need for high-performance agentic models that deliver both accuracy and operational efficiency has become critical. Laguna...

inference benchmark agents

High signal Matched: inference, performance, agentic

vLLM Project · open-source · 2026-05-28

Native RL APIs in vLLM

Score 11

As post-training workloads continue to scale, we've seen widespread adoption of vLLM as the inference engine of choice. However, two issues repeatedly arise:

inference training

High signal Matched: inference, training, post-training

vLLM Project · open-source · 2026-05-26

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

Score 22

The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and...

inference speculative-decoding research

High signal Matched: decoding, speculative decoding, eagle, research

vLLM Project · open-source · 2026-05-18

vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache

Score 14

TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...

inference kv-cache

High signal Matched: inference, kv cache

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

inference serving kv-cache moe benchmark

High signal Matched: serving, throughput, kv cache, moe

vLLM Project · open-source · 2026-05-14

Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models

Score 10

We are excited to announce the pre-release of VeRL-Omni, a general reinforcement learning (RL) post-training framework focused on multimodal generative models, built on top of verl and vllm-omni.

model-release training

High signal Matched: release, training, post-training

vLLM Project · open-source · 2026-05-11

A First Comprehensive Study of TurboQuant: Accuracy and Performance

Score 14

TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...

benchmark hardware quantization

High signal Matched: performance, gpu, quantization

vLLM Project · open-source · 2026-05-06

Serving Agentic Workloads at Scale with vLLM x Mooncake

Score 18

TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...

inference serving distributed kv-cache benchmark agents

High signal Matched: serving, throughput, distributed, kv cache, agentic

vLLM Project · open-source · 2026-04-28

Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM

Score 10

We are excited to support the newly released NVIDIA Nemotron 3 Nano Omni model on vLLM.

model-release agents

High signal Matched: model, agentic

vLLM Project · open-source · 2026-04-22

The State of FP8 KV-Cache and Attention Quantization in vLLM

Score 18

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

inference serving kv-cache hardware model-release quantization long-context

High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context

vLLM Project · open-source · 2026-04-21

Disaggregated Serving for Hybrid SSM Models in vLLM

Score 12

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...

inference serving

High signal Matched: serving

vLLM Project · open-source · 2026-04-14

vLLM Korea Meetup 2026 Wrap-Up

Score 16

Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd.

korea

High signal Matched: korea, seoul, rebellions

vLLM Project · open-source · 2026-04-07

Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation

Score 22

TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...

inference benchmark hardware

High signal Matched: inference, prefill, itl, gpu, mi300x

vLLM Project · open-source · 2026-04-02

Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models

Score 16

With the debut of Gemma 4, vLLM introduces immediate support for Google's most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs,...

model-release open-source

High signal Matched: model, open model

vLLM Project · open-source · 2026-03-24

Model Runner V2: A Modular and Faster Core for vLLM

Score 12

We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...

model-release api

High signal Matched: model, api

vLLM Project · open-source · 2026-03-13

P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

Score 26

EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...

inference speculative-decoding model-release

High signal Matched: inference, decoding, speculative decoding, eagle, model

vLLM Project · open-source · 2026-03-11

Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM

Score 10

We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.

model-release agents

High signal Matched: model, agent

vLLM Project · open-source · 2026-03-10

vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain

Score 18

Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...

moe model-release rag

High signal Matched: router, release, model, retrieval

vLLM Project · open-source · 2026-03-04

vLLM Triton Attention Backend Deep Dive

Score 14

This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....

kernel triton research

High signal Matched: triton, research

vLLM Project · open-source · 2026-02-27

Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm

Score 20

For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.

inference benchmark hardware

High signal Matched: inference, performance, rocm

vLLM Project · open-source · 2026-02-26

Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Score 30

Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...

serving moe hardware model-release cloud

High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock

vLLM Project · open-source · 2026-02-13

DeepSeek-V3.2 on GB300: Performance Breakthrough

Score 22

DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

serving moe benchmark hardware quantization

High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization

vLLM Project · open-source · 2026-02-03

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Score 24

Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...

inference serving benchmark hardware

High signal Matched: serving, throughput, performance, h200, gb200, blackwell

vLLM Project · open-source · 2026-02-01

GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier

Score 18

TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...

benchmark hardware model-release open-source

High signal Matched: performance, blackwell, model, open-source, oss

vLLM Project · open-source · 2026-01-31

Streaming Requests & Realtime API in vLLM

Score 12

Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...

inference model-release api

High signal Matched: inference, model, api

vLLM Project · open-source · 2026-01-08

Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput

Score 18

In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...

inference serving kv-cache benchmark

High signal Matched: inference, throughput, kv cache

vLLM Project · open-source · 2026-01-05

vLLM Semantic Router v0.1 Iris: The First Major Release

Score 16

vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...

moe model-release

High signal Matched: router, release

vLLM Project · open-source · 2026-01-02

Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers

Score 12

As a passionate vLLM community member who wants to see vLLM thrive and reach even more developers, I'm excited to announce vLLM Playground – a modern, feature-rich web interface for managing and...

model-release

High signal Matched: introducing

vLLM Project · open-source · 2025-12-19

vLLM-Omni Diffusion Cache Acceleration

Score 10

We are thrilled to announce a major performance update for vLLM-Omni.

benchmark

High signal Matched: performance

vLLM Project · open-source · 2025-12-17

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

Score 16

In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...

inference serving hardware

High signal Matched: serving, h200

vLLM Project · open-source · 2025-12-16

AMD × vLLM Semantic Router: Building the System Intelligence Together

Score 14

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...

moe benchmark

High signal Matched: router, performance

vLLM Project · open-source · 2025-12-15

Encoder Disaggregation for Scalable Multimodal Model Serving

Score 18

Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...

inference serving model-release

High signal Matched: serving, generation, model

vLLM Project · open-source · 2025-12-15

Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM

Score 10

Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...

model-release quantization agents

High signal Matched: model, quantization, agents

vLLM Project · open-source · 2025-12-13

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Score 26

Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...

inference serving moe benchmark model-release

High signal Matched: serving, prefill, router, performance, model

vLLM Project · open-source · 2025-12-13

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Score 24

Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...

inference speculative-decoding benchmark model-release training

High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training

vLLM Project · open-source · 2025-12-09

Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor

Score 10

Achieve faster, more efficient LLM serving without sacrificing accuracy!

inference serving quantization

High signal Matched: serving, quantization

vLLM Project · open-source · 2025-12-03

Tracing Hanging and Complicated GPU Kernels Down To The Source Code

Score 16

Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...

kernel cuda hardware model-release

High signal Matched: cuda, gpu, introducing

vLLM Project · open-source · 2025-11-30

Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving

Score 20

We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

inference serving model-release

High signal Matched: serving, generation, release, model

vLLM Project · open-source · 2025-11-22

Streamlined multi-node serving with Ray symmetric-run

Score 18

Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...

inference serving distributed model-release

High signal Matched: serving, multi-node, launch

vLLM Project · open-source · 2026-05-11

vLLM Tops the Artificial Analysis Leaderboard

Score 3

How vLLM built the leading deployments of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B.

evals

Watchlist Matched: leaderboard

vLLM Project · open-source · 2026-04-24

DeepSeek V4 in vLLM: Efficient Long-context Attention

Score 3

A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.

long-context

Watchlist Matched: long-context

vLLM Project · open-source · 2026-03-30

Extracting hidden states from vLLM

Score 3

PR #33736 (included in vllm>=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and its...

Watchlist Matched: none

vLLM Project · open-source · 2026-01-23

Building Mixture-of-Models on AMD GPUs with vLLM-SR

Score 3

We are working on building the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems.

Watchlist Matched: none

vLLM Project · open-source · 2025-12-27

Announcing vllm.ai Website and Some Community Updates

Score 3

For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.

Watchlist Matched: none

vLLM Project · open-source · 2025-12-14

Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

Score 3

Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right...

Watchlist Matched: none