MLSys Radar

vLLM Project

Official blog for vLLM, a high-throughput LLM serving engine.

Country: Unknown
Category: open-source
Blog: https://vllm.ai/blog
Feed: discovery status pending

vLLM Project · open-source · 2026-05-28

Native RL APIs in vLLM

Score 11

As post-training workloads continue to scale, we've seen widespread adoption of vLLM as the inference engine of choice. However, two issues repeatedly arise:

inference training

High signal Matched: inference, training, post-training

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

inference serving kv-cache moe benchmark

High signal Matched: serving, throughput, kv cache, moe

vLLM Project · open-source · 2026-04-14

vLLM Korea Meetup 2026 Wrap-Up

Score 16

Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd.

korea

High signal Matched: korea, seoul, rebellions

vLLM Project · open-source · 2026-03-24

Model Runner V2: A Modular and Faster Core for vLLM

Score 12

We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...

model-release api

High signal Matched: model, api

vLLM Project · open-source · 2026-03-04

vLLM Triton Attention Backend Deep Dive

Score 14

This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....

kernel triton research

High signal Matched: triton, research

vLLM Project · open-source · 2026-01-31

Streaming Requests & Realtime API in vLLM

Score 12

Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...

inference model-release api

High signal Matched: inference, model, api

vLLM Project · open-source · 2025-12-13

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Score 24

Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...

inference speculative-decoding benchmark model-release training

High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training

vLLM Project · open-source · 2026-03-30

Extracting hidden states from vLLM

Score 3

PR #33736 (included in vllm>=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and its...

Watchlist Matched: none

vLLM Project · open-source · 2025-12-27

Announcing vllm.ai Website and Some Community Updates

Score 3

For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.

Watchlist Matched: none