serving - MLSys Blogs

How Together served MiniMax-M3 efficiently with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway.

inference serving

High signal Matched: inference, serving

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

inference serving distributed benchmark hardware model-release rag agents

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

vLLM Project · open-source · 2026-06-01

vLLM on the DGX Spark: Architecture, Configuration, and Local Evaluation

Score 17

A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evalua...

inference serving model-release research evals

High signal Matched: serving, model, evaluation

NVIDIA Technical Blog · hardware · 2026-05-29

DynoSim: Simulating the Pareto Frontier

Score 15

Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...

inference serving model-release

High signal Matched: serving, prefill, model

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...

inference serving benchmark hardware model-release research quantization evals

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

vLLM Project · open-source · 2026-05-28

From Text to Multimodal Routing: Hardening Vision Signals in vLLM Semantic Router

Score 19

Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...

inference serving moe model-release api

High signal Matched: serving, endpoint, router, model

Lambda · cloud · 2026-05-22

DeepSeek V4: the most expected open-source model ever released, and the quietest landing

Score 18

After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....

inference serving benchmark model-release open-source

High signal Matched: inference, serving, performance, cost, release, model, open-source

AMD ROCm Blogs · hardware · 2026-05-22

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Score 30

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...

inference serving kernel triton benchmark model-release cloud open-source

High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source

Lambda · cloud · 2026-05-21

Lambda Bare Metal Instances: full hardware control with API-driven operations

Score 8

The unit of AI compute has shifted from single hosts to rack-scale systems that integrate NVIDIA GPUs, CPUs, scale-up networking fabrics, and liquid cooling, such as the NVIDIA GB300 NVL72 and NVIDIA Vera Rubin NVL72. Teams at the frontier...

inference serving benchmark cloud training api

High signal Matched: serving, performance, cloud, training, api

LMCache · open-source · 2026-05-21

OpenAI API Is the New IPv4

Score 10

A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the...

inference serving kv-cache api

High signal Matched: serving, lmcache, api

Together AI · inference-infra · 2026-05-15

Together AI and Pearl Research Labs Team Up to Reduce the Cost of AI Inference

Score 24

Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.

inference serving benchmark model-release research api

High signal Matched: inference, endpoint, cost, launch, research

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

inference serving kv-cache moe benchmark

High signal Matched: serving, throughput, kv cache, moe

NVIDIA Technical Blog · hardware · 2026-05-12

How to Eliminate Pipeline Friction in AI Model Serving

Score 16

The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...

inference serving model-release fine-tuning

High signal Matched: serving, model, fine-tuning

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

Together AI · inference-infra · 2026-05-11

Serving DeepSeek-V4: why million-token context is an inference systems problem

Score 22

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...

inference serving kernel hardware long-context api

High signal Matched: inference, serving, endpoint, kernel, b200, long-context

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

vLLM Project · open-source · 2026-05-06

Serving Agentic Workloads at Scale with vLLM x Mooncake

Score 18

TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...

inference serving distributed kv-cache benchmark agents

High signal Matched: serving, throughput, distributed, kv cache, agentic

Cloudflare Blog · cloud · 2026-05-01

Introducing Dynamic Workflows: durable execution that follows the tenant

Score 10

Dynamic Workflows is a library that lets you route durable execution to tenant-provided code on the fly. Built on Dynamic Workers, it enables platforms to serve millions of unique workflows at near-zero idle cost.

serving benchmark model-release

High signal Matched: serve, cost, introducing

vLLM Project · open-source · 2026-04-22

The State of FP8 KV-Cache and Attention Quantization in vLLM

Score 18

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

inference serving kv-cache hardware model-release quantization long-context

High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context

vLLM Project · open-source · 2026-04-21

Disaggregated Serving for Hybrid SSM Models in vLLM

Score 12

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...

inference serving

High signal Matched: serving

NVIDIA Technical Blog · hardware · 2026-04-20

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

Score 18

As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...

inference serving benchmark model-release training quantization

High signal Matched: generation, throughput, fp8, training

NVIDIA Technical Blog · hardware · 2026-04-20

Mitigating Indirect AGENTS.md Injection Attacks in Agentic Environments

Score 10

AI tools are significantly accelerating software development and changing how developers work with code. These tools serve as real-time copilots, automating...

serving agents

High signal Matched: serve, agents, agentic

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

NVIDIA Technical Blog · hardware · 2026-04-02

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

Score 14

In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

serving benchmark hardware model-release

High signal Matched: throughput, gpu, model

NVIDIA Technical Blog · hardware · 2026-04-01

NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design

Score 14

Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak...

serving benchmark

High signal Matched: throughput, cost

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

NVIDIA Technical Blog · hardware · 2026-03-25

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

Score 18

In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

serving benchmark hardware model-release

High signal Matched: throughput, gpu, model

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

NVIDIA Technical Blog · hardware · 2026-03-23

Deploying Disaggregated LLM Inference Workloads on Kubernetes

Score 18

As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages...

inference serving model-release

High signal Matched: inference, serving, prefill, model

Together AI · inference-infra · 2026-03-18

Together AI expands fine-tuning service with tool calling, reasoning, and vision support

Score 14

Together AI expands fine-tuning with native support for tool call, reasoning, and vision-language models, plus 100B+ model training, up to 6× higher throughput, and job cost and ETA estimates.

serving benchmark model-release training fine-tuning

High signal Matched: throughput, cost, model, training, fine-tuning

Hugging Face · open-source · 2026-03-17

Holotron-12B - High Throughput Computer Use Agent

Score 10

No feed summary available yet.

serving benchmark agents

High signal Matched: throughput, agent, computer use

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference serving benchmark model-release research training evals long-context rag

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

Together AI · inference-infra · 2026-03-05

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Score 20

As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...

serving kernel benchmark hardware

High signal Matched: throughput, kernel, flashattention, gpu

Together AI · inference-infra · 2026-03-04

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Score 20

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...

inference serving benchmark long-context

High signal Matched: inference, serving, prefill, throughput, long-context

Modal · inference-infra · 2026-03-04

Product Updates: Directory Snapshots, GLM-5, billing updates and more

Score 8

A roundup of everything we shipped in February: Directory Snapshots for Sandboxes, a free GLM-5 endpoint, new billing API, and more.

serving api

High signal Matched: endpoint, api

vLLM Project · open-source · 2026-02-26

Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Score 30

Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...

serving moe hardware model-release cloud

High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock

vLLM Project · open-source · 2026-02-13

DeepSeek-V3.2 on GB300: Performance Breakthrough

Score 22

DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

serving moe benchmark hardware quantization

High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization

Google Research · big-tech · 2026-02-11

Scheduling in a changing world: Maximizing throughput with time-varying capacity

Score 8

Algorithms & Theory

serving benchmark

High signal Matched: throughput

llm-d · open-source · 2026-02-10

Native KV Cache Offloading to Any Filesystem with llm-d

Score 20

llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.

inference serving kv-cache benchmark hardware

High signal Matched: inference, throughput, kv cache, ttft, gpu

vLLM Project · open-source · 2026-02-03

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Score 24

Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...

inference serving benchmark hardware

High signal Matched: serving, throughput, performance, h200, gb200, blackwell

Together AI · inference-infra · 2026-01-22

Optimizing inference speed and costs: Lessons learned from large-scale deployments

Score 22

Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.

inference serving benchmark hardware

High signal Matched: inference, throughput, latency, cost, gpu

vLLM Project · open-source · 2026-01-08

Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput

Score 18

In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...

inference serving kv-cache benchmark

High signal Matched: inference, throughput, kv cache

SqueezeBits · korea · 2026-01-07

Intel® Gaudi® Hands-on Workshop | A Recap of the Gaudi Workshop with SqueezeBits x Lablup

Score 12

A recap of the Intel® Gaudi® hands-on workshop co-hosted by SqueezeBits and Lablup. AI model compression, fine-tuning, and vLLM serving on Gaudi® hardware with Backend.AI.

inference serving model-release fine-tuning

High signal Matched: serving, model, fine-tuning

SqueezeBits · korea · 2025-12-24

Introducing rebellions ATOM™-MAX

Score 24

Introducing ATOM™-Max, rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...

inference serving benchmark hardware model-release korea

High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-17

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

Score 16

In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...

inference serving hardware

High signal Matched: serving, h200

vLLM Project · open-source · 2025-12-15

Encoder Disaggregation for Scalable Multimodal Model Serving

Score 18

Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...

inference serving model-release

High signal Matched: serving, generation, model

vLLM Project · open-source · 2025-12-13

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Score 26

Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...

inference serving moe benchmark model-release

High signal Matched: serving, prefill, router, performance, model

vLLM Project · open-source · 2025-12-09

Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor

Score 10

Achieve faster, more efficient LLM serving without sacrificing accuracy!

inference serving quantization

High signal Matched: serving, quantization

vLLM Project · open-source · 2025-11-30

Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving

Score 20

We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

inference serving model-release

High signal Matched: serving, generation, release, model

AIBrix · open-source · 2025-11-26

PrisKV: A Colocated Tiered KVCache Store for LLM Serving

Score 22

In recent years, large language models (LLMs) such as GPT, DeepSeek, Doubao and Qwen have advanced rapidly and are reshaping a wide range of industries. As the Scaling Law continues to be validated and pushed to its limits, LLM capabilitie...

inference serving benchmark

High signal Matched: inference, serving, generation, throughput, performance, latency, cost

vLLM Project · open-source · 2025-11-22

Streamlined multi-node serving with Ray symmetric-run

Score 18

Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...

inference serving distributed model-release

High signal Matched: serving, multi-node, launch

Modal · inference-infra · 2025-10-29

Modal + Datalab: Deploy high-throughput document intelligence in <5 minutes

Score 10

We've collaborated with Datalab, the creators of Marker and Surya, to make it faster than ever to deploy document intelligence workflows.

serving benchmark

High signal Matched: throughput

SqueezeBits · korea · 2025-10-28

[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Score 20

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.

inference serving kernel benchmark hardware training

High signal Matched: inference, serving, gemm, performance, h100, training

llm-d · open-source · 2025-09-24

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

Score 18

See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.

inference serving distributed benchmark evals

High signal Matched: inference, throughput, distributed, benchmarks

llm-d · open-source · 2025-09-03

Intelligent Inference Scheduling with llm-d

Score 16

Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.

inference serving benchmark

High signal Matched: inference, throughput, latency

AIBrix · open-source · 2025-08-05

AIBrix v0.4.0 Release: P/D Disaggregation and Expert Parallelism Support, KVCache v1 Connector, KV Event Synchronization & Multi‑Engine Support

Score 20

AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...

inference serving benchmark hardware model-release cloud

High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud

llm-d · open-source · 2025-06-25

llm-d Community Update - June 2025

Score 10

Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.

inference serving

High signal Matched: serving

llm-d · open-source · 2025-05-20

Announcing the llm-d community!

Score 20

Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.

inference serving distributed benchmark model-release frontier-model

High signal Matched: inference, serving, distributed, performance, introducing, sota

AIBrix · open-source · 2025-02-19

AIBrix v0.2.0 Release: Distributed KV Cache, Orchestration and Heterogeneous GPU Support

Score 42

We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...

inference serving distributed kv-cache benchmark hardware model-release agents

High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent

Modular · inference-infra · 2025-02-06

Paged Attention & Prefix Caching Now Available in MAX Serve

Score 14

serving kv-cache

High signal Matched: serve, paged attention

Modular · inference-infra · 2025-01-30

Agentic Building Blocks: Creating AI Agents with MAX Serve and OpenAI Function Calling

Score 10

serving agents

High signal Matched: serve, agents, agentic, function calling

SqueezeBits · korea · 2025-01-20

[vLLM vs TensorRT-LLM] #13. Vision-Language Models

Score 8

This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.

inference serving

High signal Matched: serving

Modular · inference-infra · 2024-12-17

MAX GPU: State of the Art Throughput on a New GenAI platform

Score 14

serving benchmark hardware frontier-model

High signal Matched: throughput, gpu, state of the art

Modular · inference-infra · 2024-12-17

Build a Continuous Chat Interface with Llama 3 and MAX Serve

Score 10

serving

High signal Matched: serve

SqueezeBits · korea · 2024-12-05

[vLLM vs TensorRT-LLM] #10 Serving Multiple LoRAs at Once

Score 14

This article provides a comparative analysis of multi-LoRA serving capabilities of vLLM and TensorRT-LLM frameworks.

inference serving fine-tuning

High signal Matched: serving, lora

SqueezeBits · korea · 2024-10-11

[vLLM vs TensorRT-LLM] #2. Towards Optimal Batching for LLM Serving

Score 10

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.

inference serving

High signal Matched: serving

SqueezeBits · korea · 2024-10-01

[vLLM vs TensorRT-LLM] #1. An Overall Evaluation

Score 22

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...

inference serving benchmark research evals

High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating

Modal · inference-infra · 2024-09-16

Boost your throughput with dynamic batching

Score 14

Learn how we used our new dynamic batching feature to improve throughput and reduce inference costs for the Whisper model with a single line of code!

inference serving benchmark model-release

High signal Matched: inference, throughput, model

Hugging Face · open-source · 2024-07-18

TGI Multi-LoRA: Deploy Once, Serve 30 Models

Score 10

No feed summary available yet.

serving fine-tuning

High signal Matched: serve, lora

SkyPilot · open-source · 2024-07-11

AI on Kubernetes Without the Pain

Score 8

Develop, Train and Serve AI on Kubernetes with SkyPilot.

serving

High signal Matched: serve

SkyPilot · open-source · 2024-02-20

Introducing SkyServe: 50% Cheaper AI Serving on Any Cloud with High Availability

Score 20

SkyServe: A simple, cost-efficient, multi-region/cloud library for serving GenAI models.

inference serving benchmark model-release cloud

High signal Matched: serving, cost, introducing, cloud

SkyPilot · open-source · 2023-12-21

Scaling Mixtral LLM Serving with High GPU Availability and Cost Efficiency

Score 24

A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.

inference serving moe benchmark hardware model-release

High signal Matched: serving, mixtral, cost, gpu, model

SkyPilot · open-source · 2023-06-29

Serving LLM 24x Faster On the Cloud with vLLM and SkyPilot

Score 14

SkyPilot makes the deployment and development of vLLM easy and fast on clouds.

inference serving cloud

High signal Matched: serving, cloud

Hugging Face · open-source · 2022-08-11

Deploying 🤗 ViT on Kubernetes with TF Serving

Score 10

No feed summary available yet.

inference serving

High signal Matched: serving

Hugging Face · open-source · 2022-07-25

Deploying TensorFlow Vision Models in Hugging Face with TF Serving

Score 10

No feed summary available yet.

inference serving

High signal Matched: serving

Cloudflare Blog · cloud · 2026-05-07

When DNSSEC goes wrong: how we responded to the .de TLD outage

Score 4

On May 5, 2026, DENIC published broken DNSSEC signatures for the .de TLD, making millions of domains unreachable. Here's what 1.1.1.1 saw, how serve stale cushioned the impact, and how we restored resolution.

serving

Watchlist Matched: serve

BAIR · research · 2025-03-25

Scaling Up Reinforcement Learning for Traffic Smoothing: A 100-AV Highway Deployment

Score 6

We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and...

serving kernel benchmark model-release research training agents

Watchlist Matched: throughput, kernel, performance, model, paper, training, agent, agents