NVIDIA Dynamo · open-source · 2026-06-03
Disaggregated Serving
No feed summary available yet.
High signal Matched: serving
NVIDIA Dynamo · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: serving
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: serving
BentoML · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, serve, performance, model
BentoML · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, serve
FuriosaAI · hardware · 2026-06-03
No feed summary available yet.
High signal Matched: throughput, furiosa, sdk
Together AI · inference-infra · 2026-06-02
How Together served MiniMax-M3 efficiently with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway.
High signal Matched: inference, serving
Lambda · cloud · 2026-06-01
When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...
High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic
vLLM Project · open-source · 2026-06-01
A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evalua...
High signal Matched: serving, model, evaluation
NVIDIA Technical Blog · hardware · 2026-05-29
Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...
High signal Matched: serving, prefill, model
Nota AI · korea · 2026-05-29
Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...
High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard
vLLM Project · open-source · 2026-05-28
Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...
High signal Matched: serving, endpoint, router, model
Lambda · cloud · 2026-05-22
After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....
High signal Matched: inference, serving, performance, cost, release, model, open-source
AMD ROCm Blogs · hardware · 2026-05-22
Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...
High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source
Lambda · cloud · 2026-05-21
The unit of AI compute has shifted from single hosts to rack-scale systems that integrate NVIDIA GPUs, CPUs, scale-up networking fabrics, and liquid cooling, such as the NVIDIA GB300 NVL72 and NVIDIA Vera Rubin NVL72. Teams at the frontier...
High signal Matched: serving, performance, cloud, training, api
LMCache · open-source · 2026-05-21
A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the...
High signal Matched: serving, lmcache, api
Together AI · inference-infra · 2026-05-15
Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.
High signal Matched: inference, endpoint, cost, launch, research
vLLM Project · open-source · 2026-05-14
Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...
High signal Matched: serving, throughput, kv cache, moe
NVIDIA Technical Blog · hardware · 2026-05-12
The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...
High signal Matched: serving, model, fine-tuning
Nota AI · korea · 2026-05-11
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
Together AI · inference-infra · 2026-05-11
DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...
High signal Matched: inference, serving, endpoint, kernel, b200, long-context
BAIR · research · 2026-05-08
High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model
vLLM Project · open-source · 2026-05-06
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...
High signal Matched: serving, throughput, distributed, kv cache, agentic
Cloudflare Blog · cloud · 2026-05-01
Dynamic Workflows is a library that lets you route durable execution to tenant-provided code on the fly. Built on Dynamic Workers, it enables platforms to serve millions of unique workflows at near-zero idle cost.
High signal Matched: serve, cost, introducing
vLLM Project · open-source · 2026-04-22
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...
High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context
vLLM Project · open-source · 2026-04-21
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...
High signal Matched: serving
NVIDIA Technical Blog · hardware · 2026-04-20
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...
High signal Matched: generation, throughput, fp8, training
NVIDIA Technical Blog · hardware · 2026-04-20
AI tools are significantly accelerating software development and changing how developers work with code. These tools serve as real-time copilots, automating...
High signal Matched: serve, agents, agentic
LMCache · open-source · 2026-04-04
Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...
High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents
NVIDIA Technical Blog · hardware · 2026-04-02
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...
High signal Matched: throughput, gpu, model
NVIDIA Technical Blog · hardware · 2026-04-01
Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak...
High signal Matched: throughput, cost
Nota AI · korea · 2026-03-31
Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...
High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model
NVIDIA Technical Blog · hardware · 2026-03-25
In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...
High signal Matched: throughput, gpu, model
Nota AI · korea · 2026-03-23
Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...
High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source
NVIDIA Technical Blog · hardware · 2026-03-23
As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages...
High signal Matched: inference, serving, prefill, model
Together AI · inference-infra · 2026-03-18
Together AI expands fine-tuning with native support for tool call, reasoning, and vision-language models, plus 100B+ model training, up to 6× higher throughput, and job cost and ETA estimates.
High signal Matched: throughput, cost, model, training, fine-tuning
Hugging Face · open-source · 2026-03-17
No feed summary available yet.
High signal Matched: throughput, agent, computer use
Nota AI · korea · 2026-03-13
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
BAIR · research · 2026-03-13
Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...
High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag
Together AI · inference-infra · 2026-03-05
As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...
High signal Matched: throughput, kernel, flashattention, gpu
Together AI · inference-infra · 2026-03-04
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...
High signal Matched: inference, serving, prefill, throughput, long-context
Modal · inference-infra · 2026-03-04
A roundup of everything we shipped in February: Directory Snapshots for Sandboxes, a free GLM-5 endpoint, new billing API, and more.
High signal Matched: endpoint, api
vLLM Project · open-source · 2026-02-26
Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...
High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock
vLLM Project · open-source · 2026-02-13
DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...
High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization
Google Research · big-tech · 2026-02-11
Algorithms & Theory
High signal Matched: throughput
llm-d · open-source · 2026-02-10
llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.
High signal Matched: inference, throughput, kv cache, ttft, gpu
vLLM Project · open-source · 2026-02-03
Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...
High signal Matched: serving, throughput, performance, h200, gb200, blackwell
Together AI · inference-infra · 2026-01-22
Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.
High signal Matched: inference, throughput, latency, cost, gpu
vLLM Project · open-source · 2026-01-08
In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...
High signal Matched: inference, throughput, kv cache
SqueezeBits · korea · 2026-01-07
A recap of the Intel® Gaudi® hands-on workshop co-hosted by SqueezeBits and Lablup. AI model compression, fine-tuning, and vLLM serving on Gaudi® hardware with Backend.AI.
High signal Matched: serving, model, fine-tuning
SqueezeBits · korea · 2025-12-24
Introducing ATOM™-Max, Rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...
High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions
Nota AI · korea · 2025-12-19
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
vLLM Project · open-source · 2025-12-17
In v0.11.0, the last code from the vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...
High signal Matched: serving, h200
vLLM Project · open-source · 2025-12-15
Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...
High signal Matched: serving, generation, model
vLLM Project · open-source · 2025-12-13
Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...
High signal Matched: serving, prefill, router, performance, model
vLLM Project · open-source · 2025-12-09
Achieve faster, more efficient LLM serving without sacrificing accuracy!
High signal Matched: serving, quantization
vLLM Project · open-source · 2025-11-30
We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.
High signal Matched: serving, generation, release, model
AIBrix · open-source · 2025-11-26
In recent years, large language models (LLMs) such as GPT, DeepSeek, Doubao and Qwen have advanced rapidly and are reshaping a wide range of industries. As the Scaling Law continues to be validated and pushed to its limits, LLM capabilitie...
High signal Matched: inference, serving, generation, throughput, performance, latency, cost
vLLM Project · open-source · 2025-11-22
Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...
High signal Matched: serving, multi-node, launch
Modal · inference-infra · 2025-10-29
We've collaborated with Datalab, the creators of Marker and Surya, to make it faster than ever to deploy document intelligence workflows.
High signal Matched: throughput
SqueezeBits · korea · 2025-10-28
Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.
High signal Matched: inference, serving, gemm, performance, h100, training
llm-d · open-source · 2025-09-24
See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.
High signal Matched: inference, throughput, distributed, benchmarks
llm-d · open-source · 2025-09-03
Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.
High signal Matched: inference, throughput, latency
AIBrix · open-source · 2025-08-05
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...
High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud
llm-d · open-source · 2025-06-25
Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.
High signal Matched: serving
llm-d · open-source · 2025-05-20
Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.
High signal Matched: inference, serving, distributed, performance, introducing, sota
AIBrix · open-source · 2025-02-19
We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...
High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent
Modular · inference-infra · 2025-02-06
Paged Attention & Prefix Caching Now Available in MAX Serve
High signal Matched: serve, paged attention
Modular · inference-infra · 2025-01-30
Agentic Building Blocks: Creating AI Agents with MAX Serve and OpenAI Function Calling
High signal Matched: serve, agents, agentic, function calling
SqueezeBits · korea · 2025-01-20
This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.
High signal Matched: serving
Modular · inference-infra · 2024-12-17
MAX GPU: State of the Art Throughput on a New GenAI platform
High signal Matched: throughput, gpu, state of the art
Modular · inference-infra · 2024-12-17
Build a Continuous Chat Interface with Llama 3 and MAX Serve
High signal Matched: serve
SqueezeBits · korea · 2024-12-05
This article provides a comparative analysis of multi-LoRA serving capabilities of vLLM and TensorRT-LLM frameworks.
High signal Matched: serving, lora
SqueezeBits · korea · 2024-10-11
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.
High signal Matched: serving
SqueezeBits · korea · 2024-10-01
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...
High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating
Modal · inference-infra · 2024-09-16
Learn how we used our new dynamic batching feature to improve throughput and reduce inference costs for the Whisper model with a single line of code!
High signal Matched: inference, throughput, model
Hugging Face · open-source · 2024-07-18
No feed summary available yet.
High signal Matched: serve, lora
SkyPilot · open-source · 2024-07-11
Develop, Train and Serve AI on Kubernetes with SkyPilot.
High signal Matched: serve
SkyPilot · open-source · 2024-02-20
SkyServe: A simple, cost-efficient, multi-region/cloud library for serving GenAI models.
High signal Matched: serving, cost, introducing, cloud
SkyPilot · open-source · 2023-12-21
A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.
High signal Matched: serving, mixtral, cost, gpu, model
SkyPilot · open-source · 2023-06-29
SkyPilot makes the deployment and development of vLLM easy and fast on clouds.
High signal Matched: serving, cloud
Hugging Face · open-source · 2022-08-11
No feed summary available yet.
High signal Matched: serving
Hugging Face · open-source · 2022-07-25
No feed summary available yet.
High signal Matched: serving
Cloudflare Blog · cloud · 2026-05-07
On May 5, 2026, DENIC published broken DNSSEC signatures for the .de TLD, making millions of domains unreachable. Here's what 1.1.1.1 saw, how serve stale cushioned the impact, and how we restored resolution.
Watchlist Matched: serve
BAIR · research · 2025-03-25
We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and...
Watchlist Matched: throughput, kernel, performance, model, paper, training, agent, agents