MLSys Radar

inference

Moreh · korea · 2026-06-03

Optimizing Long-Context Prefill on Multiple (Older-Generation) GPU Nodes

Score 23

No feed summary available yet.

inference hardware long-context

High signal Matched: prefill, generation, gpu, long-context

NVIDIA Dynamo · open-source · 2026-06-03

Full-Stack Optimizations for Agentic Inference

Score 19

No feed summary available yet.

inference agents

High signal Matched: inference, agentic

VESSL AI · korea · 2026-06-03

GTC 2026: GPU Infra Trends — Inference to Physical AI

Score 15

No feed summary available yet.

inference hardware

High signal Matched: inference, gpu

Moreh · korea · 2026-06-03

Distributed Inference on Heterogeneous Accelerators Including GPUs, Rubin CPX, and AI Accelerators

Score 15

No feed summary available yet.

inference distributed

High signal Matched: inference, distributed

Moreh · korea · 2026-06-03

21K Output Tokens Per Second DeepSeek Inference on AMD Instinct MI300X GPUs with Expert Parallelism

Score 15

No feed summary available yet.

inference hardware

High signal Matched: inference, mi300x

NVIDIA Dynamo · open-source · 2026-06-03

Disaggregated Serving

Score 15

No feed summary available yet.

inference serving

High signal Matched: serving

Mooncake · open-source · 2026-06-03

SGLang Disaggregated Serving with MooncakeTransferEngine

Score 13

No feed summary available yet.

inference serving

High signal Matched: serving

Mooncake · open-source · 2026-06-03

LMDeploy Disaggregated Serving with MooncakeTransferEngine

Score 13

No feed summary available yet.

inference serving

High signal Matched: serving

Gcore · cloud · 2026-06-03

Everywhere AI: Scalable enterprise AI training and inference across environments

Score 13

No feed summary available yet.

inference training

High signal Matched: inference, training

Perplexity Research · model-lab · 2026-06-03

Rethinking Search as Code Generation

Score 12

No feed summary available yet.

High signal Matched: generation

Perplexity Research · model-lab · 2026-06-03

AI Inference Engineer (New York City; Palo Alto; San Francisco)

Score 12

No feed summary available yet.

High signal Matched: inference

KubeAI · open-source · 2026-06-03

Configure Text Generation Models

Score 11

No feed summary available yet.

High signal Matched: generation

Gcore · cloud · 2026-06-03

Everywhere Inference: Public serverless inference for real-time AI workloads

Score 9

No feed summary available yet.

High signal Matched: inference

BentoML · inference-infra · 2026-06-03

Bento Inference Platform: Full control without the complexity. Self-host anywhere. Serve any model. Optimize for performance.

Score 25

No feed summary available yet.

inference serving benchmark model-release

High signal Matched: inference, serve, performance, model

TensorRT-LLM · open-source · 2026-06-03

Distributed LLM Generation

Score 19

No feed summary available yet.

inference distributed

High signal Matched: generation, distributed

TensorRT-LLM · open-source · 2026-06-03

Speculative Decoding

Score 19

No feed summary available yet.

inference speculative-decoding

High signal Matched: decoding, speculative decoding

BentoML · inference-infra · 2026-06-03

BentoML Open-Source: The most flexible way to serve AI/ML models and custom inference pipelines in production

Score 17

No feed summary available yet.

inference serving

High signal Matched: inference, serve

TensorRT-LLM · open-source · 2026-06-03

Generate text with guided decoding

Score 15

No feed summary available yet.

High signal Matched: decoding

BentoML · inference-infra · 2026-06-03

LLM Inference Handbook

Score 13

No feed summary available yet.

High signal Matched: inference

Nebius · cloud · 2026-06-03

LK losses: Training speculative decoding draft models to directly maximize acceptance rate

Score 13

No feed summary available yet.

inference speculative-decoding training

High signal Matched: decoding, speculative decoding, training

Crusoe · cloud · 2026-06-03

Model Inference

Score 13

No feed summary available yet.

inference model-release

High signal Matched: inference, model

Crusoe · cloud · 2026-06-03

Managed Inference

Score 9

No feed summary available yet.

High signal Matched: inference

FriendliAI · inference-infra · 2026-06-03

Scale Beyond GPU Memory Limits with Host KV Cache for Dedicated Endpoints

Score 23

No feed summary available yet.

inference kv-cache hardware

High signal Matched: inference, kv cache, gpu

LightSeek Foundation · research · 2026-06-03

TorchSpec: Speculative Decoding Training at Scale

Score 21

No feed summary available yet.

inference speculative-decoding model-release training

High signal Matched: inference, decoding, speculative decoding, model, training

FriendliAI · inference-infra · 2026-06-03

Hit your SLA, cut costs. Download the Friendli Guide to Inference Performance Optimization

Score 19

No feed summary available yet.

inference benchmark

High signal Matched: inference, performance

FuriosaAI · hardware · 2026-06-03

FuriosaAI partners with Broadcom to build next-generation inference platform for the Agentic Era

Score 19

No feed summary available yet.

inference agents

High signal Matched: inference, generation, agentic

Fireworks AI · inference-infra · 2026-06-03

Training-Inference Parity in MoE Models: Where Numerics Drift

Score 19

No feed summary available yet.

High signal Matched: inference, moe

LightSeek Foundation · research · 2026-06-03

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

Score 17

No feed summary available yet.

inference speculative-decoding training

High signal Matched: decoding, speculative decoding, eagle, training

LightSeek Foundation · research · 2026-06-03

TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads

Score 17

No feed summary available yet.

inference kernel benchmark agents

High signal Matched: inference, kernel, performance, agentic

FriendliAI · inference-infra · 2026-06-03

FriendliAI Expands to San Francisco to Scale Frontier AI Inference for Open-Weight and Custom Models

Score 15

No feed summary available yet.

High signal Matched: inference

Fireworks AI · inference-infra · 2026-06-03

Introducing Fireworks on Microsoft Foundry: Bringing Best-in-Class Open Model inference to Azure

Score 15

No feed summary available yet.

inference model-release open-source

High signal Matched: inference, model, open model

Baseten · inference-infra · 2026-06-03

Timestep distillation: 2.5x faster FLUX.2 image generation

Score 15

No feed summary available yet.

High signal Matched: generation

Mistral AI · model-lab · 2026-06-03

Compute: Frontier-scale infrastructure for training and inference.

Score 14

No feed summary available yet.

inference training

High signal Matched: inference, training

Cerebras · hardware · 2026-06-03

Cerebras Brings Trillion Parameter Inference to Enterprises with Kimi K2.6

Score 13

No feed summary available yet.

High signal Matched: inference

Together AI · inference-infra · 2026-06-02

Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets

Score 17

How Together served MiniMax-M3 efficiently with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway.

inference serving

High signal Matched: inference, serving

AWS Machine Learning Blog · cloud · 2026-06-02

OpenAI models and Codex on Amazon Bedrock are now generally available

Score 13

GPT-5.5, GPT-5.4, and Codex are now generally available on Amazon Bedrock. Deploy them in production applications and agents today, on Bedrock’s high performance inference engine. 

inference benchmark cloud agents

High signal Matched: inference, performance, bedrock, agents

AWS Machine Learning Blog · cloud · 2026-06-02

Accelerate LLM model loading and increase context windows with GPUDirect on Amazon FSx for Lustre and TurboQuant

Score 15

If you’re iterating on deploying large language models (LLMs) on AWS GPU instances, you’ve probably noticed the larger the model to be loaded into GPU High Bandwidth Memory (HBM), the longer the painful wait until the GPUs are ready for in...

inference hardware model-release

High signal Matched: inference, gpu, model

vLLM Project · open-source · 2026-06-02

Accelerating vLLM-Omni Inference with AutoRound Quantization

Score 13

We are excited to announce that AutoRound — Intel's state-of-the-art post-training quantization (PTQ) algorithm — is now fully integrated into vLLM-Omni, enabling a streamlined quantize-once,...

inference training quantization

High signal Matched: inference, training, post-training, quantization

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

inference serving distributed benchmark hardware model-release rag agents

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

vLLM Project · open-source · 2026-06-01

vLLM on the DGX Spark: Architecture, Configuration, and Local Evaluation

Score 17

A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evalua...

inference serving model-release research evals

High signal Matched: serving, model, evaluation

AWS Machine Learning Blog · cloud · 2026-05-30

Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality

Score 17

This post demonstrates a comprehensive observability solution using Amazon Managed Grafana dashboards that provides a holistic view of both quality and quantity for LLMs served on Amazon SageMaker AI endpoints with inference components.

inference hardware cloud

High signal Matched: inference, gpu, sagemaker

NVIDIA Technical Blog · hardware · 2026-05-29

DynoSim: Simulating the Pareto Frontier

Score 15

Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...

inference serving model-release

High signal Matched: serving, prefill, model

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...

inference serving benchmark hardware model-release research quantization evals

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

Together AI · inference-infra · 2026-05-29

How Together AI built the world’s fastest speech-to-text stack

Score 13

Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem.

inference hardware

High signal Matched: inference, gpu

AWS Machine Learning Blog · cloud · 2026-05-29

Claude Opus 4.8 is now available on AWS

Score 11

This post covers Opus 4.8's improvements and practical guidance for AI engineers integrating the model into agentic systems and production inference workloads on Amazon Bedrock.

inference model-release cloud agents

High signal Matched: inference, model, bedrock, agentic

NVIDIA Technical Blog · hardware · 2026-05-29

Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI

Score 11

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and...

High signal Matched: generation

AMD ROCm Blogs · hardware · 2026-05-29

Enabling Speculative Speculative Decoding on MI300X

Score 29

Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...

inference speculative-decoding benchmark hardware model-release

High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model

PyTorch Foundation · open-source · 2026-05-28

Up to 580tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed

Score 17

TL;DR: The TokenSpeed inference engine achieved a record-breaking 580 tps running the Qwen3.5-397B-A17B model on GPUs. This extreme performance for agentic workloads is driven by systematic elimination of memory copies,...

inference benchmark hardware model-release agents

High signal Matched: inference, performance, gpu, model, agentic

vLLM Project · open-source · 2026-05-28

Speculators v0.5.0: DFlash Support and Online Training

Score 19

The v0.5.0 release brings significant architectural improvements to speculative decoding model training, introducing DFlash algorithm support, fully unified online training capabilities, and a...

inference speculative-decoding model-release training

High signal Matched: decoding, speculative decoding, release, introducing, model, training

vLLM Project · open-source · 2026-05-28

From Text to Multimodal Routing: Hardening Vision Signals in vLLM Semantic Router

Score 19

Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...

inference serving moe model-release api

High signal Matched: serving, endpoint, router, model

vLLM Project · open-source · 2026-05-28

Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor

Score 15

As organizations increasingly adopt AI-powered development tools, the need for high-performance agentic models that deliver both accuracy and operational efficiency has become critical. Laguna...

inference benchmark agents

High signal Matched: inference, performance, agentic

vLLM Project · open-source · 2026-05-28

Native RL APIs in vLLM

Score 11

As post-training workloads continue to scale, we've seen widespread adoption of vLLM as the inference engine of choice. However, two issues repeatedly arise:

inference training

High signal Matched: inference, training, post-training

NVIDIA Technical Blog · hardware · 2026-05-27

NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes

Score 13

The cold-start problem: In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However,...

High signal Matched: inference

NVIDIA Technical Blog · hardware · 2026-05-27

NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

Score 17

Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to...

inference hardware

High signal Matched: inference, blackwell

NVIDIA Technical Blog · hardware · 2026-05-27

What’s New for Game Developers in NVIDIA RTX: DLSS 4.5 for UE5 and Multilingual AI Characters

Score 11

NVIDIA RTX provides game developers with direct paths to AI-driven characters, frame generation, and ray-traced rendering. This post walks through a meaningful...

High signal Matched: generation

vLLM Project · open-source · 2026-05-26

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

Score 22

The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and...

inference speculative-decoding research

High signal Matched: decoding, speculative decoding, eagle, research

AMD ROCm Blogs · hardware · 2026-05-25

AI Inference on AMD Ryzen™ AI Max Processor

Score 20

Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...

inference distributed hardware model-release cloud quantization evals

High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate

Lambda · cloud · 2026-05-22

DeepSeek V4: the most expected open-source model ever released, and the quietest landing

Score 18

After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....

inference serving benchmark model-release open-source

High signal Matched: inference, serving, performance, cost, release, model, open-source

AMD ROCm Blogs · hardware · 2026-05-22

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Score 30

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...

inference serving kernel triton benchmark model-release cloud open-source

High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source

Lambda · cloud · 2026-05-21

Lambda Bare Metal Instances: full hardware control with API-driven operations

Score 8

The unit of AI compute has shifted from single hosts to rack-scale systems that integrate NVIDIA GPUs, CPUs, scale-up networking fabrics, and liquid cooling, such as the NVIDIA GB300 NVL72 and NVIDIA Vera Rubin NVL72. Teams at the frontier...

inference serving benchmark cloud training api

High signal Matched: serving, performance, cloud, training, api

Modular · inference-infra · 2026-05-21

Why LLM Inference Needs a New Kind of Router - Part 2

Score 14

High signal Matched: inference, router

LMCache · open-source · 2026-05-21

OpenAI API Is the New IPv4

Score 10

A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the...

inference serving kv-cache api

High signal Matched: serving, lmcache, api

Lambda · cloud · 2026-05-20

Lambda’s NVIDIA HGX B200 on STAC-AI™ LANG6

Score 18

What the numbers mean for financial services. Executive summary: Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...

inference benchmark hardware model-release evals

High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating

AMD ROCm Blogs · hardware · 2026-05-20

QuickReduce FP4 Quantization and Benchmarking on MI355

Score 12

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent,...

inference benchmark model-release quantization

High signal Matched: inference, latency, introducing, quantization

Together AI · inference-infra · 2026-05-19

Benchmarking inference at scale: coding agents

Score 16

Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.

inference benchmark evals agents

High signal Matched: inference, ttft, cost, benchmarks, agents

PyTorch Foundation · open-source · 2026-05-19

Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate

Score 14

TL;DR: Introducing the ExecuTorch MLX Delegate. The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework. The delegate seamlessly integrates with...

inference hardware model-release

High signal Matched: inference, gpu, introducing

Modular · inference-infra · 2026-05-18

Hippocratic AI partners with Modular to power flexible, high-quality inference for real-time patient conversations

Score 10

High signal Matched: inference

vLLM Project · open-source · 2026-05-18

vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache

Score 14

TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...

inference kv-cache

High signal Matched: inference, kv cache

Together AI · inference-infra · 2026-05-15

Together AI and Pearl Research Labs Team Up to Reduce the Cost of AI Inference

Score 24

Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.

inference serving benchmark model-release research api

High signal Matched: inference, endpoint, cost, launch, research

NVIDIA Technical Blog · hardware · 2026-05-14

How the NVIDIA Vera Rubin Platform is Solving Agentic AI’s Scale-Up Problem

Score 12

Agentic inference has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories—actions, observations,...

inference model-release agents

High signal Matched: inference, introducing, agentic

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

inference serving kv-cache moe benchmark

High signal Matched: serving, throughput, kv cache, moe

NVIDIA Technical Blog · hardware · 2026-05-12

How to Eliminate Pipeline Friction in AI Model Serving

Score 16

The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...

inference serving model-release fine-tuning

High signal Matched: serving, model, fine-tuning

Modular · inference-infra · 2026-05-12

Inkwell: Why Your Inference Platform Matters As Much As Your Model

Score 14

inference model-release

High signal Matched: inference, model

Hugging Face · open-source · 2026-05-12

Building Blocks for Foundation Model Training and Inference on AWS

Score 14

No feed summary available yet.

inference model-release training

High signal Matched: inference, model, training

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

Together AI · inference-infra · 2026-05-11

Serving DeepSeek-V4: why million-token context is an inference systems problem

Score 22

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...

inference serving kernel hardware long-context api

High signal Matched: inference, serving, endpoint, kernel, b200, long-context

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

NVIDIA Technical Blog · hardware · 2026-05-08

Improving Bash Generation in Small Language Models with Grammar-Constrained Decoding

Score 20

Bash is one of the most flexible and powerful interfaces exposed to AI agents. In the right system, a model that emits grep, curl, tar, or a shell pipeline is...

inference model-release agents

High signal Matched: decoding, generation, model, agents

Together AI · inference-infra · 2026-05-08

Deploy and inference any model from HuggingFace

Score 20

Learn how to deploy any Hugging Face model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.

inference hardware model-release

High signal Matched: inference, gpu, release, model

Modular · inference-infra · 2026-05-08

Why LLM Inference Needs a New Kind of Router - Part 1

Score 14

High signal Matched: inference, router

NVIDIA Technical Blog · hardware · 2026-05-07

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer

Score 16

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By...

inference benchmark model-release training quantization

High signal Matched: inference, performance, model, training, post-training, quantization

vLLM Project · open-source · 2026-05-06

Serving Agentic Workloads at Scale with vLLM x Mooncake

Score 18

TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...

inference serving distributed kv-cache benchmark agents

High signal Matched: serving, throughput, distributed, kv cache, agentic

Together AI · inference-infra · 2026-05-04

Foundational research powering efficient inference at scale

Score 16

As AI moves from research to production, the challenge for AI-native teams shifts from building models to running them — efficiently, reliably, and at scale.

inference research

High signal Matched: inference, research

Modal · inference-infra · 2026-05-04

Boosting multimodal inference performance by >10% with a single Python dictionary

Score 16

If we've said it once, we've said it once per millisecond: never block the GPU.

inference benchmark hardware

High signal Matched: inference, performance, gpu

NVIDIA Technical Blog · hardware · 2026-04-30

Speed Up Unreal Engine NNE Inference with NVIDIA TensorRT for RTX Runtime

Score 14

Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches...

inference benchmark

High signal Matched: inference, performance

NVIDIA Technical Blog · hardware · 2026-04-30

Build AI-Powered Games with NVIDIA DLSS 4.5, RTX, and Unreal Engine 5

Score 10

Today, game developers can begin integrating NVIDIA DLSS 4.5 with Dynamic Multi Frame Generation, Multi Frame Generation 6X, and the second-generation...

High signal Matched: generation

Nota AI · korea · 2026-04-29

[NVIDIA Nemotron Hackathon] Grand Prize Among 20 Teams: Behind Two Sleepless Days

Score 32

Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...

inference moe benchmark model-release research korea training fine-tuning quantization evals agents

High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic

Hugging Face · open-source · 2026-04-29

DeepInfra on Hugging Face Inference Providers 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

LMCache · open-source · 2026-04-29

Stop Calling It KV Cache: It’s Something Much Bigger

Score 14

For years, we have referred to one of the most critical components of modern LLM inference as a “KV cache.” That name made sense once. Today, it is increasingly misleading. What began as a small, ephemeral optimization inside a...

inference kv-cache

High signal Matched: inference, kv cache, lmcache

Modal · inference-infra · 2026-04-29

Building an RL theorem-proving workflow on Modal

Score 8

Learn how AE Studio used evolutionary algorithms on Modal to efficiently improve Lean proof generation.

High signal Matched: generation

NVIDIA Technical Blog · hardware · 2026-04-24

Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints

Score 18

DeepSeek just launched its fourth generation of flagship models with DeepSeek-V4-Pro and DeepSeek-V4-Flash, both targeted at enabling highly efficient...

inference hardware

High signal Matched: generation, gpu, blackwell

Together AI · inference-infra · 2026-04-24

Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

Score 16

Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.

inference speculative-decoding training

High signal Matched: decoding, speculative decoding, training, post-training

LMCache · open-source · 2026-04-23

LMCache on Amazon SageMaker HyperPod: Accelerating LLM Inference with Managed Tiered KV Cache

Score 30

Overview: Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context length...

inference kv-cache benchmark hardware model-release cloud

High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

vLLM Project · open-source · 2026-04-22

The State of FP8 KV-Cache and Attention Quantization in vLLM

Score 18

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

inference serving kv-cache hardware model-release quantization long-context

High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context

llm-d · open-source · 2026-04-21

Production-Grade LLM Inference at Scale with KServe, llm-d, and vLLM

Score 14

How migrating from a simple vLLM deployment to a robust MLOps platform utilizing KServe, llm-d's intelligent routing, and vLLM solved significant scaling and operational challenges in LLM deployment through deep customization and prefix-ca...

inference hardware

High signal Matched: inference, gpu

vLLM Project · open-source · 2026-04-21

Disaggregated Serving for Hybrid SSM Models in vLLM

Score 12

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...

inference serving

High signal Matched: serving

NVIDIA Technical Blog · hardware · 2026-04-20

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

Score 18

As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...

inference serving benchmark model-release training quantization

High signal Matched: generation, throughput, fp8, training

NVIDIA Technical Blog · hardware · 2026-04-17

Full-Stack Optimizations for Agentic Inference with NVIDIA Dynamo

Score 12

Coding agents are starting to write production code at scale. Stripe’s agents generate 1,300+ PRs per week. Ramp attributes 30% of merged PRs to agents....

inference agents

High signal Matched: inference, agents, agentic

LMCache · open-source · 2026-04-16

What is TurboQuant and why it matters for LLM inference, in laymen’s term

Score 16

TL;DR: TurboQuant allows you to put 4x more context in your GPU without blowing up GPU memory or dropping AI’s intelligence. It does so by quantizing the memory of large language models, also known as KV cache, an important bottleneck ment...

inference kv-cache hardware

High signal Matched: inference, kv cache, lmcache, gpu

SkyPilot · open-source · 2026-04-09

Research-Driven Agents: What Happens When Your Agent Reads Before It Codes

Score 16

Coding agents working from code alone generate shallow hypotheses. Adding a research phase — arxiv papers, competing forks, other backends — produced 5 kernel fusions that made llama.cpp CPU inference 15% faster.

inference kernel research agents

High signal Matched: inference, kernel, arxiv, research, agent, agents

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...

inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

Modal · inference-infra · 2026-04-08

Real-time inference for robots at Physical Intelligence

Score 10

How Physical Intelligence runs remote, real-time, robotic inference on Modal.

High signal Matched: inference

vLLM Project · open-source · 2026-04-07

Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation

Score 22

TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...

inference benchmark hardware

High signal Matched: inference, prefill, itl, gpu, mi300x

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

Together AI · inference-infra · 2026-04-03

Wan 2.7 video model suite now available on Together AI

Score 14

A four-model video suite for generation, continuation, reference-driven workflows, and editing, rolling out on Together AI starting with text-to-video.

inference model-release

High signal Matched: generation, model

LY Corporation Tech Blog · korea · 2026-04-02

Cloud infrastructure transformation at LY Corporation: introducing the architecture of Flava, the next-generation platform integrating two massive cl...

Score 14

Hello. I’m Inoue, and I work on private cloud infrastructure at LY Corporation. What powers LY Corpor...

inference model-release cloud

High signal Matched: generation, introducing, cloud

NVIDIA Technical Blog · hardware · 2026-04-02

Achieving Single-Digit Microsecond Latency Inference for Capital Markets

Score 8

In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use...

inference benchmark

High signal Matched: inference, latency

Together AI · inference-infra · 2026-04-02

Deepgram speech-to-text and voice models now available natively on Together AI

Score 14

Production STT and TTS from Deepgram, available on Together AI Dedicated Model Inference for real-time voice agents.

inference model-release agents

High signal Matched: inference, model, agents

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

Together AI · inference-infra · 2026-03-31

Aurora

Score 12

1.25x over a well-trained static speculator. Aurora is an open-source RL framework that turns speculative decoding from a one-time offline setup into a self-improving system that learns from every request it serves.

inference speculative-decoding open-source

High signal Matched: decoding, speculative decoding, open-source

Modal · inference-infra · 2026-03-26

Runway chooses Modal to power real-time inference for Runway Characters

Score 10

Modal is proud to power real-time inference for Runway Characters.

High signal Matched: inference

Modal · inference-infra · 2026-03-25

How Doppel eliminated ML infrastructure tax with Modal

Score 8

How Modal helped the ML team at Doppel parallelize experimentation and scale inference.

High signal Matched: inference

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

NVIDIA Technical Blog · hardware · 2026-03-23

Deploying Disaggregated LLM Inference Workloads on Kubernetes

Score 18

As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages...

inference serving model-release

High signal Matched: inference, serving, prefill, model

Nota AI · korea · 2026-03-20

GenAI Everywhere: The Future of Edge AI Optimization with the New NetsPresso®

Score 26

NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...

inference kv-cache moe benchmark model-release research korea quantization

High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization

Modular · inference-infra · 2026-03-19

Modular 26.2: State-of-the-Art Image Generation and Upgraded AI Coding with Mojo

Score 10

High signal Matched: generation

Together AI · inference-infra · 2026-03-17

Mamba-3

Score 10

Meet Mamba-3: the SSM built for inference. Faster than Transformers at decode, stronger than Mamba-2, and open-source from day one.

inference open-source

High signal Matched: inference, open-source

NVIDIA Technical Blog · hardware · 2026-03-16

How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale

Score 16

Reasoning models are growing rapidly in size and are increasingly being integrated into agentic AI workflows that interact with other models and external tools....

inference distributed agents

High signal Matched: inference, multi-node, agentic

NVIDIA Technical Blog · hardware · 2026-03-16

Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

Score 20

NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of...

inference benchmark hardware

High signal Matched: inference, latency, accelerator

Together AI · inference-infra · 2026-03-16

Together AI at NVIDIA GTC 2026: Explore our latest innovations across research and products

Score 14

Together AI arrives at NVIDIA GTC 2026 with new launches in inference, agents, voice AI, and open models — plus technical sessions from its research and engineering leaders.

inference research agents

High signal Matched: inference, research, agents

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference serving benchmark model-release research training evals long-context rag

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

NVIDIA Technical Blog · hardware · 2026-03-13

Scale Synthetic Data and Physical AI Reasoning with NVIDIA Cosmos World Foundation Models

Score 10

The next generation of AI-driven robots like humanoids and autonomous vehicles depends on high-fidelity, physics-aware training data. Without diverse and...

inference training

High signal Matched: generation, training

vLLM Project · open-source · 2026-03-13

P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

Score 26

EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...

inference speculative-decoding model-release

High signal Matched: inference, decoding, speculative decoding, eagle, model

SqueezeBits · korea · 2026-03-11

Reliable & Scalable Synthetic Data for Physical AI (Part 2): Making Cosmos 3.1 x Faster for Production

Score 12

Explore why Physical AI deployment needs synthetic data at scale with SqueezeBits' research and discover how to overcome inference bottlenecks to accelerate Roboost Agent.

inference research agents

High signal Matched: inference, research, agent

Together AI · inference-infra · 2026-03-11

Together AI Brings NVIDIA Nemotron 3 to Developers on Day 0

Score 10

NVIDIA Nemotron 3 Super is now available on Together AI Dedicated Inference, delivering efficient multi-agent reasoning, a 1M-token context window, and production-grade deployment on managed infrastructure.

inference long-context agents

High signal Matched: inference, context window, agent

Together AI · inference-infra · 2026-03-05

Key research and product announcements at the AI Native Conf

Score 18

At AI Native Conf, Together AI announced breakthroughs across kernels, RL, and inference optimization — including FlashAttention-4, ThunderAgent, and together.compile. Research that ships to production. That's the AI Native Cloud.

inference kernel research cloud

High signal Matched: inference, flashattention, research, cloud

Together AI · inference-infra · 2026-03-04

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Score 20

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...

inference serving benchmark long-context

High signal Matched: inference, serving, prefill, throughput, long-context

AIBrix · open-source · 2026-03-03

AIBrix v0.6.0 Release: Envoy Sidecar, Mixed LLM Workloads Routing, Routing Profiles, LoRA Delivery & New APIs

Score 28

🚀 AIBrix v0.6.0 Release. Today we’re excited to announce AIBrix v0.6.0, a release that expands how you deploy and route inference traffic. Key highlights include: Envoy Sidecar Support – Run Envoy alongside the gateway-plugin without...

inference model-release fine-tuning rag api

High signal Matched: inference, prefill, release, model, lora, rerank, api, openai-compatible

vLLM Project · open-source · 2026-02-27

Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm

Score 20

For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.

inference benchmark hardware

High signal Matched: inference, performance, rocm

Nota AI · korea · 2026-02-26

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Score 24

Jewon Lee | Wooksu Shin | Seungmin Yang | Ki-Ung Song | Donguk Lim | Jaeyeon Kim | Tae-Ho Kim | Bo-Kyeong Kim, EdgeFM Team, Nota AI. ✔️ Resources for more information: GitHub, ArXiv, Project Page, Demo. ✔️ Accepted at ICLR 2026. &...

inference speculative-decoding benchmark model-release research training evals

High signal Matched: inference, generation, verification, benchmark, performance, latency, cost, model, arxiv, evaluation, training, post-training, benchmarks

Together AI · inference-infra · 2026-02-19

Consistency diffusion language models: Up to 14x faster inference without sacrificing quality

Score 14

Standard diffusion language models can't use KV caching and need too many refinement steps to be practical. CDLM fixes both with a post-training recipe that enables exact block-wise KV caching and trajectory-consistent step reduction — del...

inference benchmark training

High signal Matched: inference, latency, training, post-training

Replicate · inference-infra · 2026-02-18

Recraft V4: image generation with design taste

Score 8

Recraft V4 generates art-directed images — and actual editable SVGs — with strong composition, accurate text rendering, and what the Recraft team calls "design taste." Four models are available on Replicate now.

High signal Matched: generation

Together AI · inference-infra · 2026-02-12

Introducing Dedicated Container Inference: Delivering 2.6x faster inference for custom AI models

Score 16

Together AI launches production-grade orchestration for custom AI models with 1.4x–2.6x faster inference.

inference model-release

High signal Matched: inference, introducing

llm-d · open-source · 2026-02-10

Native KV Cache Offloading to Any Filesystem with llm-d

Score 20

llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.

inference serving kv-cache benchmark hardware

High signal Matched: inference, throughput, kv cache, ttft, gpu

llm-d · open-source · 2026-02-04

llm-d 0.5: Sustaining Performance at Scale

Score 16

llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.

inference benchmark fine-tuning

High signal Matched: inference, performance, lora

vLLM Project · open-source · 2026-02-03

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Score 24

Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...

inference serving benchmark hardware

High signal Matched: serving, throughput, performance, h200, gb200, blackwell

Together AI · inference-infra · 2026-02-02

Fine-tuning open LLM judges to outperform GPT-5.2

Score 14

Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...

inference benchmark model-release fine-tuning evals open-source

High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss

vLLM Project · open-source · 2026-01-31

Streaming Requests & Realtime API in vLLM

Score 12

Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...

inference model-release api

High signal Matched: inference, model, api

Together AI · inference-infra · 2026-01-26

DSGym: A holistic framework for evaluating and training data science agents

Score 18

Introducing DSGym, a holistic evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...

inference benchmark model-release research training evals agents open-source

High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source

Together AI · inference-infra · 2026-01-22

Optimizing inference speed and costs: Lessons learned from large-scale deployments

Score 22

Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.

inference serving benchmark hardware

High signal Matched: inference, throughput, latency, cost, gpu

Google Research · big-tech · 2026-01-14

Next generation medical image interpretation with MedGemma 1.5 and medical speech to text with MedASR

Score 8

Generative AI

High signal Matched: generation

Together AI · inference-infra · 2026-01-13

Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale

Score 24

Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...

inference benchmark hardware model-release quantization agents

High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents

vLLM Project · open-source · 2026-01-08

Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput

Score 18

In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...

inference serving kv-cache benchmark

High signal Matched: inference, throughput, kv cache

SqueezeBits · korea · 2026-01-07

Intel® Gaudi® Hands-on Workshop | A Recap of the Gaudi Workshop with SqueezeBits x Lablup

Score 12

A recap of the Intel® Gaudi® hands-on workshop co-hosted by SqueezeBits and Lablup. AI model compression, fine-tuning, and vLLM serving on Gaudi® hardware with Backend.AI.

inference serving model-release fine-tuning

High signal Matched: serving, model, fine-tuning

SqueezeBits · korea · 2025-12-24

Introducing rebellions ATOM™-MAX

Score 24

Introducing ATOM™-Max, rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...

inference serving benchmark hardware model-release korea

High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-17

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

Score 16

In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...

inference serving hardware

High signal Matched: serving, h200

vLLM Project · open-source · 2025-12-15

Encoder Disaggregation for Scalable Multimodal Model Serving

Score 18

Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...

inference serving model-release

High signal Matched: serving, generation, model

vLLM Project · open-source · 2025-12-13

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Score 26

Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...

inference serving moe benchmark model-release

High signal Matched: serving, prefill, router, performance, model

vLLM Project · open-source · 2025-12-13

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Score 24

Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...

inference speculative-decoding benchmark model-release training

High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training

SkyPilot · open-source · 2025-12-11

SkyPilot 0.11: Multi-Cloud Pools for Batch Inference, Fast Managed Jobs, Enterprise-Ready at Scale, Programmability

Score 14

Announcing SkyPilot 0.11 with Pools for batch inference, faster managed jobs, and enterprise-scale improvements.

inference cloud

High signal Matched: inference, cloud

vLLM Project · open-source · 2025-12-09

Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor

Score 10

Achieve faster, more efficient LLM serving without sacrificing accuracy!

inference serving quantization

High signal Matched: serving, quantization

Together AI · inference-infra · 2025-12-03

Introducing AutoJudge: Streamlined inference acceleration via automated dataset curation

Score 20

AutoJudge accelerates LLM inference by identifying which token mismatches actually matter. Using self-supervised learning to train a lightweight classifier, it accepts up to 40 draft tokens per cycle—delivering 1.5–2× speedups over standar...

inference speculative-decoding model-release

High signal Matched: inference, decoding, speculative decoding, introducing

SkyPilot · open-source · 2025-12-02

Batch Inference for Documents with DeepSeek-OCR using a Pool of Workers on any Clouds

Score 10

Scale document OCR batch inference for RAG on multiple clouds and Kubernetes clusters using SkyPilot Pool.

High signal Matched: inference, rag

llm-d · open-source · 2025-12-02

llm-d 0.4: Achieve SOTA Performance Across Accelerators

Score 30

llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.

inference kv-cache speculative-decoding moe benchmark hardware frontier-model

High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota

Together AI · inference-infra · 2025-12-01

Together AI delivers fastest inference for the top open-source models

Score 20

Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...

inference speculative-decoding hardware quantization evals open-source

High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source

vLLM Project · open-source · 2025-11-30

Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving

Score 20

We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

inference serving model-release

High signal Matched: serving, generation, release, model

AIBrix · open-source · 2025-11-26

PrisKV: A Colocated Tiered KVCache Store for LLM Serving

Score 22

In recent years, large language models (LLMs) such as GPT, DeepSeek, Doubao and Qwen have advanced rapidly and are reshaping a wide range of industries. As the Scaling Law continues to be validated and pushed to its limits, LLM capabilitie...

inference serving benchmark

High signal Matched: inference, serving, generation, throughput, performance, latency, cost

Together AI · inference-infra · 2025-11-25

FLUX.2: Multi-reference image generation now available on Together AI

Score 12

Production-grade image generation with multi-reference consistency, exact brand colors, and reliable text rendering. FLUX.2 from Black Forest Labs, now on Together AI's platform.

High signal Matched: generation

Hugging Face · open-source · 2025-11-25

OVHcloud on Hugging Face Inference Providers 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

vLLM Project · open-source · 2025-11-22

Streamlined multi-node serving with Ray symmetric-run

Score 18

Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...

inference serving distributed model-release

High signal Matched: serving, multi-node, launch

Modular · inference-infra · 2025-11-20

Modular 25.7: Faster Inference, Safer GPU Programming, and a More Unified Developer Experience

Score 14

inference hardware

High signal Matched: inference, gpu

Modal · inference-infra · 2025-11-18

Host overhead is killing your inference efficiency

Score 12

Never block the GPU.

inference hardware

High signal Matched: inference, gpu

Modal · inference-infra · 2025-11-13

How Decagon shipped real-time voice AI on Modal

Score 10

How Decagon and Modal made real-time voice AI possible, combining fine-tuned small models with a re-engineered inference runtime for sub-second latency.

inference benchmark

High signal Matched: inference, latency

AIBrix · open-source · 2025-11-10

AIBrix v0.5.0 Release: Batch API, KVCache v1 Connector, and Enhanced P/D orchestration

Score 22

🚀 AIBrix v0.5.0 Release. Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for hi...

inference benchmark model-release research evals api

High signal Matched: prefill, latency, release, evaluation, api, openai-compatible

Together AI · inference-infra · 2025-11-04

Announcing the fastest inference for realtime voice AI agents

Score 14

Together AI launches the fastest voice AI stack: streaming Whisper STT, serverless open-source TTS (Orpheus & Kokoro), and Voxtral transcription. Sub-second latency for production voice agents.

inference benchmark agents open-source

High signal Matched: inference, latency, agents, open-source

SqueezeBits · korea · 2025-10-31

Winning both speed and quality: How Yetter deals with diffusion models

Score 16

Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable,...

inference benchmark model-release

High signal Matched: inference, generation, latency, model

SqueezeBits · korea · 2025-10-28

[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Score 20

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.

inference serving kernel benchmark hardware training

High signal Matched: inference, serving, gemm, performance, h100, training

SkyPilot · open-source · 2025-10-21

Why AWS Batch Doesn't Work for Modern AI Workloads: A Technical Comparison with SkyPilot

Score 10

AWS Batch works well for traditional enterprise batch processing (see their case studies 1 and 2). But AI workloads have different requirements - they’re more interactive, need flexible GPU access, and benefit from simpler iteration...

inference hardware

High signal Matched: inference, gpu

Together AI · inference-infra · 2025-10-21

Expanding Together AI Model Library into multimedia generation with 40+ new image and video models

Score 16

Together AI adds 40+ image & video models, including Sora 2 and Veo 3, to build end-to-end multimodal apps with unified OpenAI-compatible APIs and transparent pricing.

inference model-release api

High signal Matched: generation, model, openai-compatible

Google Research · big-tech · 2025-10-21

A picture's worth a thousand (private) words: Hierarchical generation of coherent synthetic photo albums

Score 8

Generative AI

High signal Matched: generation

Together AI · inference-infra · 2025-10-10

AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators

Score 20

LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.

inference moe benchmark hardware

High signal Matched: inference, deepseek-v3, performance, accelerator

llm-d · open-source · 2025-10-10

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

Score 20

llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.

inference benchmark hardware

High signal Matched: inference, latency, tokens/sec, gpu, tpu

Google Research · big-tech · 2025-10-03

A collaborative approach to image generation

Score 8

Generative AI

High signal Matched: generation

SqueezeBits · korea · 2025-10-02

Yetter, the GenAI API service: AI Optimization, Out of the Box

Score 14

Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimization inference engine, it delivers reliable image, video, and future LLM services at a fraction of the cost.

inference benchmark api

High signal Matched: inference, cost, api

llm-d · open-source · 2025-09-24

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

Score 18

See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.

inference serving distributed benchmark evals

High signal Matched: inference, throughput, distributed, benchmarks

Hugging Face · open-source · 2025-09-19

Scaleway on Hugging Face Inference Providers 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2025-09-17

Public AI on Hugging Face Inference Providers 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

SqueezeBits · korea · 2025-09-16

Guided Decoding Performance on vLLM and SGLang

Score 16

The guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output based on your use case.

inference benchmark

High signal Matched: decoding, benchmark, performance

Together AI · inference-infra · 2025-09-15

Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase

Score 18

Our new Batch Inference API makes large-scale AI workloads simpler, faster, and cheaper. With a streamlined UI, universal model support, and 3000× higher rate limits—now up to 30B tokens—you can process massive datasets at half the cost of...

inference benchmark model-release api

High signal Matched: inference, cost, model, api

Google Research · big-tech · 2025-09-12

Speculative cascades — A hybrid approach for smarter, faster LLM inference

Score 8

Generative AI

High signal Matched: inference

Together AI · inference-infra · 2025-09-09

Announcing General Availability of Together Instant Clusters, offering ready to use, self-service NVIDIA GPUs

Score 18

Together AI launches Instant Clusters: self-service GPU clusters with NVIDIA H100/B200, ready in minutes for training or inference at any scale.

inference hardware training

High signal Matched: inference, gpu, h100, b200, training

Replicate · inference-infra · 2025-09-08

Torch compile caching for inference speed

Score 8

Cache your compiled models for faster boot and inference times

High signal Matched: inference

llm-d · open-source · 2025-09-03

Intelligent Inference Scheduling with llm-d

Score 16

Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.

inference serving benchmark

High signal Matched: inference, throughput, latency

SqueezeBits · korea · 2025-08-26

Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode

Score 22

This article shows how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.

inference hardware

High signal Matched: inference, prefill, gpu, npu

Together AI · inference-infra · 2025-08-21

How Together AI Uses AI Agents to Automate Complex Engineering Tasks: Lessons from Developing Efficient LLM Inference Systems

Score 16

Build AI agents for complex, long-running engineering tasks. Learn key patterns from a case study: accelerating LLM inference with speculative decoding.

inference speculative-decoding agents

High signal Matched: inference, decoding, speculative decoding, agents

AIBrix · open-source · 2025-08-05

AIBrix v0.4.0 Release: P/D Disaggregation and Expert Parallelism Support, KVCache v1 Connector, KV Event Synchronization & Multi‑Engine Support

Score 20

AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...

inference serving benchmark hardware model-release cloud

High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud

Modular · inference-infra · 2025-08-05

Modular Platform 25.5: Introducing Large Scale Batch Inference

Score 14

inference model-release

High signal Matched: inference, introducing

SqueezeBits · korea · 2025-08-04

Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration

Score 10

Trimming large multilingual vocabularies in Small Language Models (SLM) is a simple, low-risk way to boost efficiency to its limit. It accelerates the model inference significantly while keeping accuracy almost unchanged.

inference model-release

High signal Matched: inference, model

Modular · inference-infra · 2025-07-31

SF Compute and Modular Partner to Revolutionize AI Inference Economics

Score 10

High signal Matched: inference

Hugging Face · open-source · 2025-07-23

Fast LoRA inference for Flux with Diffusers and PEFT

Score 10

No feed summary available yet.

inference fine-tuning

High signal Matched: inference, lora

Together AI · inference-infra · 2025-07-17

Together AI Delivers Top Speeds for DeepSeek-R1-0528 Inference on NVIDIA Blackwell

Score 18

Together AI inference is now among the world’s fastest, most capable platforms for running open-source reasoning models like DeepSeek-R1 at scale, thanks to our new inference engine designed for NVIDIA HGX B200.

inference hardware open-source

High signal Matched: inference, b200, blackwell, open-source

Nota AI · korea · 2025-07-10

Video Self-Distillation for Single-Image Encoders: Learning Temporal Priors from Unlabeled Video

Score 20

Marcel Simon, Ph.D., ML Researcher, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI; Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH. Summary: Proposes a simple next-frame prediction task using unlabeled video to enhance sing...

inference benchmark model-release research training fine-tuning evals

High signal Matched: inference, performance, model, paper, research, training, fine-tuning, benchmarks

Hugging Face · open-source · 2025-07-10

Asynchronous Robot Inference: Decoupling Action Prediction and Execution

Score 10

No feed summary available yet.

High signal Matched: inference

Modal · inference-infra · 2025-07-02

How we used evals and inference-time compute scaling to generate beautiful QR codes that actually work

Score 10

There's only one playbook for improving generative applications. Read about it here.

inference evals

High signal Matched: inference, evals

BAIR · research · 2025-07-01

Whole-Body Conditioned Egocentric Video Prediction

Score 10

inference benchmark model-release research training evals agents

High signal Matched: inference, generation, performance, model, paper, arxiv, evaluation, training, evaluate, agent, agents

llm-d · open-source · 2025-06-25

llm-d Community Update - June 2025

Score 10

Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.

inference serving

High signal Matched: serving

Hugging Face · open-source · 2025-06-16

Groq on Hugging Face Inference Providers 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2025-06-12

Featherless AI on Hugging Face Inference Providers 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2025-06-04

Real-Time AI Sound Generation on Arm: A Personal Tool for Creative Freedom

Score 10

No feed summary available yet.

High signal Matched: generation

llm-d · open-source · 2025-06-03

llm-d Week 1 Project News Round-Up

Score 12

llm-d hits 1000 GitHub stars! Week 1-2 round-up covers KVTransfer Protocol, InferenceModel API updates, and community resources for LLM inference developers.

High signal Matched: inference, api

Replicate · inference-infra · 2025-05-22

Generate incredible images with Google's Imagen 4

Score 8

Google's flagship image generation model, Imagen 4, is now available for you to try on Replicate. Create images with fine detail, versatile styles, and improved typography.

inference model-release

High signal Matched: generation, model

AIBrix · open-source · 2025-05-22

AIBrix v0.3.0 Release: KVCache Offloading, Prefix Cache, Fairness Routing, and Benchmarking Tools

Score 24

AIBrix is a composable, cloud-native AI infrastructure toolkit designed to power scalable and cost-effective large language model (LLM) inference. As production demands for memory-efficient and latency-aware LLM services continue to grow,...

inference kv-cache benchmark model-release cloud

High signal Matched: inference, prefix cache, latency, cost, release, model, cloud

llm-d · open-source · 2025-05-20

Announcing the llm-d community!

Score 20

Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.

inference serving distributed benchmark model-release frontier-model

High signal Matched: inference, serving, distributed, performance, introducing, sota

llm-d · open-source · 2025-05-20

llm-d Press Release

Score 20

Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.

inference distributed model-release cloud open-source

High signal Matched: inference, distributed, release, cloud, open source

Hugging Face · open-source · 2025-05-13

Blazingly fast whisper transcriptions with Inference Endpoints

Score 10

No feed summary available yet.

High signal Matched: inference

Together AI · inference-infra · 2025-05-12

Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

Score 16

No feed summary available yet.

inference speculative-decoding

High signal Matched: decoding, speculative decoding

Nota AI · korea · 2025-05-07

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Score 28

Jewon Lee | Ki-Ung Song | Seungmin Yang | Donguk Lim | Jaeyeon Kim | Wooksu Shin | Bo-Kyeong Kim | Tae-Ho Kim (EdgeFM Team, Nota AI); Yong Jae Lee, Ph.D., Associate Professor, UW-Madison. Summary: Our method, Trimmed-Llama, reduces t...

inference kv-cache benchmark model-release research training evals open-source

High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source

Together AI · inference-infra · 2025-05-05

From AWS to Together Dedicated Endpoints: Arcee AI's journey to greater inference flexibility

Score 12

No feed summary available yet.

High signal Matched: inference

Modal · inference-infra · 2025-04-24

How Lemon Slice built real-time generative video with Modal and Daily

Score 8

Modal + Daily + Pipecat is the best-in-class infra stack for real-time inference pipelines.

High signal Matched: inference

Hugging Face · open-source · 2025-04-16

Prefill and Decode for Concurrent Requests - Optimizing LLM Performance

Score 14

No feed summary available yet.

inference benchmark

High signal Matched: prefill, performance

Hugging Face · open-source · 2025-04-16

Cohere on Hugging Face Inference Providers 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

BAIR · research · 2025-04-08

Repurposing Protein Folding Models for Generation with Latent Diffusion

Score 20

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment...

inference benchmark model-release research training rag

High signal Matched: inference, generation, cost, model, weights, research, training, retrieval

Nota AI · korea · 2025-04-08

UniForm: A Reuse Attention Mechanism for Efficient Transformers on Resource-Constrained Edge Devices

Score 24

Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...

inference kernel benchmark model-release research evals

High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks

SqueezeBits · korea · 2025-04-02

[Intel Gaudi] #5. FLUX.1 on Gaudi-2

Score 8

This article discusses inference efficiency when running the FLUX.1 models on Intel Gaudi-2 hardware.

High signal Matched: inference

Hugging Face · open-source · 2025-03-28

🚀 Accelerating LLM Inference with TGI on Intel Gaudi

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2025-03-21

The New and Fresh analytics in Inference Endpoints

Score 10

No feed summary available yet.

High signal Matched: inference

SkyPilot · open-source · 2025-03-20

Large-Scale AI Batch Inference: 9x Faster Embedding Generation

Score 16

How to accelerate distributed embedding generation? Use the "forgotten" regions.

inference distributed

High signal Matched: inference, generation, distributed

AIBrix · open-source · 2025-03-10

DeepSeek-R1 671B multi-host Deployment in AIBrix

Score 20

This blog post shows how to deploy DeepSeek-R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through a step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...

inference distributed benchmark model-release training long-context

High signal Matched: inference, distributed, benchmark, model, weights, training, context length

Hugging Face · open-source · 2025-03-07

LLM Inference on Edge: A Fun and Easy Guide to run LLMs via React Native on your Phone!

Score 10

No feed summary available yet.

High signal Matched: inference

Replicate · inference-infra · 2025-03-05

Wan2.1: generate videos with an API

Score 10

Wan2.1 is the most capable open-source video generation model, producing coherent and high-quality outputs. Learn how to run it in the cloud with a single line of code.

inference model-release cloud api open-source

High signal Matched: generation, model, cloud, api, open-source

SkyPilot · open-source · 2025-02-26

Using DeepSeek R1 for RAG: Do's and Don'ts

Score 10

DeepSeek R1 showed great reasoning capability when it was first released. In this blog post, we detail what we learned from using DeepSeek R1 to build a Retrieval-Augmented Generation (RAG) system tailored for legal documents. We choose l...

inference research rag

High signal Matched: generation, research, rag, retrieval-augmented generation, retrieval

Nota AI · korea · 2025-02-25

A Study on Detecting LLM-Generated Multilingual Content

Score 18

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI; Jaeyeon Kim, AI Research Engineer, Nota AI. Summary: In this study, we propose a method for determining whether given multilingual...

inference benchmark model-release research training fine-tuning

High signal Matched: generation, performance, model, paper, research, training, fine-tuning

Hugging Face · open-source · 2025-02-24

Remote VAEs for decoding with Inference Endpoints 🤗

Score 14

No feed summary available yet.

High signal Matched: inference, decoding

AIBrix · open-source · 2025-02-21

Introducing AIBrix: Cost-Effective and Scalable Control Plane for vLLM

Score 26

Open-source large language models (LLMs) such as LLaMA, DeepSeek, Qwen, and Mistral have surged in popularity, offering enterprises greater flexibility, cost savings, and control over their AI deployments. These models have empowered organ...

inference benchmark model-release agents open-source

High signal Matched: inference, generation, latency, cost, introducing, model, agents, open-source

AIBrix · open-source · 2025-02-19

AIBrix v0.2.0 Release: Distributed KV Cache, Orchestration and Heterogeneous GPU Support

Score 42

We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...

inference serving distributed kv-cache benchmark hardware model-release agents

High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent

Hugging Face · open-source · 2025-02-18

Introducing Three New Serverless Inference Providers: Hyperbolic, Nebius AI Studio, and Novita 🔥

Score 14

No feed summary available yet.

inference model-release

High signal Matched: inference, introducing

Hugging Face · open-source · 2025-02-12

Build awesome datasets for video generation

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2025-01-28

Welcome to Inference Providers on the Hub 🔥

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2025-01-27

State of open video generation models in Diffusers

Score 10

No feed summary available yet.

High signal Matched: generation

SqueezeBits · korea · 2025-01-20

[vLLM vs TensorRT-LLM] #13. Vision-Language Models

Score 8

This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.

inference serving

High signal Matched: serving

Hugging Face · open-source · 2025-01-16

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Score 18

No feed summary available yet.

inference model-release

High signal Matched: inference, generation, introducing

Hugging Face · open-source · 2024-12-23

Controlling Language Model Generation with NVIDIA's LogitsProcessorZoo

Score 14

No feed summary available yet.

inference model-release

High signal Matched: generation, model

Hugging Face · open-source · 2024-12-18

Bamba: Inference-Efficient Hybrid Mamba2 Model

Score 14

No feed summary available yet.

inference model-release

High signal Matched: inference, model

SqueezeBits · korea · 2024-12-09

[vLLM vs TensorRT-LLM] #11. Speculative Decoding

Score 14

This article provides a comparative analysis of speculative decoding.

inference speculative-decoding

High signal Matched: decoding, speculative decoding

Hugging Face · open-source · 2024-12-09

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Score 10

No feed summary available yet.

High signal Matched: generation

SqueezeBits · korea · 2024-12-05

[vLLM vs TensorRT-LLM] #10 Serving Multiple LoRAs at Once

Score 14

This article provides a comparative analysis of multi-LoRA serving capabilities of vLLM and TensorRT-LLM frameworks.

inference serving fine-tuning

High signal Matched: serving, lora

Hugging Face · open-source · 2024-11-20

Faster Text Generation with Self-Speculative Decoding

Score 18

No feed summary available yet.

inference speculative-decoding

High signal Matched: decoding, generation, speculative decoding

AIBrix · open-source · 2024-11-13

Introducing AIBrix v0.1.0: Building the Future of Scalable, Cost-Effective AI Infrastructure for Large Models

Score 32

In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...

inference kv-cache benchmark hardware model-release cloud open-source

High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source

Hugging Face · open-source · 2024-10-29

Universal Assisted Generation: Faster Decoding with Any Assistant Model

Score 18

No feed summary available yet.

inference model-release

High signal Matched: decoding, generation, model

Hugging Face · open-source · 2024-10-22

Releasing Outlines-core 0.1.0: structured generation in Rust and Python

Score 10

No feed summary available yet.

High signal Matched: generation

SqueezeBits · korea · 2024-10-11

[vLLM vs TensorRT-LLM] #2. Towards Optimal Batching for LLM Serving

Score 10

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.

inference serving

High signal Matched: serving

Hugging Face · open-source · 2024-10-08

Faster Assisted Generation with Dynamic Speculation

Score 10

No feed summary available yet.

High signal Matched: generation

Replicate · inference-infra · 2024-10-03

FLUX1.1 [pro] is here

Score 10

Black Forest Labs continue to push boundaries with their latest release of the FLUX.1 image generation model.

inference model-release

High signal Matched: generation, release, model

SqueezeBits · korea · 2024-10-01

[vLLM vs TensorRT-LLM] #1. An Overall Evaluation

Score 22

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...

inference serving benchmark research evals

High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating

Modal · inference-infra · 2024-09-16

Boost your throughput with dynamic batching

Score 14

Learn how we used our new dynamic batching feature to improve throughput and reduce inference costs for the Whisper model with a single line of code!

inference serving benchmark model-release

High signal Matched: inference, throughput, model

Nota AI · korea · 2024-08-02

Deploying an Efficient Vision-Language Model on Mobile Devices

Score 38

Jaeyeon Kim, Research Engineer, Nota AI; Geonmin Kim, Research Engineer, Nota AI; Hancheol Park, Team Lead of NetsPresso Application, Nota AI. Introduction: Recent large language models (LLMs) have demonstrated unprecedented performance...

inference benchmark model-release research cloud training fine-tuning evals open-source

High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source

Hugging Face · open-source · 2024-07-29

Serverless Inference with Hugging Face and NVIDIA NIM

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2024-06-18

BigCodeBench: The Next Generation of HumanEval

Score 10

No feed summary available yet.

High signal Matched: generation

Replicate · inference-infra · 2024-06-14

Push a custom version of Stable Diffusion 3

Score 8

Create your own custom version of Stability's latest image generation model and run it on Replicate via the web or API.

inference model-release api

High signal Matched: generation, model, api

Hugging Face · open-source · 2024-06-04

Faster assisted generation support for Intel Gaudi

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2024-05-29

Benchmarking Text Generation Inference

Score 14

No feed summary available yet.

High signal Matched: inference, generation

Hugging Face · open-source · 2024-05-16

Unlocking Longer Generation with Key-Value Cache Quantization

Score 10

No feed summary available yet.

inference quantization

High signal Matched: generation, quantization

Hugging Face · open-source · 2024-05-01

Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints

Score 18

No feed summary available yet.

inference speculative-decoding

High signal Matched: inference, decoding, speculative decoding

Hugging Face · open-source · 2024-04-29

StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2024-04-03

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2024-04-02

Bringing serverless GPU inference to Hugging Face users

Score 14

No feed summary available yet.

inference hardware

High signal Matched: inference, gpu

Hugging Face · open-source · 2024-02-29

Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator

Score 14

No feed summary available yet.

inference hardware

High signal Matched: generation, accelerator

Modal · inference-infra · 2024-02-21

How Suno shaved 4 months off their launch timeline with Modal

Score 12

Find out how Suno uses Modal to scale inference and batch pre-processing to thousands of GPUs.

inference model-release

High signal Matched: inference, launch

SkyPilot · open-source · 2024-02-20

Introducing SkyServe: 50% Cheaper AI Serving on Any Cloud with High Availability

Score 20

SkyServe: A simple, cost-efficient, multi-region/cloud library for serving GenAI models.

inference serving benchmark model-release cloud

High signal Matched: serving, cost, introducing, cloud

Hugging Face · open-source · 2024-02-01

Hugging Face Text Generation Inference available for AWS Inferentia2

Score 14

No feed summary available yet.

High signal Matched: inference, generation

Hugging Face · open-source · 2024-01-30

Accelerate StarCoder with 🤗 Optimum Intel on Xeon: Q8/Q4 and Speculative Decoding

Score 14

No feed summary available yet.

inference speculative-decoding

High signal Matched: decoding, speculative decoding

Replicate · inference-infra · 2024-01-30

Run Code Llama 70B with an API

Score 8

Code Llama 70B is one of the most powerful open-source code generation models. Learn how to run it in the cloud with one line of code.

inference cloud api open-source

High signal Matched: generation, cloud, api, open-source

Hugging Face · open-source · 2024-01-15

Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2024-01-04

Welcome aMUSEd: Efficient Text-to-Image Generation

Score 10

No feed summary available yet.

High signal Matched: generation

SkyPilot · open-source · 2023-12-21

Scaling Mixtral LLM Serving with High GPU Availability and Cost Efficiency

Score 24

A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.

inference serving moe benchmark hardware model-release

High signal Matched: serving, mixtral, cost, gpu, model

Hugging Face · open-source · 2023-12-20

Speculative Decoding for 2x Faster Whisper Inference

Score 18

No feed summary available yet.

inference speculative-decoding

High signal Matched: inference, decoding, speculative decoding

Hugging Face · open-source · 2023-12-05

Optimum-NVIDIA Unlocking blazingly fast LLM inference in just 1 line of code

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2023-12-05

Goodbye cold boot - how we made LoRA Inference 300% faster

Score 10

No feed summary available yet.

inference fine-tuning

High signal Matched: inference, lora

Hugging Face · open-source · 2023-11-07

Make your llama generation time fly with AWS Inferentia2

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2023-10-24

Deploy Embedding Models with Hugging Face Inference Endpoints

Score 10

No feed summary available yet.

High signal Matched: inference

Replicate · inference-infra · 2023-10-17

How to use retrieval augmented generation with ChromaDB and Mistral

Score 10

In this post we'll explore the basics of retrieval augmented generation by creating an example app that uses bge-large-en for embeddings, ChromaDB for vector store, and mistral-7b-instruct for language model generation.

inference model-release rag

High signal Matched: generation, model, retrieval augmented generation, retrieval

Hugging Face · open-source · 2023-10-03

🧨 Accelerating Stable Diffusion XL Inference with JAX on Cloud TPU v5e

Score 18

No feed summary available yet.

inference hardware cloud

High signal Matched: inference, tpu, cloud

Hugging Face · open-source · 2023-10-02

Deploying the AI Comic Factory using the Inference API

Score 10

No feed summary available yet.

High signal Matched: inference, api

Hugging Face · open-source · 2023-09-22

Inference for PROs

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2023-09-13

Introducing Würstchen: Fast Diffusion for Image Generation

Score 14

No feed summary available yet.

inference model-release

High signal Matched: generation, introducing

Hugging Face · open-source · 2023-09-08

Efficient Controllable Generation for SDXL with T2I-Adapters

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2023-08-04

Deploy MusicGen in no time with Inference Endpoints

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2023-08-01

Practical 3D Asset Generation: A Step-by-Step Guide

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2023-07-17

Open-Source Text Generation & LLM Ecosystem at Hugging Face

Score 10

No feed summary available yet.

inference open-source

High signal Matched: generation, open-source

Hugging Face · open-source · 2023-07-04

Deploy LLMs with Hugging Face Inference Endpoints

Score 10

No feed summary available yet.

High signal Matched: inference

SkyPilot · open-source · 2023-06-29

Serving LLM 24x Faster On the Cloud with vLLM and SkyPilot

Score 14

SkyPilot makes the deployment and development of vLLM easy and fast on clouds.

inference serving cloud

High signal Matched: serving, cloud

Hugging Face · open-source · 2023-05-31

Introducing the Hugging Face LLM Inference Container for Amazon SageMaker

Score 18

No feed summary available yet.

inference model-release cloud

High signal Matched: inference, introducing, sagemaker

Hugging Face · open-source · 2023-05-23

Hugging Face and IBM partner on watsonx.ai, the next-generation enterprise studio for AI builders

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2023-05-11

Assisted Generation: a new direction toward low-latency text generation

Score 14

No feed summary available yet.

inference benchmark

High signal Matched: generation, latency

Hugging Face · open-source · 2023-03-28

Fast Inference on Large Language Models: BLOOMZ on Habana Gaudi2 Accelerator

Score 14

No feed summary available yet.

inference hardware

High signal Matched: inference, accelerator

Hugging Face · open-source · 2023-03-28

Accelerating Stable Diffusion Inference on Intel CPUs

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2023-02-15

Zero-shot image-to-text generation with BLIP-2

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2023-02-15

Why we’re switching to Hugging Face Inference Endpoints, and maybe you should too

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2023-01-26

2D Asset Generation: AI for Game Development #4

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2023-01-20

3D Asset Generation: AI for Game Development #3

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2022-12-14

Faster Training and Inference: Habana Gaudi®2 vs Nvidia A100 80GB

Score 10

No feed summary available yet.

inference training

High signal Matched: inference, training

Hugging Face · open-source · 2022-11-21

An overview of inference solutions on Hugging Face

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2022-10-14

Getting Started with Hugging Face Inference Endpoints

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2022-10-12

Optimization story: Bloom inference

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2022-09-16

Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2022-08-11

Deploying 🤗 ViT on Kubernetes with TF Serving

Score 10

No feed summary available yet.

inference serving

High signal Matched: serving

Hugging Face · open-source · 2022-07-27

Faster Text Generation with TensorFlow and XLA

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2022-07-25

Deploying TensorFlow Vision Models in Hugging Face with TF Serving

Score 10

No feed summary available yet.

inference serving

High signal Matched: serving

Hugging Face · open-source · 2022-05-10

Accelerated Inference with Optimum and Transformers Pipelines

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2022-03-16

Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2022-03-11

Guiding Text Generation with Constrained Beam Search in 🤗 Transformers

Score 10

No feed summary available yet.

High signal Matched: generation

Hugging Face · open-source · 2022-01-11

Deploy GPT-J 6B for inference using Hugging Face Transformers and Amazon SageMaker

Score 14

No feed summary available yet.

inference cloud

High signal Matched: inference, sagemaker

Hugging Face · open-source · 2021-11-04

Scaling up BERT-like model Inference on modern CPU - Part 2

Score 14

No feed summary available yet.

inference model-release

High signal Matched: inference, model

Hugging Face · open-source · 2021-06-03

Few-shot learning in practice: GPT-Neo and the 🤗 Accelerated Inference API

Score 10

No feed summary available yet.

High signal Matched: inference, api

Hugging Face · open-source · 2021-04-20

Scaling-up BERT Inference on CPU (Part 1)

Score 10

No feed summary available yet.

High signal Matched: inference

Hugging Face · open-source · 2021-02-10

Retrieval Augmented Generation with Huggingface Transformers and Ray

Score 10

No feed summary available yet.

High signal Matched: generation, retrieval augmented generation, retrieval

Hugging Face · open-source · 2021-01-18

How we sped up transformer inference 100x for 🤗 API customers

Score 10

No feed summary available yet.

High signal Matched: inference, api

Hugging Face · open-source · 2020-03-01

How to generate text: using different decoding methods for language generation with Transformers

Score 14

No feed summary available yet.

High signal Matched: decoding, generation

Replicate · inference-infra · 2026-02-24

How to prompt Seedream 5.0

Score 6

Seedream 5.0 brings multi-step reasoning, example-based editing, and deep domain knowledge to image generation. Here's what you should know.

Watchlist Matched: generation

Replicate · inference-infra · 2025-11-25

Run FLUX.2 on Replicate

Score 6

FLUX.2 brings professional-grade image generation and editing with unprecedented detail, multi-reference support, and enterprise efficiency.

Watchlist Matched: generation

Replicate · inference-infra · 2025-11-20

How to prompt Nano Banana Pro

Score 6

Nano Banana Pro brings powerful new capabilities in image generation and editing. Here are the main prompt tricks you should know.

Watchlist Matched: generation

Replicate · inference-infra · 2025-10-16

How to prompt Veo 3.1

Score 6

Google's Veo 3.1 brings powerful new video generation capabilities including reference images, first/last frame control, and enhanced image-to-video. Here's everything you need to know.

Watchlist Matched: generation

Replicate · inference-infra · 2025-07-17

Bria is now on Replicate

Score 6

We've partnered with Bria to bring a suite of commercial-grade image generation and editing models to Replicate. Built entirely on licensed data, Bria’s tools are designed for enterprises and developers building safely with visual AI.

Watchlist Matched: generation

Replicate · inference-infra · 2025-05-15

Run 30,000+ LoRAs on Hugging Face with Replicate

Score 6

We've partnered with Hugging Face to bring Replicate inference to their platform.

Watchlist Matched: inference

Replicate · inference-infra · 2024-11-21

FLUX.1 Tools – Control and steerability for FLUX

Score 6

A new set of image generation capabilities for FLUX models, including inpainting, outpainting, canny edge detection, and depth maps.

Watchlist Matched: generation

Replicate · inference-infra · 2024-08-15

Fine-tune FLUX.1 with your own images

Score 6

We've added fine-tuning (LoRA) support to FLUX.1 image generation models. You can train FLUX.1 on your own images with one line of code using Replicate's API.

inference fine-tuning api

Watchlist Matched: generation, fine-tuning, lora, api

Replicate · inference-infra · 2024-07-12

Replicate Intelligence #7

Score 6

Data curation, data generation, data data data

Watchlist Matched: generation

Replicate · inference-infra · 2024-06-07

Replicate Intelligence #3

Score 6

Garden State Llama, applied LLMs guide, real-time image generation

Watchlist Matched: generation

Replicate · inference-infra · 2024-05-31

Replicate Intelligence #2

Score 6

Faster image generation, AI-powered world simulator, insights on AI dataset complexity

Watchlist Matched: generation