moe - MLSys Blogs

Fireworks AI · inference-infra · 2026-06-03

Training-Inference Parity in MoE Models: Where Numerics Drift

Score 19

inference moe

High signal Matched: inference, moe

vLLM Project · open-source · 2026-06-02

Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon LLM Agents

Score 15

Long-horizon LLM agents create a routing problem that single-turn prompt routers were not designed to solve. A router still needs to know which model is best for the current request, but it also...

moe model-release agents

High signal Matched: router, model, agents, agentic

vLLM Project · open-source · 2026-05-28

From Text to Multimodal Routing: Hardening Vision Signals in vLLM Semantic Router

Score 19

Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...

inference serving moe model-release api

High signal Matched: serving, endpoint, router, model

Modular · inference-infra · 2026-05-21

Why LLM Inference Needs a New Kind of Router - Part 2

Score 14

inference moe

High signal Matched: inference, router

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

inference serving kv-cache moe benchmark

High signal Matched: serving, throughput, kv cache, moe

LMCache · open-source · 2026-05-13

Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X

Score 20

A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters. Key Summary: We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...

kv-cache moe hardware model-release quantization agents

High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

Modular · inference-infra · 2026-05-08

Why LLM Inference Needs a New Kind of Router - Part 1

Score 14

inference moe

High signal Matched: inference, router

AI2 · research · 2026-05-08

EMO: Pretraining mixture of experts for emergent modularity

Score 12

EMO is a new mixture-of-experts model trained so modular expert groups emerge from data, enabling users to select small task-specific expert subsets while preserving near full-model performance.

moe benchmark model-release training

High signal Matched: mixture of experts, performance, model, pretraining

Nota AI · korea · 2026-04-29

[NVIDIA Nemotron Hackathon] Grand Prize Among 20 Teams: Behind Two Sleepless Days

Score 32

Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...

inference moe benchmark model-release research korea training fine-tuning quantization evals agents

High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

Nota AI · korea · 2026-03-20

GenAI Everywhere: The Future of Edge AI Optimization with the New NetsPresso®

Score 26

NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...

inference kv-cache moe benchmark model-release research korea quantization

High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

vLLM Project · open-source · 2026-03-10

vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain

Score 18

Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...

moe model-release rag

High signal Matched: router, release, model, retrieval

Hugging Face · open-source · 2026-02-26

Mixture of Experts (MoEs) in Transformers

Score 10

moe

High signal Matched: mixture of experts

vLLM Project · open-source · 2026-02-26

Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Score 30

Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...

serving moe hardware model-release cloud

High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock

vLLM Project · open-source · 2026-02-13

DeepSeek-V3.2 on GB300: Performance Breakthrough

Score 22

DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

serving moe benchmark hardware quantization

High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization

vLLM Project · open-source · 2026-01-05

vLLM Semantic Router v0.1 Iris: The First Major Release

Score 16

vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...

moe model-release

High signal Matched: router, release

vLLM Project · open-source · 2025-12-16

AMD × vLLM Semantic Router: Building the System Intelligence Together

Score 14

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...

moe benchmark

High signal Matched: router, performance

vLLM Project · open-source · 2025-12-13

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Score 26

Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...

inference serving moe benchmark model-release

High signal Matched: serving, prefill, router, performance, model

llm-d · open-source · 2025-12-02

llm-d 0.4: Achieve SOTA Performance Across Accelerators

Score 30

llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.

inference kv-cache speculative-decoding moe benchmark hardware frontier-model

High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota

Together AI · inference-infra · 2025-10-10

AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators

Score 20

LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.

inference moe benchmark hardware

High signal Matched: inference, deepseek-v3, performance, accelerator

Together AI · inference-infra · 2025-08-27

DeepSeek-V3.1: Hybrid Thinking Model Now Available on Together AI

Score 16

Access DeepSeek-V3.1 on Together AI: MIT-licensed hybrid model with thinking/non-thinking modes, 66% SWE-bench Verified, serverless deployment, 99.9% SLA.

moe model-release evals

High signal Matched: deepseek-v3, model, swe-bench

llm-d · open-source · 2025-07-29

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

Score 10

llm-d v0.2 introduces well-lit paths for Kubernetes LLM deployment: intelligent scheduling, P/D disaggregation, and MoE support with vLLM optimizations.

moe

High signal Matched: moe

SkyPilot · open-source · 2023-12-21