MLSys Radar

moe

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

inference serving kv-cache moe benchmark

High signal Matched: serving, throughput, kv cache, moe

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI · NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

Nota AI · korea · 2026-04-29

[NVIDIA Nemotron Hackathon] Grand Prize Among 20 Teams: Behind Two Sleepless Days

Score 32

Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...

inference moe benchmark model-release research korea training fine-tuning quantization evals agents

High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

Nota AI · korea · 2026-03-20

GenAI Everywhere: The Future of Edge AI Optimization with the New NetsPresso®

Score 26

NP Product Team, Nota AI · The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...

inference kv-cache moe benchmark model-release research korea quantization

High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI · ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context