Fireworks AI · inference-infra · 2026-06-03
Training-Inference Parity in MoE Models: Where Numerics Drift
No feed summary available yet.
High signal Matched: inference, moe
vLLM Project · open-source · 2026-06-02
Long-horizon LLM agents create a routing problem that single-turn prompt routers were not designed to solve. A router still needs to know which model is best for the current request, but it also...
High signal Matched: router, model, agents, agentic
vLLM Project · open-source · 2026-05-28
Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...
High signal Matched: serving, endpoint, router, model
Modular · inference-infra · 2026-05-21
Why LLM Inference Needs a New Kind of Router - Part 2
High signal Matched: inference, router
vLLM Project · open-source · 2026-05-14
Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...
High signal Matched: serving, throughput, kv cache, moe
LMCache · open-source · 2026-05-13
A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters. Key Summary: We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...
High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic
Nota AI · korea · 2026-05-11
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
Modular · inference-infra · 2026-05-08
Why LLM Inference Needs a New Kind of Router - Part 1
High signal Matched: inference, router
AI2 · research · 2026-05-08
EMO is a new mixture-of-experts model trained so modular expert groups emerge from data, enabling users to select small task-specific expert subsets while preserving near full-model performance.
High signal Matched: mixture of experts, performance, model, pretraining
Nota AI · korea · 2026-04-29
Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...
High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic
LMCache · open-source · 2026-04-04
Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...
High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents
Nota AI · korea · 2026-03-20
NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...
High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization
Nota AI · korea · 2026-03-13
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
vLLM Project · open-source · 2026-03-10
Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...
High signal Matched: router, release, model, retrieval
Hugging Face · open-source · 2026-02-26
No feed summary available yet.
High signal Matched: mixture of experts
vLLM Project · open-source · 2026-02-26
Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...
High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock
vLLM Project · open-source · 2026-02-13
DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...
High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization
vLLM Project · open-source · 2026-01-05
vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...
High signal Matched: router, release
vLLM Project · open-source · 2025-12-16
Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...
High signal Matched: router, performance
vLLM Project · open-source · 2025-12-13
Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...
High signal Matched: serving, prefill, router, performance, model
llm-d · open-source · 2025-12-02
llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.
High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota
Together AI · inference-infra · 2025-10-10
LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.
High signal Matched: inference, deepseek-v3, performance, accelerator
Together AI · inference-infra · 2025-08-27
Access DeepSeek-V3.1 on Together AI: MIT-licensed hybrid model with thinking/non-thinking modes, 66% SWE-bench Verified, serverless deployment, 99.9% SLA.
High signal Matched: deepseek-v3, model, swe-bench
llm-d · open-source · 2025-07-29
llm-d v0.2 introduces well-lit paths for Kubernetes LLM deployment: intelligent scheduling, P/D disaggregation, and MoE support with vLLM optimizations.
High signal Matched: moe
SkyPilot · open-source · 2023-12-21
A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.
High signal Matched: serving, mixtral, cost, gpu, model
Hugging Face · open-source · 2023-12-11
No feed summary available yet.
High signal Matched: mixture of experts, mixtral, sota
Hugging Face · open-source · 2023-12-11
No feed summary available yet.
High signal Matched: mixture of experts