MLSys Radar

long-context

Together AI · inference-infra · 2026-05-11

Serving DeepSeek-V4: why million-token context is an inference systems problem

Score 22

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...

inference serving kernel hardware long-context api

Open

High signal Matched: inference, serving, endpoint, kernel, b200, long-context

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28

.apr-fig { text-align: center; margin: 1.35em 0; line-height: 1.4; } .apr-fig--wide img { display: inline-block; width: 100%; max-width: 100%; height: auto; vertical-align: middle; } .apr-fig--wide-0-8 { max-width: 80%; margin-left: auto;...

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

Open

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

Together AI · inference-infra · 2026-04-29

DeepSeek-V4 Pro now available on Together AI

Score 10

DeepSeek-V4 Pro is now available on Together AI with 512K context, controllable reasoning modes, and cached-input pricing for long-context reasoning workloads like code agents, document intelligence, and research synthesis.

research long-context agents

Open

High signal Matched: research, long-context, agents

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

  Jaehoon Lee Technical Content Manager, Nota AI   GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

Open

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

  Hancheol Park, Ph. D. AI Research Engineer, Nota AI Tairen PiaoAI Research Engineer, Nota AI Tae-Ho KimCTO & Co-Founder, Nota AI ✔️ Resource : The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

Open

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

--> Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference serving benchmark model-release research training evals long-context rag

Open

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

Together AI · inference-infra · 2026-03-04

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Score 20

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...

inference serving benchmark long-context

Open

High signal Matched: inference, serving, prefill, throughput, long-context

AIBrix · open-source · 2025-03-10

DeepSeek-R1 671B multi-host Deployment in AIBrix

Score 20

This blog post introduces deploying DeepSeek R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...

inference distributed benchmark model-release training long-context

Open

High signal Matched: inference, distributed, benchmark, model, weights, training, context length