long-context - MLSys Blogs

Moreh · korea · 2026-06-03

Optimizing Long-Context Prefill on Multiple (Older-Generation) GPU Nodes

Score 23

No feed summary available yet.

inference hardware long-context

Open

High signal Matched: prefill, generation, gpu, long-context

Together AI · inference-infra · 2026-05-11

Serving DeepSeek-V4: why million-token context is an inference systems problem

Score 22

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...

inference serving kernel hardware long-context api

Open

High signal Matched: inference, serving, endpoint, kernel, b200, long-context

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

Open

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

Together AI · inference-infra · 2026-04-29

DeepSeek-V4 Pro now available on Together AI

Score 10

DeepSeek-V4 Pro is now available on Together AI with 512K context, controllable reasoning modes, and cached-input pricing for long-context reasoning workloads like code agents, document intelligence, and research synthesis.

research long-context agents

Open

High signal Matched: research, long-context, agents

Hugging Face · open-source · 2026-04-29

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

Score 10

No feed summary available yet.

model-release long-context agents

Open

High signal Matched: introducing, long-context, agents

vLLM Project · open-source · 2026-04-22

The State of FP8 KV-Cache and Attention Quantization in vLLM

Score 18

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

inference serving kv-cache hardware model-release quantization long-context

Open

High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context

Together AI · inference-infra · 2026-03-26

Plan, divide, and conquer: How weak models excel at long context tasks

Score 10

As context windows grow, LLM performance degrades in unexpected ways. We show how a "Divide & Conquer" framework — breaking long documents into parallel chunks with a planner, workers, and manager — lets smaller models like Llama-3-70B and...

benchmark long-context

Open

High signal Matched: performance, long context

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

Open

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

Open

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference serving benchmark model-release research training evals long-context rag

Open

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

Together AI · inference-infra · 2026-03-11

Together AI Brings NVIDIA Nemotron 3 to Developers on Day 0

Score 10

NVIDIA Nemotron 3 Super is now available on Together AI Dedicated Inference, delivering efficient multi-agent reasoning, a 1M-token context window, and production-grade deployment on managed infrastructure.

inference long-context agents

Open

High signal Matched: inference, context window, agent

Together AI · inference-infra · 2026-03-04

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Score 20

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...

inference serving benchmark long-context

Open

High signal Matched: inference, serving, prefill, throughput, long-context

Hugging Face · open-source · 2025-04-16

Introducing HELMET: Holistically Evaluating Long-context Language Models

Score 10

No feed summary available yet.

model-release evals long-context

Open

High signal Matched: introducing, evaluating, long-context

AIBrix · open-source · 2025-03-10

DeepSeek-R1 671B multi-host Deployment in AIBrix

Score 20

This blog post introduces deploying DeepSeek-R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through a step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...

inference distributed benchmark model-release training long-context

Open

High signal Matched: inference, distributed, benchmark, model, weights, training, context length

Stanford CRFM · research · 2026-06-03

HELM Long Context

Score 2

No feed summary available yet.

long-context

Open

Watchlist Matched: long context

vLLM Project · open-source · 2026-04-24

DeepSeek V4 in vLLM: Efficient Long-context Attention

Score 3

A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.

long-context

Open

Watchlist Matched: long-context

AI2 · research · 2026-04-23

OlmPool: How small architectural choices compound to undermine long context extension

Score 0

OlmPool is a controlled suite of 26 models showing how small architecture choices can compound to make long-context extension much harder, even when training data and extension recipes are held constant.

training long-context

Open

Watchlist Matched: training, long context, long-context

Hugging Face · open-source · 2025-07-08