Moreh · korea · 2026-06-03
Optimizing Long-Context Prefill on Multiple (Older-Generation) GPU Nodes
No feed summary available yet.
High signal Matched: prefill, generation, gpu, long-context
Together AI · inference-infra · 2026-05-11
DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...
High signal Matched: inference, serving, endpoint, kernel, b200, long-context
BAIR · research · 2026-05-08
High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model
Together AI · inference-infra · 2026-04-29
DeepSeek-V4 Pro is now available on Together AI with 512K context, controllable reasoning modes, and cached-input pricing for long-context reasoning workloads like code agents, document intelligence, and research synthesis.
High signal Matched: research, long-context, agents
Hugging Face · open-source · 2026-04-29
No feed summary available yet.
High signal Matched: introducing, long-context, agents
vLLM Project · open-source · 2026-04-22
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...
High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context
Together AI · inference-infra · 2026-03-26
As context windows grow, LLM performance degrades in unexpected ways. We show how a "Divide & Conquer" framework — breaking long documents into parallel chunks with a planner, workers, and manager — lets smaller models like Llama-3-70B and...
High signal Matched: performance, long context
Nota AI · korea · 2026-03-23
Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...
High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source
Nota AI · korea · 2026-03-13
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
BAIR · research · 2026-03-13
Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...
High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag
Together AI · inference-infra · 2026-03-11
NVIDIA Nemotron 3 Super is now available on Together AI Dedicated Inference, delivering efficient multi-agent reasoning, a 1M-token context window, and production-grade deployment on managed infrastructure.
High signal Matched: inference, context window, agent
Together AI · inference-infra · 2026-03-04
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...
High signal Matched: inference, serving, prefill, throughput, long-context
Hugging Face · open-source · 2025-04-16
No feed summary available yet.
High signal Matched: introducing, evaluating, long-context
AIBrix · open-source · 2025-03-10
This blog post shows how to deploy DeepSeek-R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through a step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...
High signal Matched: inference, distributed, benchmark, model, weights, training, context length
Stanford CRFM · research · 2026-06-03
No feed summary available yet.
Watchlist Matched: long context
vLLM Project · open-source · 2026-04-24
A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.
Watchlist Matched: long-context
AI2 · research · 2026-04-23
OlmPool is a controlled suite of 26 models showing how small architecture choices can compound to make long-context extension much harder, even when training data and extension recipes are held constant.
Watchlist Matched: training, long context, long-context
Hugging Face · open-source · 2025-07-08
No feed summary available yet.
Watchlist Matched: long-context
Hugging Face · open-source · 2025-03-12
No feed summary available yet.
Watchlist Matched: long context
Hugging Face · open-source · 2024-07-23
No feed summary available yet.
Watchlist Matched: long context