Moreh · korea · 2026-06-03
Distributed Inference on Heterogeneous Accelerators Including GPUs, Rubin CPX, and AI Accelerators
No feed summary available yet.
High signal Matched: inference, distributed
TensorRT-LLM · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: generation, distributed
Runpod · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: multi-node, gpu
PyTorch Foundation · open-source · 2026-06-01
TL;DR: This case study demonstrates how LinkedIn re-architected its distributed linear programming solver, DuaLip, by developing a GPU-accelerated PyTorch version to handle extreme-scale optimization challenges like web applications. This...
High signal Matched: distributed, gpu
NVIDIA Technical Blog · hardware · 2026-06-01
The rise of autonomous, long-running AI agents has introduced a new class of compute demand, namely tasks that maintain large context windows, spawn concurrent...
High signal Matched: multi-node, agents
Lambda · cloud · 2026-06-01
When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...
High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic
AMD ROCm Blogs · hardware · 2026-05-25
Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...
High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate
NVIDIA Technical Blog · hardware · 2026-05-07
Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down,...
High signal Matched: distributed, nccl, performance, gpu, training
vLLM Project · open-source · 2026-05-06
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...
High signal Matched: serving, throughput, distributed, kv cache, agentic
SkyPilot · open-source · 2026-05-01
We ran hundreds of benchmarks to tune storage systems for distributed training so you don’t have to.
High signal Matched: distributed, training, distributed training, benchmarks
Nota AI · korea · 2026-04-08
Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...
High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate
NVIDIA Technical Blog · hardware · 2026-03-16
Reasoning models are growing rapidly in size and are increasingly being integrated into agentic AI workflows that interact with other models and external tools....
High signal Matched: inference, multi-node, agentic
SkyPilot · open-source · 2026-01-13
Run Meta's SAM3 on large video archives distributed across AWS and Kubernetes clusters with SkyPilot Pools.
High signal Matched: distributed
Together AI · inference-infra · 2026-01-12
Learn how foundation models are trained at scale using multi-node GPU clusters, including distributed training techniques, infrastructure requirements, and practical steps to scale training efficiently.
High signal Matched: distributed, multi-node, gpu, model, training, distributed training
Nota AI · korea · 2025-12-19
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
vLLM Project · open-source · 2025-11-22
Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...
High signal Matched: serving, multi-node, launch
Modal · inference-infra · 2025-11-04
How we built a real-time voice bot on Modal's distributed serverless platform.
High signal Matched: distributed, latency
llm-d · open-source · 2025-09-24
See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.
High signal Matched: inference, throughput, distributed, benchmarks
SkyPilot · open-source · 2025-09-11
High signal Matched: distributed, training
Hugging Face · open-source · 2025-08-08
No feed summary available yet.
High signal Matched: multi-gpu, gpu, training
SkyPilot · open-source · 2025-07-16
This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...
High signal Matched: infiniband, performance, cost, gpu, cloud
Modal · inference-infra · 2025-07-11
Welcome to another round of Modal Product Updates! Here's what's new this month.
High signal Matched: multi-node, b200, release, training
llm-d · open-source · 2025-05-20
Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.
High signal Matched: inference, serving, distributed, performance, introducing, sota
llm-d · open-source · 2025-05-20
Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.
High signal Matched: inference, distributed, release, cloud, open source
SkyPilot · open-source · 2025-03-20
How to accelerate distributed embedding generation? Use the "forgotten" regions.
High signal Matched: inference, generation, distributed
AIBrix · open-source · 2025-03-10
This blog post introduces how to deploy DeepSeek-R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through a step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...
High signal Matched: inference, distributed, benchmark, model, weights, training, context length
AIBrix · open-source · 2025-02-19
We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...
High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent
Hugging Face · open-source · 2022-10-21
No feed summary available yet.
High signal Matched: distributed, training, distributed training
Hugging Face · open-source · 2021-11-19
No feed summary available yet.
High signal Matched: distributed, fine-tuning
Hugging Face · open-source · 2021-04-08
No feed summary available yet.
High signal Matched: distributed, sagemaker, training, distributed training
LY Corporation Tech Blog · korea · 2026-03-11
In November 2025, mobile engineers from our Tokyo and Ho Chi Minh City (HCMC) Development Centers ca...
Watchlist Matched: distributed