Moreh · korea · 2026-06-03
Optimizing Long-Context Prefill on Multiple (Older-Generation) GPU Nodes
No feed summary available yet.
High signal Matched: prefill, generation, gpu, long-context
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: performance, mi300x, evaluation
Gcore · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: gpu, cloud, training
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gpu, agent
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gpu, cloud
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: inference, gpu
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gb200, cloud
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gpu, cloud
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gpu, cloud
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gpu, introducing
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: inference, mi300x
Perplexity Research · model-lab · 2026-06-03
No feed summary available yet.
High signal Matched: blackwell
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gpu
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: gpu
Runpod · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: multi-node, gpu
Vast.ai · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: gpu, cloud
CoreWeave · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: gpu
CoreWeave · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: blackwell
Crusoe · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: gb200
Crusoe · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: b200
Crusoe · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: h200
Crusoe · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: h100
Crusoe · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: mi300x
FriendliAI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, kv cache, gpu
Lambda · cloud · 2026-06-03
Lambda workspaces help teams organize cloud resources, control access, and separate dev, staging, and production in shared GPU environments. A junior researcher kills a production training run. A contractor sees weights they shouldn't. If...
High signal Matched: gpu, introducing, weights, cloud, training
AWS Machine Learning Blog · cloud · 2026-06-02
If you’re iterating on deploying large language models (LLMs) on AWS GPU instances, you’ve probably noticed the larger the model to be loaded into GPU High Bandwidth Memory (HBM), the longer the painful wait until the GPUs are ready for in...
High signal Matched: inference, gpu, model
PyTorch Foundation · open-source · 2026-06-01
TL;DR: This case study demonstrates how LinkedIn re-architected its distributed linear programming solver, DuaLip, by developing a GPU-accelerated PyTorch version to handle extreme-scale optimization challenges like web applications. This...
High signal Matched: distributed, gpu
Lambda · cloud · 2026-06-01
When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...
High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic
NVIDIA Technical Blog · hardware · 2026-06-01
Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems....
High signal Matched: gpu, pretraining, agentic
AMD ROCm Blogs · hardware · 2026-06-01
Reinforcement learning (RL) is rapidly becoming a foundational technology for Large Language Models (LLMs)—powering key abilities such as reasoning and agentic behaviors. As RL workloads grow more complex and computationally intensive, the...
High signal Matched: performance, gpu, agentic
AWS Machine Learning Blog · cloud · 2026-05-30
This post demonstrates a comprehensive observability solution using Amazon Managed Grafana dashboards that provides a holistic view of both quality and quantity for LLMs served on Amazon SageMaker AI endpoints with inference components.
High signal Matched: inference, gpu, sagemaker
Nota AI · korea · 2026-05-29
Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...
High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard
Together AI · inference-infra · 2026-05-29
Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem.
High signal Matched: inference, gpu
AMD ROCm Blogs · hardware · 2026-05-29
Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...
High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model
AMD ROCm Blogs · hardware · 2026-05-29
Quantum computing offers a fundamentally different approach to computational problems by leveraging quantum mechanical properties such as superposition and entanglement. Unlike a classical bit, which is always 0 or 1, a qubit can exist in...
High signal Matched: benchmark, cost, gpu
PyTorch Foundation · open-source · 2026-05-28
When you use PyTorch’s compiler, your model runs faster, up to 10x faster. But what’s actually happening? Without compilation, the GPU runs a kernel, a function on the GPU, for...
High signal Matched: kernel, gpu, model
PyTorch Foundation · open-source · 2026-05-28
TL;DR: The TokenSpeed inference engine achieved a record-breaking 580 tps running the Qwen3.5-397B-A17B model on GPUs. This extreme performance for agentic workloads is driven by systematic elimination of memory copies,...
High signal Matched: inference, performance, gpu, model, agentic
NVIDIA Technical Blog · hardware · 2026-05-27
Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to...
High signal Matched: inference, blackwell
PyTorch Foundation · open-source · 2026-05-26
Code available at: https://github.com/facebookresearch/ads_model_kernel_library In this post, we present the design of TLX Block Attention — a Triton kernel targeting NVIDIA Blackwell GPUs that exploits compile-time knowledge of a block-di...
High signal Matched: kernel, triton, blackwell, model
NVIDIA Technical Blog · hardware · 2026-05-26
Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...
High signal Matched: cuda, performance, gpu
NVIDIA Technical Blog · hardware · 2026-05-26
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...
High signal Matched: cuda, performance, gpu, launch
NVIDIA Technical Blog · hardware · 2026-05-26
Precision medicine depends on two fundamental capabilities: understanding disease at the genomic level and identifying treatments at the molecular level. ...
High signal Matched: blackwell
AMD ROCm Blogs · hardware · 2026-05-25
Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...
High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate
NVIDIA Technical Blog · hardware · 2026-05-21
Maximizing the value of AI infrastructure demands deep visibility into GPU utilization. Yet many platform teams running AI workloads on Kubernetes operate with...
High signal Matched: gpu
NVIDIA Technical Blog · hardware · 2026-05-21
As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on...
High signal Matched: performance, gb200
Lambda · cloud · 2026-05-20
What the numbers mean for financial services. Executive summary: Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...
High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating
AMD ROCm Blogs · hardware · 2026-05-20
AMD released ROCm Core 7.13, the AMD GPU Driver 31.30, and AMD GPU Virtualization 9.0. With these releases, ROCm software expands hardware support across enterprise datacenters. The platform introduces AMD’s latest Instinct accelerators, e...
High signal Matched: performance, gpu, rocm, open-source
PyTorch Foundation · open-source · 2026-05-19
TL;DR: Introducing the ExecuTorch MLX Delegate. The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework. The delegate seamlessly integrates with...
High signal Matched: inference, gpu, introducing
LMCache · open-source · 2026-05-13
A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters. Key Summary: We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...
High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic
NVIDIA Technical Blog · hardware · 2026-05-11
The compute capability of large GPU fleets presents unprecedented opportunities to innovate and provide value to customers in record time. Yet these...
High signal Matched: gpu, introducing
Nota AI · korea · 2026-05-11
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
Together AI · inference-infra · 2026-05-11
DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...
High signal Matched: inference, serving, endpoint, kernel, b200, long-context
vLLM Project · open-source · 2026-05-11
TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...
High signal Matched: performance, gpu, quantization
Together AI · inference-infra · 2026-05-08
Learn how to deploy any Hugging Face model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.
High signal Matched: inference, gpu, release, model
NVIDIA Technical Blog · hardware · 2026-05-07
NVIDIA GB200 NVL72 introduces a fundamentally new way to build GPU clusters by extending NVIDIA NVLink coherence across an entire rack. This design enables...
High signal Matched: gpu, gb200
NVIDIA Technical Blog · hardware · 2026-05-07
Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down,...
High signal Matched: distributed, nccl, performance, gpu, training
Modal · inference-infra · 2026-05-04
If we've said it once, we've said it once per millisecond: never block the GPU.
High signal Matched: inference, performance, gpu
NVIDIA Technical Blog · hardware · 2026-04-30
NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and...
High signal Matched: kernel, cuda, gpu, model, agents
NVIDIA Technical Blog · hardware · 2026-04-28
For decades, computational biology has operated under a reductionist compromise. To fit complex biological systems into the limited memory of a single GPU,...
High signal Matched: gpu
NVIDIA Technical Blog · hardware · 2026-04-24
DeepSeek just launched its fourth generation of flagship models with DeepSeek-V4-Pro and DeepSeek-V4-Flash, both targeted at enabling highly efficient...
High signal Matched: generation, gpu, blackwell
LMCache · open-source · 2026-04-23
Overview: Large language model (LLM) inference performance depends heavily on how efficiently the system manages the key-value (KV) cache, the stored attention states that allow the model to avoid recomputing previous tokens. As context length...
High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker
NVIDIA Technical Blog · hardware · 2026-04-22
AI integration is redefining mainstream enterprise applications, from productivity software like Microsoft Office to more complex design and engineering tools....
High signal Matched: blackwell
Nota AI · korea · 2026-04-22
Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...
High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source
vLLM Project · open-source · 2026-04-22
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...
High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context
SkyPilot · open-source · 2026-04-22
Introducing GPU Compass: One dashboard to browse, compare pricing, and launch across every GPU cloud.
High signal Matched: gpu, introducing, launch, cloud
llm-d · open-source · 2026-04-21
How migrating from a simple vLLM deployment to a robust MLOps platform utilizing KServe, llm-d's intelligent routing, and vLLM solved significant scaling and operational challenges in LLM deployment through deep customization and prefix-ca...
High signal Matched: inference, gpu
Together AI · inference-infra · 2026-04-21
Learn how AI-native companies design multi-tenant GPU clusters that pool capacity without sacrificing team isolation — and how Together AI makes it work in practice.
High signal Matched: gpu
LMCache · open-source · 2026-04-16
TL;DR: TurboQuant allows you to put 4x more context in your GPU without blowing up GPU memory or dropping AI’s intelligence. It does so by quantizing the memory of large language models, also known as KV cache, an important bottleneck ment...
High signal Matched: inference, kv cache, lmcache, gpu
NVIDIA Technical Blog · hardware · 2026-04-14
When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is data transfer performance. This applies to...
High signal Matched: cuda, performance, gpu
Modular · inference-infra · 2026-04-13
TileTensor Part 1 - Safer, More Efficient GPU Kernels
High signal Matched: gpu
NVIDIA Technical Blog · hardware · 2026-04-09
Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations...
High signal Matched: gpu, open source
Nota AI · korea · 2026-04-08
Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...
High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate
NVIDIA Technical Blog · hardware · 2026-04-07
The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18...
High signal Matched: gb200, blackwell
vLLM Project · open-source · 2026-04-07
TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...
High signal Matched: inference, prefill, itl, gpu, mi300x
NVIDIA Technical Blog · hardware · 2026-04-02
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...
High signal Matched: throughput, gpu, model
Rebellions · hardware · 2026-04-02
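Summary. Challenge: In the Middle East, where the oil and gas industry is well developed, the wastewater and oil that inevitably arise during crude oil production must be treated. In particular, reservoirs and... The post "NPU Server-Based Physical AI: A Water Purification Robot Solution for the United Arab Emirates (UAE)" appeared first on Rebellions.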
High signal Matched: npu, rebellions
NVIDIA Technical Blog · hardware · 2026-04-01
In today’s AI factory environment, performance is not theoretical. It is economic, competitive, and existential. A 1% drop in usable GPU time can mean...
High signal Matched: performance, gpu
Together AI · inference-infra · 2026-04-01
The team behind FlashAttention and ThunderKittens — how Together AI's kernel researchers close the gap between GPU hardware and production AI.
High signal Matched: kernel, flashattention, gpu
NVIDIA Technical Blog · hardware · 2026-03-31
Spatial computing is moving from visualization to active collaboration, adding increasingly more GPU demands on XR hardware to render photorealistic,...
High signal Matched: gpu
Nota AI · korea · 2026-03-31
Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...
High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model
Modular · inference-infra · 2026-03-30
Software Pipelining for GPU Kernels: Part 1 - The Pipeline Problem
High signal Matched: gpu
NVIDIA Technical Blog · hardware · 2026-03-25
In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...
High signal Matched: throughput, gpu, model
Nota AI · korea · 2026-03-23
Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...
High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source
SkyPilot · open-source · 2026-03-19
Karpathy's autoresearch runs one experiment at a time. We gave it access to our GPU infra and let it run experiments in parallel.
High signal Matched: gpu, agent
NVIDIA Technical Blog · hardware · 2026-03-16
NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of...
High signal Matched: inference, latency, accelerator
Modular · inference-infra · 2026-03-16
Modular at NVIDIA GTC 2026: MAX on Blackwell, Mojo Kernel Porting, and DeepSeek V3 on B200
High signal Matched: kernel, b200, blackwell
Nota AI · korea · 2026-03-13
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
Together AI · inference-infra · 2026-03-10
Together GPU Clusters now include built-in autoscaling, RBAC, full-stack observability, and self-healing node repair—giving teams production-ready GPU infrastructure that scales efficiently, stays resilient, and supports shared enterprise...
High signal Matched: gpu
Together AI · inference-infra · 2026-03-05
As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...
High signal Matched: throughput, kernel, flashattention, gpu
vLLM Project · open-source · 2026-02-27
For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.
High signal Matched: inference, performance, rocm
vLLM Project · open-source · 2026-02-26
Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...
High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock
SkyPilot · open-source · 2026-02-21
SkyPilot Admin Policies let you enforce cost controls, security rules, and compliance requirements automatically — without slowing down your engineering team.
High signal Matched: cost, gpu
vLLM Project · open-source · 2026-02-13
DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...
High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization
llm-d · open-source · 2026-02-10
llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.
High signal Matched: inference, throughput, kv cache, ttft, gpu
vLLM Project · open-source · 2026-02-03
Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...
High signal Matched: serving, throughput, performance, h200, gb200, blackwell
vLLM Project · open-source · 2026-02-01
TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...
High signal Matched: performance, blackwell, model, open-source, oss
Together AI · inference-infra · 2026-01-22
Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.
High signal Matched: inference, throughput, latency, cost, gpu
Modular · inference-infra · 2026-01-14
How to Beat Unsloth's CUDA Kernel Using Mojo—With Zero GPU Experience
High signal Matched: kernel, cuda, gpu
Together AI · inference-infra · 2026-01-13
Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...
High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents
Together AI · inference-infra · 2026-01-12
Learn how foundation models are trained at scale using multi-node GPU clusters, including distributed training techniques, infrastructure requirements, and practical steps to scale training efficiently.
High signal Matched: distributed, multi-node, gpu, model, training, distributed training
SqueezeBits · korea · 2025-12-24
Introducing ATOM™-Max, Rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...
High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions
Nota AI · korea · 2025-12-19
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
vLLM Project · open-source · 2025-12-17
In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...
High signal Matched: serving, h200
vLLM Project · open-source · 2025-12-03
Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...
High signal Matched: cuda, gpu, introducing
Modal · inference-infra · 2025-12-02
We've partnered with Mistral to bring you Day 0 support for Mistral 3, with GPU-snapshot-optimized performance.
High signal Matched: performance, gpu
llm-d · open-source · 2025-12-02
llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.
High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota
Together AI · inference-infra · 2025-12-01
Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...
High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source
Rebellions · hardware · 2025-11-20
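Summary. Challenge: As the number of pet owners has grown recently, demand for X-ray diagnostic imaging is expanding rapidly. However, veterinarians in Korea specializing in radiology number in the hundreds... The post "AI-Based Veterinary Diagnostic Imaging Assistance Service Powered by NPU" appeared first on Rebellions.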
High signal Matched: npu, rebellions
Modular · inference-infra · 2025-11-20
Modular 25.7: Faster Inference, Safer GPU Programming, and a More Unified Developer Experience
High signal Matched: inference, gpu
Modal · inference-infra · 2025-11-19
Learn how Reducto used GPU memory snapshotting and flexible autoscaling to build fast multi-model pipelines.
High signal Matched: latency, gpu, model
Modal · inference-infra · 2025-11-18
Never block the GPU.
High signal Matched: inference, gpu
Hugging Face · open-source · 2025-11-17
No feed summary available yet.
High signal Matched: rocm
Rebellions · hardware · 2025-11-07
New possibilities for LLM inference, experienced firsthand on Rebellions NPUs. Following the vLLM Korea Meetup in August, on October 29, organized by Rebellions and SqueezeBits, vLLM... The post vLLM Hands-on Workshop WrapUp appeared first on Rebellions.
High signal Matched: npu, korea, rebellions
SqueezeBits · korea · 2025-10-28
Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.
High signal Matched: inference, serving, gemm, performance, h100, training
SkyPilot · open-source · 2025-10-21
AWS Batch works well for traditional enterprise batch processing (see their case studies 1 and 2). But AI workloads have different requirements - they’re more interactive, need flexible GPU access, and benefit from simpler iteration...
High signal Matched: inference, gpu
Google Research · big-tech · 2025-10-15
Generative AI
High signal Matched: npu
Together AI · inference-infra · 2025-10-15
We've launched the Together AI Startup Accelerator: Up to $50K credits, expert engineering hours, GTM support, community and VC access for AI-native apps in build–scale tiers.
High signal Matched: accelerator
Together AI · inference-infra · 2025-10-10
LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.
High signal Matched: inference, deepseek-v3, performance, accelerator
llm-d · open-source · 2025-10-10
llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.
High signal Matched: inference, latency, tokens/sec, gpu, tpu
Modular · inference-infra · 2025-09-19
Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA
High signal Matched: blackwell, sota
Modal · inference-infra · 2025-09-16
Exploring the internals of our new product, a modern Jupyter notebook built for fast startup and real-time collaboration.
High signal Matched: gpu, cloud
SkyPilot · open-source · 2025-09-12
SkyPilot now supports detailed GPU metrics across multiple Kubernetes clusters in the dashboard for better observability.
High signal Matched: gpu
Modular · inference-infra · 2025-09-12
Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance
High signal Matched: performance, blackwell, sota
Together AI · inference-infra · 2025-09-09
Together AI launches Instant Clusters: self-service GPU clusters with NVIDIA H100/B200, ready in minutes for training or inference at any scale.
High signal Matched: inference, gpu, h100, b200, training
Modular · inference-infra · 2025-09-05
Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul
High signal Matched: matmul, blackwell
Modular · inference-infra · 2025-08-28
Matrix Multiplication on Blackwell: Part 1 - Introduction
High signal Matched: blackwell
SqueezeBits · korea · 2025-08-26
In this article, we show how to run LLMs efficiently on Apple Silicon with a disaggregated inference technique.
High signal Matched: inference, prefill, gpu, npu
Rebellions · hardware · 2025-08-21
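The case of Kolon Benit, which developed "AI Vision Intelligence", a differentiated AI-based safety monitoring system that overcomes the limitations of existing systems through multimodal AI combining vision and language models and a hybrid infrastructure combining GPUs and NPUs. The post "Realizing Prevention-Focused Site Safety Management for Construction & Plant Projects with AI" appeared first on Rebellions.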
High signal Matched: gpu, npu, rebellions
SkyPilot · open-source · 2025-08-21
Avataar's enterprise AI content platform cut costs 11x and unlocked GPU capacity by migrating from an inflexible SLURM deployment to SkyPilot's multi-cloud infrastructure.
High signal Matched: gpu, cloud
Hugging Face · open-source · 2025-08-18
No feed summary available yet.
High signal Matched: cuda, gpu
Modal · inference-infra · 2025-08-11
Welcome to another round of Modal Product Updates! Here's what's new this month.
High signal Matched: gpu
Hugging Face · open-source · 2025-08-08
No feed summary available yet.
High signal Matched: multi-gpu, gpu, training
AIBrix · open-source · 2025-08-05
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...
High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud
Modal · inference-infra · 2025-07-30
Using GPU snapshots to enable sub-second container startup times.
High signal Matched: gpu
Together AI · inference-infra · 2025-07-17
Together AI inference is now among the world’s fastest, most capable platforms for running open-source reasoning models like DeepSeek-R1 at scale, thanks to our new inference engine designed for NVIDIA HGX B200.
High signal Matched: inference, b200, blackwell, open-source
SkyPilot · open-source · 2025-07-16
This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...
High signal Matched: infiniband, performance, cost, gpu, cloud
Modal · inference-infra · 2025-07-11
Welcome to another round of Modal Product Updates! Here's what's new this month.
High signal Matched: multi-node, b200, release, training
SkyPilot · open-source · 2025-07-08
If you’re an infrastructure or MLOps engineer at a large company, you know the drill. The ML team comes to you with requirements that change weekly. They need GPUs yesterday, but the budget was set six months ago. They want to use th...
High signal Matched: cost, gpu
Hugging Face · open-source · 2025-06-03
No feed summary available yet.
High signal Matched: gpu
Modal · inference-infra · 2025-05-30
We’re excited to be making Nvidia B200 and H200 GPUs available on Modal starting today!
High signal Matched: h200, b200, introducing
Modular · inference-infra · 2025-05-29
Modverse #48: Modular Platform 25.3, MAX AI Kernels, and the Modular GPU Kernel Hackathon
High signal Matched: kernel, gpu
Modular · inference-infra · 2025-05-20
Modular GPU Kernel Hackathon Highlights: Innovation, Community, & Mojo🔥
High signal Matched: kernel, gpu
Replicate · inference-infra · 2025-05-16
NVIDIA H100 GPUs are here, with better performance and lower cost.
High signal Matched: performance, cost, h100
Together AI · inference-infra · 2025-04-24
No feed summary available yet.
High signal Matched: blackwell
Modular · inference-infra · 2025-04-17
Modverse #47: MAX 25.2 and an evening of GPU programming at Modular HQ
High signal Matched: gpu
Modular · inference-infra · 2025-03-25
MAX 25.2: Unleash the power of your H200's–without CUDA!
High signal Matched: cuda, h200
Modal · inference-infra · 2025-02-24
A guide to maximizing the utilization of GPUs, from cloud allocations to FLOP/s.
High signal Matched: gpu, cloud
Modal · inference-infra · 2025-02-21
GPU documentation for the people, now by the people.
High signal Matched: gpu
AIBrix · open-source · 2025-02-19
We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...
High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent
SqueezeBits · korea · 2025-01-13
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, fp8, quantization, evaluate
SqueezeBits · korea · 2025-01-06
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluation, evaluate
Hugging Face · open-source · 2024-12-24
No feed summary available yet.
High signal Matched: gpu
Modular · inference-infra · 2024-12-17
Introducing MAX 24.6: A GPU Native Generative AI Platform
High signal Matched: gpu, introducing
Modular · inference-infra · 2024-12-17
MAX GPU: State of the Art Throughput on a New GenAI platform
High signal Matched: throughput, gpu, state of the art
SqueezeBits · korea · 2024-12-03
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluation, evaluate
SqueezeBits · korea · 2024-11-21
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluate
AIBrix · open-source · 2024-11-13
In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...
High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source
Modal · inference-infra · 2024-10-25
Why Modal is obsessed with serverless AI infrastructure
High signal Matched: gpu
Modal · inference-infra · 2024-08-06
...and we're passing the savings to you. 15-30% price cuts on GPUs and CPUs.
High signal Matched: gpu
Modal · inference-infra · 2024-06-20
Isolate your tasks with Modal containers while using Airflow for orchestration.
High signal Matched: gpu
Replicate · inference-infra · 2024-06-12
We'll soon support NVIDIA's H100 GPUs for predictions and training. Let us know if you want early access.
High signal Matched: h100, training
Hugging Face · open-source · 2024-05-21
No feed summary available yet.
High signal Matched: gpu
SqueezeBits · korea · 2024-04-23
The Blackwell GPU from GTC 2024 was astonishing. Analysis of the Nvidia GPU evolution & what it means for GPU users.
High signal Matched: gpu, blackwell
Hugging Face · open-source · 2024-04-02
No feed summary available yet.
High signal Matched: inference, gpu
Hugging Face · open-source · 2024-03-18
No feed summary available yet.
High signal Matched: h100, cloud
Hugging Face · open-source · 2024-02-29
No feed summary available yet.
High signal Matched: generation, accelerator
Modal · inference-infra · 2024-02-06
We’re excited to be making Nvidia H100 GPUs available on Modal starting today!
High signal Matched: h100, introducing
SkyPilot · open-source · 2023-12-21
A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.
High signal Matched: serving, mixtral, cost, gpu, model
Hugging Face · open-source · 2023-12-05
No feed summary available yet.
High signal Matched: gpu
Hugging Face · open-source · 2023-10-03
No feed summary available yet.
High signal Matched: inference, tpu, cloud
Hugging Face · open-source · 2023-06-13
No feed summary available yet.
High signal Matched: gpu
SkyPilot · open-source · 2023-05-30
Announcing SkyPilot 0.3: LLM support, new clouds, and enhanced production readiness.
High signal Matched: gpu
Hugging Face · open-source · 2023-05-15
No feed summary available yet.
High signal Matched: gpu, rocm
Hugging Face · open-source · 2023-03-28
No feed summary available yet.
High signal Matched: inference, accelerator
Hugging Face · open-source · 2023-03-09
No feed summary available yet.
High signal Matched: gpu, rlhf, fine-tuning
Replicate · inference-infra · 2022-08-31
How to run Stable Diffusion locally so you can hack on it
High signal Matched: gpu
Lambda · cloud · 2026-04-30
Harnesses: If you've used Claude Code or Codex, you've used a harness. A harness is the infrastructure layer that wraps an AI coding agent and decides how it operates, what it can touch, and how you measure whether it worked. It's how most...
Watchlist Matched: gpu, training, post-training, agent, agents, open-source
Modal · inference-infra · 2026-04-07
Product updates, community highlights, and upcoming events.
Watchlist Matched: blackwell, api
Together AI · inference-infra · 2025-09-10
Hiring Mahadev Konar further deepens Together AI’s commitment to deliver the most reliable and scalable GPU infrastructure.
Watchlist Matched: gpu
Replicate · inference-infra · 2024-06-14
Copy and paste a few commands into terminal to play with Stable Diffusion 3 on your own GPU-powered machine.
Watchlist Matched: gpu