hardware

inference hardware model-release

High signal Matched: gpu, introducing, weights, cloud, training

AWS Machine Learning Blog · cloud · 2026-06-02

Accelerate LLM model loading and increase context windows with GPUDirect on Amazon FSx for Lustre and TurboQuant

Score 15

If you’re iterating on deploying large language models (LLMs) on AWS GPU instances, you’ve probably noticed the larger the model to be loaded into GPU High Bandwidth Memory (HBM), the longer the painful wait until the GPUs are ready for in...

High signal Matched: inference, gpu, model

PyTorch Foundation · open-source · 2026-06-01

How LinkedIn Uses PyTorch to Solve Extreme-Scale Optimization Problems

Score 11

TL;DR: This case study demonstrates how LinkedIn re-architected its distributed linear programming solver, DuaLip, by developing a GPU-accelerated PyTorch version to handle extreme-scale optimization challenges like web applications. This...

distributed hardware

inference serving distributed benchmark hardware model-release rag agents

High signal Matched: distributed, gpu

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

NVIDIA Technical Blog · hardware · 2026-06-01

NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories

Score 11

Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems....

hardware training agents

benchmark hardware agents

High signal Matched: gpu, pretraining, agentic

AMD ROCm Blogs · hardware · 2026-06-01

Out-of-the-Box ROLL Support on AMD GPUs: Accelerating Reinforcement Learning at Scale

Score 13

Reinforcement learning (RL) is rapidly becoming a foundational technology for Large Language Models (LLMs)—powering key abilities such as reasoning and agentic behaviors. As RL workloads grow more complex and computationally intensive, the...

High signal Matched: performance, gpu, agentic

AWS Machine Learning Blog · cloud · 2026-05-30

Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality

Score 17

This post demonstrates a comprehensive observability solution using Amazon Managed Grafana dashboards that provides a holistic view of both quality and quantity for LLMs served on Amazon SageMaker AI endpoints with inference components.

inference hardware cloud

inference serving benchmark hardware model-release research quantization evals

High signal Matched: inference, gpu, sagemaker

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

Together AI · inference-infra · 2026-05-29

How Together AI built the world’s fastest speech-to-text stack

Score 13

Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem.

inference speculative-decoding benchmark hardware model-release

High signal Matched: inference, gpu

AMD ROCm Blogs · hardware · 2026-05-29

Enabling Speculative Speculative Decoding on MI300X

Score 29

Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...

High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model

AMD ROCm Blogs · hardware · 2026-05-29

Running Variational Quantum Eigensolver with Qiskit Aer on AMD Instinct

Score 13

Quantum computing offers a fundamentally different approach to computational problems by leveraging quantum mechanical properties such as superposition and entanglement. Unlike a classical bit, which is always 0 or 1, a qubit can exist in...

kernel hardware model-release

High signal Matched: benchmark, cost, gpu

PyTorch Foundation · open-source · 2026-05-28

Why Is PyTorch Compile So Fast: Kernel Fusion

Score 15

When you use PyTorch’s compiler, your model runs faster, up to 10x faster. But what’s actually happening? Without compilation, the GPU runs a kernel, a function on the GPU, for...

inference benchmark hardware model-release agents

High signal Matched: kernel, gpu, model

PyTorch Foundation · open-source · 2026-05-28

Up to 580tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed

Score 17

TL;DR: The TokenSpeed inference engine achieved a record-breaking 580 tps running the Qwen3.5-397B-A17B model on GPUs. This extreme performance for agentic workloads is driven by systematic elimination of memory copies,...

High signal Matched: inference, performance, gpu, model, agentic

NVIDIA Technical Blog · hardware · 2026-05-27

NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

Score 17

Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to...

kernel triton hardware model-release

High signal Matched: inference, blackwell

PyTorch Foundation · open-source · 2026-05-26

TLX Block Attention: A Warp-Specialized Blackwell Kernel for Fixed-Block Sparse Self-Attention

Score 18

Code available at: https://github.com/facebookresearch/ads_model_kernel_library In this post, we present the design of TLX Block Attention — a Triton kernel targeting NVIDIA Blackwell GPUs that exploits compile-time knowledge of a block-di...

kernel cuda benchmark hardware

High signal Matched: kernel, triton, blackwell, model

NVIDIA Technical Blog · hardware · 2026-05-26

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

Score 21

Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...

kernel cuda benchmark hardware model-release

High signal Matched: cuda, performance, gpu

NVIDIA Technical Blog · hardware · 2026-05-26

NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates

Score 21

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...

High signal Matched: cuda, performance, gpu, launch

NVIDIA Technical Blog · hardware · 2026-05-26

Run Key Genomics and Protein Folding Workloads Faster with NVIDIA RTX PRO 4500 Blackwell

Score 13

Precision medicine depends on two fundamental capabilities: understanding disease at the genomic level and identifying treatments at the molecular level. ...

inference distributed hardware model-release cloud quantization evals

High signal Matched: blackwell

AMD ROCm Blogs · hardware · 2026-05-25

AI Inference on AMD Ryzen™ AI Max Processor

Score 20

Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...

High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate

NVIDIA Technical Blog · hardware · 2026-05-21

Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters

Score 12

Maximizing the value of AI infrastructure demands deep visibility into GPU utilization. Yet many platform teams running AI workloads on Kubernetes operate with...

High signal Matched: gpu

NVIDIA Technical Blog · hardware · 2026-05-21

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling

Score 16

As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on...

inference benchmark hardware model-release evals

High signal Matched: performance, gb200

Lambda · cloud · 2026-05-20

Lambda’s NVIDIA HGX B200 on STAC-AI™ LANG6

Score 18

What the numbers mean for financial services. Executive summary: Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...

benchmark hardware open-source

High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating

AMD ROCm Blogs · hardware · 2026-05-20

ROCm 7.13: Expanding Hardware, Tools, and Reach

Score 14

AMD released ROCm Core 7.13, the AMD GPU Driver 31.30, and AMD GPU Virtualization 9.0. With these releases, ROCm software expands hardware support across enterprise datacenters. The platform introduces AMD’s latest Instinct accelerators, e...

inference hardware model-release

High signal Matched: performance, gpu, rocm, open-source

PyTorch Foundation · open-source · 2026-05-19

Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate

Score 14

TL;DR: Introducing the ExecuTorch MLX Delegate. The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework. The delegate seamlessly integrates with...

kv-cache moe hardware model-release quantization agents

High signal Matched: inference, gpu, introducing

LMCache · open-source · 2026-05-13

Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X

Score 20

A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters. Key Summary: We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...

High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic

NVIDIA Technical Blog · hardware · 2026-05-11

Introducing NVIDIA Fleet Intelligence for Real-Time GPU Fleet Visibility and Optimization

Score 16

The compute capability of large GPU fleets presents unprecedented opportunities to innovate and provide value to customers in record time. Yet these...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: gpu, introducing

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kernel hardware long-context api

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

Together AI · inference-infra · 2026-05-11

Serving DeepSeek-V4: why million-token context is an inference systems problem

Score 22

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...

benchmark hardware quantization

High signal Matched: inference, serving, endpoint, kernel, b200, long-context

vLLM Project · open-source · 2026-05-11

A First Comprehensive Study of TurboQuant: Accuracy and Performance

Score 14

TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...

inference hardware model-release

High signal Matched: performance, gpu, quantization

Together AI · inference-infra · 2026-05-08

Deploy and inference any model from HuggingFace

Score 20

Learn how to deploy any Hugging Face model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.

High signal Matched: inference, gpu, release, model

NVIDIA Technical Blog · hardware · 2026-05-07

Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling

Score 14

NVIDIA GB200 NVL72 introduces a fundamentally new way to build GPU clusters by extending NVIDIA NVLink coherence across an entire rack. This design enables...

distributed benchmark hardware training

High signal Matched: gpu, gb200

NVIDIA Technical Blog · hardware · 2026-05-07

Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus

Score 20

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down,...

High signal Matched: distributed, nccl, performance, gpu, training

Modal · inference-infra · 2026-05-04

Boosting multimodal inference performance by >10% with a single Python dictionary

Score 16

If we've said it once, we've said it once per millisecond: never block the GPU.

kernel cuda hardware model-release agents

High signal Matched: inference, performance, gpu

NVIDIA Technical Blog · hardware · 2026-04-30

Automating GPU Kernel Translation with AI Agents: cuTile Python to cuTile.jl

Score 20

NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and...

High signal Matched: kernel, cuda, gpu, model, agents

NVIDIA Technical Blog · hardware · 2026-04-28

Scaling Biomolecular Modeling Using Context Parallelism in NVIDIA BioNeMo

Score 10

For decades, computational biology has operated under a reductionist compromise. To fit complex biological systems into the limited memory of a single GPU,...

High signal Matched: gpu

NVIDIA Technical Blog · hardware · 2026-04-24

Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints

Score 18

DeepSeek just launched its fourth generation of flagship models with DeepSeek-V4-Pro and DeepSeek-V4-Flash, both targeted at enabling highly efficient...

inference kv-cache benchmark hardware model-release cloud

High signal Matched: generation, gpu, blackwell

LMCache · open-source · 2026-04-23

LMCache on Amazon SageMaker HyperPod: Accelerating LLM Inference with Managed Tiered KV Cache

Score 30

Overview: Large language model (LLM) inference performance depends heavily on how efficiently the system manages the key-value (KV) cache, the stored attention states that allow the model to avoid recomputing previous tokens. As context length...

High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker

NVIDIA Technical Blog · hardware · 2026-04-22

Scaling the AI-Ready Data Center with NVIDIA RTX PRO 4500 Blackwell Server Edition and NVIDIA vGPU 20

Score 12

AI integration is redefining mainstream enterprise applications, from productivity software like Microsoft Office to more complex design and engineering tools....

High signal Matched: blackwell

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

inference serving kv-cache hardware model-release quantization long-context

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

vLLM Project · open-source · 2026-04-22

The State of FP8 KV-Cache and Attention Quantization in vLLM

Score 18

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

hardware model-release cloud

High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context

SkyPilot · open-source · 2026-04-22

GPU Compass: Navigate the GPU Frontier Across 20+ Clouds & 2K+ Offerings

Score 18

Introducing GPU Compass: One dashboard to browse, compare pricing, and launch across every GPU cloud.

High signal Matched: gpu, introducing, launch, cloud

llm-d · open-source · 2026-04-21

Production-Grade LLM Inference at Scale with KServe, llm-d, and vLLM

Score 14

How migrating from a simple vLLM deployment to a robust MLOps platform utilizing KServe, llm-d's intelligent routing, and vLLM solved significant scaling and operational challenges in LLM deployment through deep customization and prefix-ca...

High signal Matched: inference, gpu

Together AI · inference-infra · 2026-04-21

Capacity without conflict: A guide to multi-tenant GPU cluster design for AI-native teams

Score 12

Learn how AI-native companies design multi-tenant GPU clusters that pool capacity without sacrificing team isolation — and how Together AI makes it work in practice.

inference kv-cache hardware

High signal Matched: gpu

LMCache · open-source · 2026-04-16

What is TurboQuant and why it matters for LLM inference, in laymen’s term

Score 16

TL;DR: TurboQuant allows you to put 4x more context in your GPU without blowing up GPU memory or dropping AI’s intelligence. It does so by quantizing the memory of large language models, also known as KV cache, an important bottleneck ment...

kernel cuda benchmark hardware

High signal Matched: inference, kv cache, lmcache, gpu

NVIDIA Technical Blog · hardware · 2026-04-14

NVIDIA NVbandwidth: Your Essential Tool for Measuring GPU Interconnect and Memory Performance

Score 18

When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is data transfer performance. This applies to...

High signal Matched: cuda, performance, gpu

Modular · inference-infra · 2026-04-13

TileTensor Part 1 - Safer, More Efficient GPU Kernels

Score 10

High signal Matched: gpu

NVIDIA Technical Blog · hardware · 2026-04-09

Running Large-Scale GPU Workloads on Kubernetes with Slurm

Score 12

Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations...

hardware open-source

inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

High signal Matched: gpu, open source

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

NVIDIA Technical Blog · hardware · 2026-04-07

Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling

Score 12

The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18...

High signal Matched: gb200, blackwell

vLLM Project · open-source · 2026-04-07

Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation

Score 22

TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...

serving benchmark hardware model-release

High signal Matched: inference, prefill, itl, gpu, mi300x

NVIDIA Technical Blog · hardware · 2026-04-02

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

Score 14

In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

High signal Matched: throughput, gpu, model

Rebellions · hardware · 2026-04-02

NPU Server-Based Physical AI: A Water Purification Robot Solution for the United Arab Emirates (UAE)

Score 14

Summary. Challenge: In the Middle East, where the oil and gas industry is well developed, the wastewater and oil inevitably produced during crude oil production must be treated. In particular, reservoirs and...

High signal Matched: npu, rebellions

NVIDIA Technical Blog · hardware · 2026-04-01

Accelerate Token Production in AI Factories Using Unified Services and Real-Time AI

Score 12

In today’s AI factory environment, performance is not theoretical. It is economic, competitive, and existential. A 1% drop in usable GPU time can mean...

High signal Matched: performance, gpu

Together AI · inference-infra · 2026-04-01

Inside the Together AI kernels team

Score 16

The team behind FlashAttention and ThunderKittens — how Together AI's kernel researchers close the gap between GPU hardware and production AI.

High signal Matched: kernel, flashattention, gpu

NVIDIA Technical Blog · hardware · 2026-03-31

Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0

Score 10

Spatial computing is moving from visualization to active collaboration, adding increasingly more GPU demands on XR hardware to render photorealistic,...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

High signal Matched: gpu

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

Modular · inference-infra · 2026-03-30

Software Pipelining for GPU Kernels: Part 1 - The Pipeline Problem

Score 10

serving benchmark hardware model-release

High signal Matched: gpu

NVIDIA Technical Blog · hardware · 2026-03-25

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

Score 18

In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

High signal Matched: throughput, gpu, model

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

SkyPilot · open-source · 2026-03-19

Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster

Score 10

Karpathy's autoresearch runs one experiment at a time. We gave it access to our GPU infra and let it run experiments in parallel.

hardware agents

High signal Matched: gpu, agent

NVIDIA Technical Blog · hardware · 2026-03-16

Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

Score 20

NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of...

High signal Matched: inference, latency, accelerator

Modular · inference-infra · 2026-03-16

Modular at NVIDIA GTC 2026: MAX on Blackwell, Mojo Kernel Porting, and DeepSeek V3 on B200

Score 18

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: kernel, b200, blackwell

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

Together AI · inference-infra · 2026-03-10

New in Together GPU Clusters: Autoscaling, observability, and self-healing

Score 12

Together GPU Clusters now include built-in autoscaling, RBAC, full-stack observability, and self-healing node repair—giving teams production-ready GPU infrastructure that scales efficiently, stays resilient, and supports shared enterprise...

serving kernel benchmark hardware

High signal Matched: gpu

Together AI · inference-infra · 2026-03-05

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Score 20

As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...

High signal Matched: throughput, kernel, flashattention, gpu

vLLM Project · open-source · 2026-02-27

Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm

Score 20

For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.

serving moe hardware model-release cloud

High signal Matched: inference, performance, rocm

vLLM Project · open-source · 2026-02-26

Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Score 30

Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...

High signal Matched: serve, moe, mixture of experts, gpu, model, sagemaker, bedrock

SkyPilot · open-source · 2026-02-21

SkyPilot Admin Policies: Enforce GPU Governance Without Slowing Down AI Teams

Score 12

SkyPilot Admin Policies let you enforce cost controls, security rules, and compliance requirements automatically — without slowing down your engineering team.

serving moe benchmark hardware quantization

High signal Matched: cost, gpu

vLLM Project · open-source · 2026-02-13

DeepSeek-V3.2 on GB300: Performance Breakthrough

Score 22

DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

inference serving kv-cache benchmark hardware

High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization

llm-d · open-source · 2026-02-10

Native KV Cache Offloading to Any Filesystem with llm-d

Score 20

llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.

inference serving benchmark hardware

High signal Matched: inference, throughput, kv cache, ttft, gpu

vLLM Project · open-source · 2026-02-03

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Score 24

Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...

benchmark hardware model-release open-source

High signal Matched: serving, throughput, performance, h200, gb200, blackwell

vLLM Project · open-source · 2026-02-01

GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier

Score 18

TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...

inference serving benchmark hardware

High signal Matched: performance, blackwell, model, open-source, oss

Together AI · inference-infra · 2026-01-22

Optimizing inference speed and costs: Lessons learned from large-scale deployments

Score 22

Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.

High signal Matched: inference, throughput, latency, cost, gpu

Modular · inference-infra · 2026-01-14

How to Beat Unsloth's CUDA Kernel Using Mojo—With Zero GPU Experience

Score 18

kernel cuda hardware

inference benchmark hardware model-release quantization agents

High signal Matched: kernel, cuda, gpu

Together AI · inference-infra · 2026-01-13

Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale

Score 24

Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...

distributed hardware model-release training

High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents

Together AI · inference-infra · 2026-01-12

Inside multi-node training: How to scale model training across GPU clusters

Score 22

Learn how foundation models are trained at scale using multi-node GPU clusters, including distributed training techniques, infrastructure requirements, and practical steps to scale training efficiently.

inference serving benchmark hardware model-release korea

High signal Matched: distributed, multi-node, gpu, model, training, distributed training

SqueezeBits · korea · 2025-12-24

Introducing rebellions ATOM™-MAX

Score 24

Introducing ATOM™-Max, rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...

High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

inference serving hardware

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-17

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

Score 16

In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...

kernel cuda hardware model-release

High signal Matched: serving, h200

vLLM Project · open-source · 2025-12-03

Tracing Hanging and Complicated GPU Kernels Down To The Source Code

Score 16

Several months ago, we published a blog post, "CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond", introducing a powerful technique for debugging illegal memory access...

High signal Matched: cuda, gpu, introducing

Modal · inference-infra · 2025-12-02

Modal + Mistral 3: 10x faster cold starts with GPU snapshotting

Score 12

We've partnered with Mistral to bring you Day 0 support for Mistral 3, with GPU-snapshot-optimized performance.

inference kv-cache speculative-decoding moe benchmark hardware frontier-model

High signal Matched: performance, gpu

llm-d · open-source · 2025-12-02

llm-d 0.4: Achieve SOTA Performance Across Accelerators

Score 30

llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.

inference speculative-decoding hardware quantization evals open-source

High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota

Together AI · inference-infra · 2025-12-01

Together AI delivers fastest inference for the top open-source models

Score 20

Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...

High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source

Rebellions · hardware · 2025-11-20

An NPU-powered, AI-based diagnostic support service for veterinary imaging

Score 14

Challenge: With the recent growth in pet ownership, demand for X-ray imaging diagnosis is expanding rapidly. However, veterinarians in Korea who specialize in radiology number in the hundreds...

High signal Matched: npu, rebellions

Modular · inference-infra · 2025-11-20

Modular 25.7: Faster Inference, Safer GPU Programming, and a More Unified Developer Experience

Score 14

benchmark hardware model-release

High signal Matched: inference, gpu

Modal · inference-infra · 2025-11-19

How Reducto improved enterprise-scale document processing latency by 3x

Score 14

Learn how Reducto used GPU memory snapshotting and flexible autoscaling to build fast multi-model pipelines.

High signal Matched: latency, gpu, model

Modal · inference-infra · 2025-11-18

Host overhead is killing your inference efficiency

Score 12

Never block the GPU.

High signal Matched: inference, gpu

Hugging Face · open-source · 2025-11-17

Easily Build and Share ROCm Kernels with Hugging Face

Score 10

No feed summary available yet.

High signal Matched: rocm

Rebellions · hardware · 2025-11-07

vLLM Hands-on Workshop WrapUp

Score 14

New possibilities for LLM inference, experienced firsthand on Rebellions NPUs. Following the vLLM Korea Meetup in August, on October 29, hosted by Rebellions and SqueezeBits, the vLLM...

inference serving kernel benchmark hardware training

High signal Matched: npu, korea, rebellions

SqueezeBits · korea · 2025-10-28

[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Score 20

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.

High signal Matched: inference, serving, gemm, performance, h100, training

SkyPilot · open-source · 2025-10-21

Why AWS Batch Doesn't Work for Modern AI Workloads: A Technical Comparison with SkyPilot

Score 10

AWS Batch works well for traditional enterprise batch processing (see their case studies 1 and 2). But AI workloads have different requirements - they’re more interactive, need flexible GPU access, and benefit from simpler iteration...

High signal Matched: inference, gpu

Google Research · big-tech · 2025-10-15

Coral NPU: A full-stack platform for Edge AI

Score 8

Generative AI

High signal Matched: npu

Together AI · inference-infra · 2025-10-15

Announcing the Together AI Startup Accelerator, purpose-built for AI Native Apps

Score 12

We've launched the Together AI Startup Accelerator: Up to $50K credits, expert engineering hours, GTM support, community and VC access for AI-native apps in build–scale tiers.

inference moe benchmark hardware

High signal Matched: accelerator

Together AI · inference-infra · 2025-10-10

AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators

Score 20

LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.

High signal Matched: inference, deepseek-v3, performance, accelerator

llm-d · open-source · 2025-10-10

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

Score 20

llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.

High signal Matched: inference, latency, tokens/sec, gpu, tpu

Modular · inference-infra · 2025-09-19

Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA

Score 10

hardware frontier-model

High signal Matched: blackwell, sota

Modal · inference-infra · 2025-09-16

Inside Modal Notebooks: How we built a cloud GPU notebook that boots in seconds

Score 14

Exploring the internals of our new product, a modern Jupyter notebook built for fast startup and real-time collaboration.

High signal Matched: gpu, cloud

SkyPilot · open-source · 2025-09-12

Unlocking GPU Metrics in Kubernetes with SkyPilot

Score 10

SkyPilot now supports detailed GPU metrics across multiple Kubernetes clusters in the dashboard for better observability.

benchmark hardware frontier-model

High signal Matched: gpu

Modular · inference-infra · 2025-09-12

Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance

Score 14

inference hardware training

High signal Matched: performance, blackwell, sota

Together AI · inference-infra · 2025-09-09

Announcing General Availability of Together Instant Clusters, offering ready to use, self-service NVIDIA GPUs

Score 18

Together AI launches Instant Clusters: self-service GPU clusters with NVIDIA H100/B200, ready in minutes for training or inference at any scale.

High signal Matched: inference, gpu, h100, b200, training

Modular · inference-infra · 2025-09-05

Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

Score 14

High signal Matched: matmul, blackwell

Modular · inference-infra · 2025-08-28

Matrix Multiplication on Blackwell: Part 1 - Introduction

Score 10

High signal Matched: blackwell

SqueezeBits · korea · 2025-08-26

Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode

Score 22

In this article, we show how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.

High signal Matched: inference, prefill, gpu, npu

Rebellions · hardware · 2025-08-21

Achieving prevention-focused site safety management for construction and plant projects with AI

Score 14

A case study of Kolon Benit, which developed "AI Vision Intelligence", a differentiated AI-based safety monitoring system that overcomes the constraints of existing systems through a multimodal approach combining vision and language models and a hybrid infrastructure combining GPUs and NPUs.

High signal Matched: gpu, npu, rebellions

SkyPilot · open-source · 2025-08-21

From SLURM to SkyPilot: How Avataar cut costs 11x with multi-cloud AI infrastructure

Score 12

Avataar's enterprise AI content platform cut costs 11x and unlocked GPU capacity by migrating from inflexible SLURM deployment to SkyPilot's multi-cloud infrastructure.

High signal Matched: gpu, cloud

Hugging Face · open-source · 2025-08-18

From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels

Score 14

No feed summary available yet.

kernel cuda hardware

High signal Matched: cuda, gpu

Modal · inference-infra · 2025-08-11

Product updates: GPU memory snapshots, notebooks, service tokens, and more

Score 10

Welcome to another round of Modal Product Updates! Here's what's new this month.

distributed hardware training

High signal Matched: gpu

Hugging Face · open-source · 2025-08-08

Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training

Score 14

No feed summary available yet.

inference serving benchmark hardware model-release cloud

High signal Matched: multi-gpu, gpu, training

AIBrix · open-source · 2025-08-05

AIBrix v0.4.0 Release: P/D Disaggregation and Expert Parallelism Support, KVCache v1 Connector, KV Event Synchronization & Multi‑Engine Support

Score 20

AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...

High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud

Modal · inference-infra · 2025-07-30

GPU Memory Snapshots: Supercharging sub-second startup

Score 10

Using GPU snapshots to enable sub-second container startup times.

inference hardware open-source

High signal Matched: gpu

Together AI · inference-infra · 2025-07-17

Together AI Delivers Top Speeds for DeepSeek-R1-0528 Inference on NVIDIA Blackwell

Score 18

Together AI inference is now among the world’s fastest, most capable platforms for running open-source reasoning models like DeepSeek-R1 at scale, thanks to our new inference engine designed for NVIDIA HGX B200.

distributed benchmark hardware cloud

High signal Matched: inference, b200, blackwell, open-source

SkyPilot · open-source · 2025-07-16

The Evolution of AI Job Orchestration. Part 2: The AI-Native Control Plane & Orchestration that Finally Works for ML

Score 16

This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...

distributed hardware model-release training

High signal Matched: infiniband, performance, cost, gpu, cloud

Modal · inference-infra · 2025-07-11

Product updates: Multi-node training clusters, B200 and H200s, and Client 1.0 release

Score 18

Welcome to another round of Modal Product Updates! Here's what's new this month.

High signal Matched: multi-node, b200, release, training

SkyPilot · open-source · 2025-07-08

The Evolution of AI Job Orchestration. Part 1: Running AI jobs on GPU Neoclouds

Score 12

If you’re an infrastructure or MLOps engineer at a large company, you know the drill. The ML team comes to you with requirements that change weekly. They need GPUs yesterday, but the budget was set six months ago. They want to use th...

High signal Matched: cost, gpu

Hugging Face · open-source · 2025-06-03

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

Score 10

No feed summary available yet.

High signal Matched: gpu

Modal · inference-infra · 2025-05-30

Introducing: B200s and H200s on Modal

Score 18

We’re excited to be making Nvidia B200 and H200 GPUs available on Modal starting today!

High signal Matched: h200, b200, introducing

Modular · inference-infra · 2025-05-29

Modverse #48: Modular Platform 25.3, MAX AI Kernels, and the Modular GPU Kernel Hackathon

Score 14

High signal Matched: kernel, gpu

Modular · inference-infra · 2025-05-20

Modular GPU Kernel Hackathon Highlights: Innovation, Community, & Mojo🔥

Score 14

High signal Matched: kernel, gpu

Replicate · inference-infra · 2025-05-16

NVIDIA H100 GPUs are here

Score 12

NVIDIA H100 GPUs are here, with better performance and lower cost.

High signal Matched: performance, cost, h100

Together AI · inference-infra · 2025-04-24

Salesforce, Zoom, InVideo Train Faster with Together AI Turbocharged with NVIDIA Blackwell

Score 12

No feed summary available yet.

High signal Matched: blackwell

Modular · inference-infra · 2025-04-17

Modverse #47: MAX 25.2 and an evening of GPU programming at Modular HQ

Score 10

High signal Matched: gpu

Modular · inference-infra · 2025-03-25

MAX 25.2: Unleash the power of your H200's–without CUDA!

Score 14

kernel cuda hardware

High signal Matched: cuda, h200

Modal · inference-infra · 2025-02-24

'I paid for the whole GPU, I am going to use the whole GPU': A high-level guide to GPU utilization

Score 12

A guide to maximizing the utilization of GPUs, from cloud allocations to FLOP/s.

High signal Matched: gpu, cloud

Modal · inference-infra · 2025-02-21

We open sourced the GPU Glossary

Score 10

GPU documentation for the people, now by the people.

inference serving distributed kv-cache benchmark hardware model-release agents

High signal Matched: gpu

AIBrix · open-source · 2025-02-19

AIBrix v0.2.0 Release: Distributed KV Cache, Orchestration and Heterogeneous GPU Support

Score 42

We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...

benchmark hardware model-release quantization evals

High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent

SqueezeBits · korea · 2025-01-13

[Intel Gaudi] #4. FP8 Quantization

Score 20

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware research evals

High signal Matched: performance, accelerator, fp8, quantization, evaluate

SqueezeBits · korea · 2025-01-06

[Intel Gaudi] #3. Performance Evaluation with SynapseAI v1.19

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

High signal Matched: performance, accelerator, evaluation, evaluate

Hugging Face · open-source · 2024-12-24

Visualize and understand GPU memory in PyTorch

Score 10

No feed summary available yet.

High signal Matched: gpu

Modular · inference-infra · 2024-12-17

Introducing MAX 24.6: A GPU Native Generative AI Platform

Score 14

serving benchmark hardware frontier-model

High signal Matched: gpu, introducing

Modular · inference-infra · 2024-12-17

MAX GPU: State of the Art Throughput on a New GenAI platform

Score 14

benchmark hardware research evals

High signal Matched: throughput, gpu, state of the art

SqueezeBits · korea · 2024-12-03

[Intel Gaudi] #2. Graph Compiler and Overall Performance Evaluation

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

High signal Matched: performance, accelerator, evaluation, evaluate

SqueezeBits · korea · 2024-11-21

[Intel Gaudi] #1. Introduction

Score 12

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware evals

inference kv-cache benchmark hardware model-release cloud open-source

High signal Matched: performance, accelerator, evaluate

AIBrix · open-source · 2024-11-13

Introducing AIBrix v0.1.0: Building the Future of Scalable, Cost-Effective AI Infrastructure for Large Models

Score 32

In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...

High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source

Modal · inference-infra · 2024-10-25

The future of AI needs more flexible GPU capacity

Score 10

Why Modal is obsessed with serverless AI infrastructure

High signal Matched: gpu

Modal · inference-infra · 2024-08-06

GPU prices are falling...

Score 10

...and we're passing the savings to you. 15-30% price cuts on GPUs and CPUs.

High signal Matched: gpu

Modal · inference-infra · 2024-06-20

Run GPU jobs from Airflow with Modal

Score 10

Isolate your tasks with Modal containers while using Airflow for orchestration.

High signal Matched: gpu

Replicate · inference-infra · 2024-06-12

H100s are coming to Replicate

Score 8

We'll soon support NVIDIA's H100 GPUs for predictions and training. Let us know if you want early access.

hardware training

High signal Matched: h100, training

Hugging Face · open-source · 2024-05-21

Hugging Face on AMD Instinct MI300 GPU

Score 10

No feed summary available yet.

High signal Matched: gpu

SqueezeBits · korea · 2024-04-23

Are you getting everything out of your GPUs?

Score 12

The Blackwell GPU from GTC 2024 was astonishing. Analysis of the Nvidia GPU evolution and what it means for GPU users.

High signal Matched: gpu, blackwell

Hugging Face · open-source · 2024-04-02

Bringing serverless GPU inference to Hugging Face users

Score 14

No feed summary available yet.

High signal Matched: inference, gpu

Hugging Face · open-source · 2024-03-18

Easily Train Models with H100 GPUs on NVIDIA DGX Cloud

Score 14

No feed summary available yet.

High signal Matched: h100, cloud

Hugging Face · open-source · 2024-02-29

Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator

Score 14

No feed summary available yet.

High signal Matched: generation, accelerator

Modal · inference-infra · 2024-02-06

Introducing: H100s on Modal

Score 14

We’re excited to be making Nvidia H100 GPUs available on Modal starting today!

inference serving moe benchmark hardware model-release

High signal Matched: h100, introducing

SkyPilot · open-source · 2023-12-21

Scaling Mixtral LLM Serving with High GPU Availability and Cost Efficiency

Score 24

A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.

High signal Matched: serving, mixtral, cost, gpu, model

Hugging Face · open-source · 2023-12-05

AMD + 🤗: Large Language Models Out-of-the-Box Acceleration with AMD GPU

Score 10

No feed summary available yet.

High signal Matched: gpu

Hugging Face · open-source · 2023-10-03

🧨 Accelerating Stable Diffusion XL Inference with JAX on Cloud TPU v5e

Score 18

No feed summary available yet.

inference hardware cloud

High signal Matched: inference, tpu, cloud

Hugging Face · open-source · 2023-06-13

Hugging Face and AMD partner on accelerating state-of-the-art models for CPU and GPU platforms

Score 10

No feed summary available yet.

High signal Matched: gpu

SkyPilot · open-source · 2023-05-30

SkyPilot 0.3: LLM support and unprecedented GPU availability across more clouds

Score 10

Announcing SkyPilot 0.3: LLM support, new clouds, and enhanced production readiness.

High signal Matched: gpu

Hugging Face · open-source · 2023-05-15

Run a Chatgpt-like Chatbot on a Single GPU with ROCm

Score 14

No feed summary available yet.

High signal Matched: gpu, rocm

Hugging Face · open-source · 2023-03-28

Fast Inference on Large Language Models: BLOOMZ on Habana Gaudi2 Accelerator

Score 14

No feed summary available yet.

hardware training fine-tuning

High signal Matched: inference, accelerator

Hugging Face · open-source · 2023-03-09

Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU

Score 10

No feed summary available yet.

High signal Matched: gpu, rlhf, fine-tuning

Replicate · inference-infra · 2022-08-31

Run Stable Diffusion on your M1 Mac’s GPU

Score 8

How to run Stable Diffusion locally so you can hack on it

hardware training agents open-source

High signal Matched: gpu

Lambda · cloud · 2026-04-30

Creating highly efficient agents: 450M tool-calling tokens distilled for post-training from top open-source models

Score 4

Harnesses: If you've used Claude Code or Codex, you've used a harness. A harness is the infrastructure layer that wraps an AI coding agent and decides how it operates, what it can touch, and how you measure whether it worked. It's how most...