MLSys Radar

hardware

Vast.ai · cloud · 2026-06-03

GPU Cloud

Score 13

No feed summary available yet.

hardware cloud

Open

High signal Matched: gpu, cloud

CoreWeave · cloud · 2026-06-03

GPU compute

Score 11

No feed summary available yet.

hardware

Open

High signal Matched: gpu

Crusoe · cloud · 2026-06-03

NVIDIA GB200

Score 9

No feed summary available yet.

hardware

Open

High signal Matched: gb200

Crusoe · cloud · 2026-06-03

NVIDIA B200

Score 9

No feed summary available yet.

hardware

Open

High signal Matched: b200

Crusoe · cloud · 2026-06-03

NVIDIA H200

Score 9

No feed summary available yet.

hardware

Open

High signal Matched: h200

Crusoe · cloud · 2026-06-03

NVIDIA H100

Score 9

No feed summary available yet.

hardware

Open

High signal Matched: h100

Crusoe · cloud · 2026-06-03

AMD MI300X

Score 9

No feed summary available yet.

hardware

Open

High signal Matched: mi300x

Lambda · cloud · 2026-06-03

Introducing workspaces for Lambda Cloud

Score 17

Lambda workspaces help teams organize cloud resources, control access, and separate dev, staging, and production in shared GPU environments. A junior researcher kills a production training run. A contractor sees weights they shouldn't. If...

hardware model-release cloud training

Open

High signal Matched: gpu, introducing, weights, cloud, training

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

inference serving distributed benchmark hardware model-release rag agents

Open

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

  Jaehoon Lee Technical Content Manager, Nota AI   When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device.T...

inference serving benchmark hardware model-release research quantization evals

Open

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

AMD ROCm Blogs · hardware · 2026-05-29

Enabling Speculative Speculative Decoding on MI300X

Score 29

Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...

inference speculative-decoding benchmark hardware model-release

Open

High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model

AMD ROCm Blogs · hardware · 2026-05-25

AI Inference on AMD Ryzen™ AI Max Processor

Score 20

Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...

inference distributed hardware model-release cloud quantization evals

Open

High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate

Lambda · cloud · 2026-05-20

Lambda’s NVIDIA HGX B200 on STAC-AI™ LANG6

Score 18

What the numbers mean for financial services Executive summary Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...

inference benchmark hardware model-release evals

Open

High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating

AMD ROCm Blogs · hardware · 2026-05-20

ROCm 7.13: Expanding Hardware, Tools, and Reach

Score 14

AMD released ROCm Core 7.13, the AMD GPU Driver 31.30, and AMD GPU Virtualization 9.0. With these releases, ROCm software expands hardware support across enterprise datacenters. The platform introduces AMD’s latest Instinct accelerators, e...

benchmark hardware open-source

Open

High signal Matched: performance, gpu, rocm, open-source

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

  Jaehoon Lee Technical Content Manager, Nota AI   NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment.When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

Open

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

Together AI · inference-infra · 2026-05-11

Serving DeepSeek-V4: why million-token context is an inference systems problem

Score 22

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...

inference serving kernel hardware long-context api

Open

High signal Matched: inference, serving, endpoint, kernel, b200, long-context

Together AI · inference-infra · 2026-05-08

Deploy and inference any model from HuggingFace

Score 20

Learn how to deploy any Hugging Face model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.

inference hardware model-release

Open

High signal Matched: inference, gpu, release, model

LMCache · open-source · 2026-04-23

LMCache on Amazon SageMaker HyperPod: Accelerating LLM Inference with Managed Tiered KV Cache

Score 30

Overview Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context length...

inference kv-cache benchmark hardware model-release cloud

Open

High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

  Jaehoon Lee Technical Content Manager, Nota AI   Series Notice: NetsPresso® Technical Blog, Part 2In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

Open

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

  Jaehoon Lee Technical Content Manager, Nota AI   AI Model Optimization: Why Models Won't Run on HardwareThe Chip Is Ready, but the Model Won't DeployIf you have ever tried deploying an AI model onto your own chip, the following...

inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

Open

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

Rebellions · hardware · 2026-04-02

NPU 서버 기반 피지컬 AI, 아랍에미리트(UAE) 수질 정화 로봇 솔루션

Score 14

Summary Challenge 석유 및 가스 산업이 발달한 중동 지역에서는 원유 생산 과정에서 불가피하게 발생하는 폐수와 기름을 처리해야 합니다. 특히, 저수지와... The post NPU 서버 기반 피지컬 AI, 아랍에미리트(UAE) 수질 정화 로봇 솔루션 appeared first on Rebellions.

hardware korea

Open

High signal Matched: npu, rebellions

Together AI · inference-infra · 2026-04-01

Inside the Together AI kernels team

Score 16

The team behind FlashAttention and ThunderKittens — how Together AI's kernel researchers close the gap between GPU hardware and production AI.

kernel hardware

Open

High signal Matched: kernel, flashattention, gpu

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

  Jaehoon Lee Technical Content Manager, Nota AI   In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

Open

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

  Jaehoon Lee Technical Content Manager, Nota AI   GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

Open

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

  Hancheol Park, Ph. D. AI Research Engineer, Nota AI Tairen PiaoAI Research Engineer, Nota AI Tae-Ho KimCTO & Co-Founder, Nota AI ✔️ Resource : The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

Open

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

Together AI · inference-infra · 2026-01-13

Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale

Score 24

Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...

inference benchmark hardware model-release quantization agents

Open

High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents

SqueezeBits · korea · 2025-12-24

Introducing rebellions ATOM™-MAX

Score 24

Introducing ATOM™-Max, rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...

inference serving benchmark hardware model-release korea

Open

High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

  Seungmin YangEdgeFM Lead, Nota AI On this page ▾ SummaryWith the introduction of NVFP4—a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture—LLM inference achieves markedly improved efficiency.Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

Open

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

Together AI · inference-infra · 2025-12-01

Together AI delivers fastest inference for the top open-source models

Score 20

Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...

inference speculative-decoding hardware quantization evals open-source

Open

High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source

Rebellions · hardware · 2025-11-20

NPU로 구동되는 AI 기반 동물 영상 진단 보조 서비스

Score 14

Summary Challenge 최근 반려동물 양육 인구의 증가로 X-ray 영상 진단 수요가 빠르게 확대되고 있습니다. 그러나 국내 영상의학 전공 수의사는 수백... The post NPU로 구동되는 AI 기반 동물 영상 진단 보조 서비스 appeared first on Rebellions.

hardware korea

Open

High signal Matched: npu, rebellions

Rebellions · hardware · 2025-11-07

vLLM Hands-on Workshop WrapUp

Score 14

리벨리온 NPU에서 직접 경험한 LLM 추론의 새로운 가능성 지난 8월 vLLM Korea Meetup에 이어, 10월 29일 리벨리온과 스퀴즈비츠 주관으로 vLLM... The post vLLM Hands-on Workshop WrapUp appeared first on Rebellions.

hardware korea

Open

High signal Matched: npu, korea, rebellions

Rebellions · hardware · 2025-08-21

AI로 예방 중심의 건설 & 플랜트 프로젝트 현장 안전 관리 실현

Score 14

비전 모델과 언어 모델을 결합한 멀티모달, GPU와 NPU를 결합한 하이브리드 인프라로 기존 시스템의 제약을 극복하는 차별화된 AI 기반 안전 관제 시스템, ‘AI 비전 인텔리전스'를 개발한 코오롱베니트의 사례 The post AI로 예방 중심의 건설 & 플랜트 프로젝트 현장 안전 관리 실현 appeared first on Rebellions.

hardware korea

Open

High signal Matched: gpu, npu, rebellions

AIBrix · open-source · 2025-08-05

AIBrix v0.4.0 Release: P/D Disaggregation and Expert Parallelism Support, KVCache v1 Connector, KV Event Synchronization & Multi‑Engine Support

Score 20

AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...

inference serving benchmark hardware model-release cloud

Open

High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud

SkyPilot · open-source · 2025-07-16

The Evolution of AI Job Orchestration. Part 2: The AI-Native Control Plane & Orchestration that Finally Works for ML

Score 16

This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...

distributed benchmark hardware cloud

Open

High signal Matched: infiniband, performance, cost, gpu, cloud

AIBrix · open-source · 2025-02-19

AIBrix v0.2.0 Release: Distributed KV Cache, Orchestration and Heterogeneous GPU Support

Score 42

We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...

inference serving distributed kv-cache benchmark hardware model-release agents

Open

High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent

SqueezeBits · korea · 2024-11-21

[Intel Gaudi] #1. Introduction

Score 12

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware evals

Open

High signal Matched: performance, accelerator, evaluate

AIBrix · open-source · 2024-11-13

Introducing AIBrix v0.1.0: Building the Future of Scalable, Cost-Effective AI Infrastructure for Large Models

Score 32

In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...

inference kv-cache benchmark hardware model-release cloud open-source

Open

High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source

Modal · inference-infra · 2024-08-06

GPU prices are falling...

Score 10

...and we're passing the savings to you. 15-30% price cuts on GPUs and CPUs.

hardware

Open

High signal Matched: gpu

Replicate · inference-infra · 2024-06-12

H100s are coming to Replicate

Score 8

We'll soon support NVIDIA's H100 GPUs for predictions and training. Let us know if you want early access.

hardware training

Open

High signal Matched: h100, training

Lambda · cloud · 2026-04-30

Creating highly efficient agents: 450M tool-calling tokens distilled for post-training from top open-source models

Score 4

Harnesses If you've used Claude Code or Codex, you've used a harness. A harness is the infrastructure layer that wraps an AI coding agent and decides how it operates, what it can touch, and how you measure whether it worked. It's how most...

hardware training agents open-source

Open

Watchlist Matched: gpu, training, post-training, agent, agents, open-source