kernel - MLSys Blogs

Nebius · cloud · 2026-06-03

[Security Advisory] CVE-2026-43284, CVE-2026-43500: “DirtyFrag” Linux kernel local privilege escalation — mitigation required

Score 9

No feed summary available yet.

kernel

High signal Matched: kernel

LightSeek Foundation · research · 2026-06-03

TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads

Score 17

No feed summary available yet.

inference kernel benchmark agents

High signal Matched: inference, kernel, performance, agentic

PyTorch Foundation · open-source · 2026-05-28

Why Is PyTorch Compile So Fast: Kernel Fusion

Score 15

When you use PyTorch’s compiler, your model runs faster, up to 10x faster. But what’s actually happening? Without compilation, the GPU runs a kernel, a function on the GPU, for...

kernel hardware model-release

High signal Matched: kernel, gpu, model

AMD ROCm Blogs · hardware · 2026-05-27

Deep Dive Into 4-Wave Interleave FP8 GEMM

Score 17

Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens: 4-wave interleave, which further improves th...

kernel benchmark model-release quantization

High signal Matched: gemm, performance, fp8

PyTorch Foundation · open-source · 2026-05-26

TLX Block Attention: A Warp-Specialized Blackwell Kernel for Fixed-Block Sparse Self-Attention

Score 18

Code available at: https://github.com/facebookresearch/ads_model_kernel_library. In this post, we present the design of TLX Block Attention — a Triton kernel targeting NVIDIA Blackwell GPUs that exploits compile-time knowledge of a block-di...

kernel triton hardware model-release

High signal Matched: kernel, triton, blackwell, model

NVIDIA Technical Blog · hardware · 2026-05-26

Extract More Kernel Performance with NVIDIA CompileIQ Auto-Tuning

Score 17

NVIDIA CompileIQ tackles one of the hardest problems in performance engineering: finding the compiler options that unlock the best performance for a specific...

kernel benchmark

High signal Matched: kernel, performance

NVIDIA Technical Blog · hardware · 2026-05-26

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

Score 21

Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...

kernel cuda benchmark hardware

High signal Matched: cuda, performance, gpu

NVIDIA Technical Blog · hardware · 2026-05-26

NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates

Score 21

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...

kernel cuda benchmark hardware model-release

High signal Matched: cuda, performance, gpu, launch

AMD ROCm Blogs · hardware · 2026-05-22

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Score 30

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...

inference serving kernel triton benchmark model-release cloud open-source

High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source

AMD ROCm Blogs · hardware · 2026-05-22

From Naive to Near-Peak: Building High-Performance GEMM Kernels with Gluon

Score 18

On a single MI355, our most-optimized FP16 GEMM kernel runs at 99% MFMA efficiency — the matrix engine sits idle for a handful of cycles per loop. Getting there took ten versions, a regression along the way, and a profiler open for the who...

kernel benchmark

High signal Matched: kernel, gemm, performance

PyTorch Foundation · open-source · 2026-05-19

vLLM and PyTorch Work Together to Improve the Developer Experience on aarch64

Score 8

TLDR: PyTorch 2.11 makes it possible to install CUDA-enabled PyTorch wheels on aarch64 Linux directly from PyPI, eliminating the need for custom package indexes and workarounds that previously complicated deployment...

kernel cuda

High signal Matched: cuda

PyTorch Foundation · open-source · 2026-05-14

PyTorch 2.12 Release Blog

Score 12

We are excited to announce the release of PyTorch® 2.12 (release notes)! The PyTorch 2.12 release features the following changes: Batched linalg.eigh on CUDA is up to 100x faster due...

kernel cuda model-release

High signal Matched: cuda, release

Cloudflare Blog · cloud · 2026-05-12

When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

Score 8

We investigated a bug where CUBIC's congestion window became pinned at its minimum floor, causing performance to plummet. The fix involved correctly measuring idle periods to distinguish RTT wait times from actual application idleness.

kernel benchmark

High signal Matched: kernel, performance

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

Together AI · inference-infra · 2026-05-11

Serving DeepSeek-V4: why million-token context is an inference systems problem

Score 22

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...

inference serving kernel hardware long-context api

High signal Matched: inference, serving, endpoint, kernel, b200, long-context

NVIDIA Technical Blog · hardware · 2026-04-30

Automating GPU Kernel Translation with AI Agents: cuTile Python to cuTile.jl

Score 20

NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and...

kernel cuda hardware model-release agents

High signal Matched: kernel, cuda, gpu, model, agents

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

NVIDIA Technical Blog · hardware · 2026-04-14

NVIDIA NVbandwidth: Your Essential Tool for Measuring GPU Interconnect and Memory Performance

Score 18

When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is data transfer performance. This applies to...

kernel cuda benchmark hardware

High signal Matched: cuda, performance, gpu

SkyPilot · open-source · 2026-04-09

Research-Driven Agents: What Happens When Your Agent Reads Before It Codes

Score 16

Coding agents working from code alone generate shallow hypotheses. Adding a research phase — arxiv papers, competing forks, other backends — produced 5 kernel fusions that made llama.cpp CPU inference 15% faster.

inference kernel research agents

High signal Matched: inference, kernel, arxiv, research, agent, agents

NVIDIA Technical Blog · hardware · 2026-04-01

CUDA Tile Programming Now Available for BASIC!

Score 12

Note: CUDA Tile Programming in BASIC is an April Fools’ joke, but it's also real and actually works, demonstrating the flexibility of CUDA. CUDA 13.1...

kernel cuda

High signal Matched: cuda

Together AI · inference-infra · 2026-04-01

Inside the Together AI kernels team

Score 16

The team behind FlashAttention and ThunderKittens — how Together AI's kernel researchers close the gap between GPU hardware and production AI.

kernel hardware

High signal Matched: kernel, flashattention, gpu

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Modular · inference-infra · 2026-03-16

Modular at NVIDIA GTC 2026: MAX on Blackwell, Mojo Kernel Porting, and DeepSeek V3 on B200

Score 18

kernel hardware

High signal Matched: kernel, b200, blackwell

Together AI · inference-infra · 2026-03-05

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Score 20

As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...

serving kernel benchmark hardware

High signal Matched: throughput, kernel, flashattention, gpu

Together AI · inference-infra · 2026-03-05

Key research and product announcements at the AI Native Conf

Score 18

At AI Native Conf, Together AI announced breakthroughs across kernels, RL, and inference optimization — including FlashAttention-4, ThunderAgent, and together.compile. Research that ships to production. That's the AI Native Cloud.

inference kernel research cloud

High signal Matched: inference, flashattention, research, cloud

vLLM Project · open-source · 2026-03-04

vLLM Triton Attention Backend Deep Dive

Score 14

This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....

kernel triton research

High signal Matched: triton, research

Hugging Face · open-source · 2026-01-28

We Got Claude to Build CUDA Kernels and teach open models!

Score 10

No feed summary available yet.

kernel cuda

High signal Matched: cuda

Modular · inference-infra · 2026-01-14

How to Beat Unsloth's CUDA Kernel Using Mojo—With Zero GPU Experience

Score 18

kernel cuda hardware

High signal Matched: kernel, cuda, gpu

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4—a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture—LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-03

Tracing Hanging and Complicated GPU Kernels Down To The Source Code

Score 16

Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...

kernel cuda hardware model-release

High signal Matched: cuda, gpu, introducing

SqueezeBits · korea · 2025-10-28

[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Score 20

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.

inference serving kernel benchmark hardware training

High signal Matched: inference, serving, gemm, performance, h100, training

Modular · inference-infra · 2025-09-05

Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

Score 14

kernel hardware

High signal Matched: matmul, blackwell

Hugging Face · open-source · 2025-08-18

From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels

Score 14

No feed summary available yet.

kernel cuda hardware

High signal Matched: cuda, gpu

Hugging Face · open-source · 2025-06-12

Learn the Hugging Face Kernel Hub in 5 Minutes

Score 10

No feed summary available yet.

kernel

High signal Matched: kernel

Modular · inference-infra · 2025-05-29

Modverse #48: Modular Platform 25.3, MAX AI Kernels, and the Modular GPU Kernel Hackathon

Score 14

kernel hardware

High signal Matched: kernel, gpu

Modular · inference-infra · 2025-05-20

Modular GPU Kernel Hackathon Highlights: Innovation, Community, & Mojo🔥

Score 14

kernel hardware

High signal Matched: kernel, gpu

Nota AI · korea · 2025-04-08

UniForm: A Reuse Attention Mechanism for Efficient Transformers on Resource-Constrained Edge Devices

Score 24

Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...

inference kernel benchmark model-release research evals

High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks

Modular · inference-infra · 2025-03-26

What about Triton and Python eDSLs? (Democratizing AI Compute, Part 7)

Score 10

kernel triton

High signal Matched: triton

Modular · inference-infra · 2025-03-25

MAX 25.2: Unleash the power of your H200's–without CUDA!

Score 14

kernel cuda hardware

High signal Matched: cuda, h200

Modular · inference-infra · 2025-03-05

What about OpenCL and CUDA C++ alternatives? (Democratizing AI Compute, Part 5)

Score 10

kernel cuda

High signal Matched: cuda

Modular · inference-infra · 2025-02-20

CUDA is the incumbent, but is it any good? (Democratizing AI Compute, Part 4)

Score 10

kernel cuda

High signal Matched: cuda

Modular · inference-infra · 2025-02-12

How did CUDA succeed? (Democratizing AI Compute, Part 3)

Score 10

kernel cuda

High signal Matched: cuda

Modular · inference-infra · 2025-02-05

What exactly is “CUDA”? (Democratizing AI Compute, Part 2)

Score 10

kernel cuda

High signal Matched: cuda

Cloudflare Blog · cloud · 2026-05-07

How Cloudflare responded to the “Copy Fail” Linux vulnerability

Score 4

When a critical Linux kernel privilege escalation was publicly disclosed, Cloudflare's security and engineering teams detected, investigated, and mitigated the threat across our global fleet, confirming zero customer impact and no maliciou...

kernel

Watchlist Matched: kernel

BAIR · research · 2025-03-25

Scaling Up Reinforcement Learning for Traffic Smoothing: A 100-AV Highway Deployment

Score 6

We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and...

serving kernel benchmark model-release research training agents

Watchlist Matched: throughput, kernel, performance, model, paper, training, agent, agents