Nebius · cloud · 2026-06-03
[Security Advisory] CVE-2026-43284, CVE-2026-43500: “DirtyFrag” Linux kernel local privilege escalation — mitigation required
No feed summary available yet.
High signal Matched: kernel
Nebius · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: kernel
LightSeek Foundation · research · 2026-06-03
No feed summary available yet.
High signal Matched: inference, kernel, performance, agentic
PyTorch Foundation · open-source · 2026-05-28
When you use PyTorch’s compiler, your model runs faster, up to 10x faster. But what’s actually happening? Without compilation, the GPU runs a kernel, a function on the GPU, for...
High signal Matched: kernel, gpu, model
AMD ROCm Blogs · hardware · 2026-05-27
Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves th...
High signal Matched: gemm, performance, fp8
PyTorch Foundation · open-source · 2026-05-26
Code available at: https://github.com/facebookresearch/ads_model_kernel_library In this post, we present the design of TLX Block Attention — a Triton kernel targeting NVIDIA Blackwell GPUs that exploits compile-time knowledge of a block-di...
High signal Matched: kernel, triton, blackwell, model
NVIDIA Technical Blog · hardware · 2026-05-26
NVIDIA CompileIQ tackles one of the hardest problems in performance engineering: finding the compiler options that unlock the best performance for a specific...
High signal Matched: kernel, performance
NVIDIA Technical Blog · hardware · 2026-05-26
Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...
High signal Matched: cuda, performance, gpu
NVIDIA Technical Blog · hardware · 2026-05-26
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...
High signal Matched: cuda, performance, gpu, launch
AMD ROCm Blogs · hardware · 2026-05-22
Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...
High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source
AMD ROCm Blogs · hardware · 2026-05-22
On a single MI355, our most-optimized FP16 GEMM kernel runs at 99% MFMA efficiency — the matrix engine sits idle for a handful of cycles per loop. Getting there took ten versions, a regression along the way, and a profiler open for the who...
High signal Matched: kernel, gemm, performance
PyTorch Foundation · open-source · 2026-05-19
TLDR: PyTorch 2.11 makes it possible to install CUDA-enabled PyTorch wheels on aarch64 Linux directly from PyPI, eliminating the need for custom package indexes and workarounds that previously complicated deployment...
High signal Matched: cuda
PyTorch Foundation · open-source · 2026-05-14
We are excited to announce the release of PyTorch® 2.12 (release notes)! The PyTorch 2.12 release features the following changes: Batched linalg.eigh on CUDA is up to 100x faster due...
High signal Matched: cuda, release
Cloudflare Blog · cloud · 2026-05-12
We investigated a bug where CUBIC's congestion window became pinned at its minimum floor, causing performance to plummet. The fix involved correctly measuring idle periods to distinguish RTT wait times from actual application idleness.
High signal Matched: kernel, performance
Nota AI · korea · 2026-05-11
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
Together AI · inference-infra · 2026-05-11
DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...
High signal Matched: inference, serving, endpoint, kernel, b200, long-context
NVIDIA Technical Blog · hardware · 2026-04-30
NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and...
High signal Matched: kernel, cuda, gpu, model, agents
Nota AI · korea · 2026-04-22
Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...
High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source
NVIDIA Technical Blog · hardware · 2026-04-14
When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is data transfer performance. This applies to...
High signal Matched: cuda, performance, gpu
SkyPilot · open-source · 2026-04-09
Coding agents working from code alone generate shallow hypotheses. Adding a research phase — arxiv papers, competing forks, other backends — produced 5 kernel fusions that made llama.cpp CPU inference 15% faster.
High signal Matched: inference, kernel, arxiv, research, agent, agents
NVIDIA Technical Blog · hardware · 2026-04-01
Note: CUDA Tile Programming in BASIC is an April Fools’ joke, but it's also real and actually works, demonstrating the flexibility of CUDA. CUDA 13.1...
High signal Matched: cuda
Together AI · inference-infra · 2026-04-01
The team behind FlashAttention and ThunderKittens — how Together AI's kernel researchers close the gap between GPU hardware and production AI.
High signal Matched: kernel, flashattention, gpu
Nota AI · korea · 2026-03-23
Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...
High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source
Modular · inference-infra · 2026-03-16
Modular at NVIDIA GTC 2026: MAX on Blackwell, Mojo Kernel Porting, and DeepSeek V3 on B200
High signal Matched: kernel, b200, blackwell
Together AI · inference-infra · 2026-03-05
As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...
High signal Matched: throughput, kernel, flashattention, gpu
Together AI · inference-infra · 2026-03-05
At AI Native Conf, Together AI announced breakthroughs across kernels, RL, and inference optimization — including FlashAttention-4, ThunderAgent, and together.compile. Research that ships to production. That's the AI Native Cloud.
High signal Matched: inference, flashattention, research, cloud
vLLM Project · open-source · 2026-03-04
This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....
High signal Matched: triton, research
Hugging Face · open-source · 2026-01-28
No feed summary available yet.
High signal Matched: cuda
Modular · inference-infra · 2026-01-14
How to Beat Unsloth's CUDA Kernel Using Mojo—With Zero GPU Experience
High signal Matched: kernel, cuda, gpu
Nota AI · korea · 2025-12-19
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
vLLM Project · open-source · 2025-12-03
Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...
High signal Matched: cuda, gpu, introducing
SqueezeBits · korea · 2025-10-28
Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.
High signal Matched: inference, serving, gemm, performance, h100, training
Modular · inference-infra · 2025-09-05
Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul
High signal Matched: matmul, blackwell
Hugging Face · open-source · 2025-08-18
No feed summary available yet.
High signal Matched: cuda, gpu
Hugging Face · open-source · 2025-06-12
No feed summary available yet.
High signal Matched: kernel
Modular · inference-infra · 2025-05-29
Modverse #48: Modular Platform 25.3, MAX AI Kernels, and the Modular GPU Kernel Hackathon
High signal Matched: kernel, gpu
Modular · inference-infra · 2025-05-20
Modular GPU Kernel Hackathon Highlights: Innovation, Community, & Mojo🔥
High signal Matched: kernel, gpu
Nota AI · korea · 2025-04-08
Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...
High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks
Modular · inference-infra · 2025-03-26
What about Triton and Python eDSLs? (Democratizing AI Compute, Part 7)
High signal Matched: triton
Modular · inference-infra · 2025-03-25
MAX 25.2: Unleash the power of your H200's–without CUDA!
High signal Matched: cuda, h200
Modular · inference-infra · 2025-03-05
What about OpenCL and CUDA C++ alternatives? (Democratizing AI Compute, Part 5)
High signal Matched: cuda
Modular · inference-infra · 2025-02-20
CUDA is the incumbent, but is it any good? (Democratizing AI Compute, Part 4)
High signal Matched: cuda
Modular · inference-infra · 2025-02-12
How did CUDA succeed? (Democratizing AI Compute, Part 3)
High signal Matched: cuda
Modular · inference-infra · 2025-02-05
What exactly is “CUDA”? (Democratizing AI Compute, Part 2)
High signal Matched: cuda
Cloudflare Blog · cloud · 2026-05-07
When a critical Linux kernel privilege escalation was publicly disclosed, Cloudflare's security and engineering teams detected, investigated, and mitigated the threat across our global fleet, confirming zero customer impact and no maliciou...
Watchlist Matched: kernel
BAIR · research · 2025-03-25
Training Diffusion Models with Reinforcement Learning We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and...
Watchlist Matched: throughput, kernel, performance, model, paper, training, agent, agents