cuda - MLSys Blogs

NVIDIA Technical Blog · hardware · 2026-05-26

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

Score 21

Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...

kernel cuda benchmark hardware

High signal Matched: cuda, performance, gpu

NVIDIA Technical Blog · hardware · 2026-05-26

NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates

Score 21

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...

kernel cuda benchmark hardware model-release

High signal Matched: cuda, performance, gpu, launch

PyTorch Foundation · open-source · 2026-05-19

vLLM and PyTorch Work Together to Improve the Developer Experience on aarch64

Score 8

TLDR: PyTorch 2.11 makes it possible to install CUDA-enabled PyTorch wheels on aarch64 Linux directly from PyPI, eliminating the need for custom package indexes and workarounds that previously complicated deployment...

kernel cuda

High signal Matched: cuda

PyTorch Foundation · open-source · 2026-05-14

PyTorch 2.12 Release Blog

Score 12

We are excited to announce the release of PyTorch® 2.12 (release notes)! The PyTorch 2.12 release features the following changes: Batched linalg.eigh on CUDA is up to 100x faster due...

kernel cuda model-release

High signal Matched: cuda, release

NVIDIA Technical Blog · hardware · 2026-04-30

Automating GPU Kernel Translation with AI Agents: cuTile Python to cuTile.jl

Score 20

NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and...

kernel cuda hardware model-release agents

High signal Matched: kernel, cuda, gpu, model, agents

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

NVIDIA Technical Blog · hardware · 2026-04-14

NVIDIA NVbandwidth: Your Essential Tool for Measuring GPU Interconnect and Memory Performance

Score 18

When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is data transfer performance. This applies to...

kernel cuda benchmark hardware

High signal Matched: cuda, performance, gpu

NVIDIA Technical Blog · hardware · 2026-04-01

CUDA Tile Programming Now Available for BASIC!

Score 12

Note: CUDA Tile Programming in BASIC is an April Fools’ joke, but it's also real and actually works, demonstrating the flexibility of CUDA. CUDA 13.1...

kernel cuda

High signal Matched: cuda

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Hugging Face · open-source · 2026-01-28

We Got Claude to Build CUDA Kernels and teach open models!

Score 10

No feed summary available yet.

kernel cuda

High signal Matched: cuda

Modular · inference-infra · 2026-01-14

How to Beat Unsloth's CUDA Kernel Using Mojo—With Zero GPU Experience

Score 18

kernel cuda hardware

High signal Matched: kernel, cuda, gpu

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-03

Tracing Hanging and Complicated GPU Kernels Down To The Source Code

Score 16

Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...

kernel cuda hardware model-release

High signal Matched: cuda, gpu, introducing

Hugging Face · open-source · 2025-08-18