AMD ROCm Blogs

AMD ROCm technical blog covering AI, HPC, GPU software, vLLM, kernels, and performance optimization on AMD accelerators.

Country: Unknown
Category: hardware
Blog: https://rocm.blogs.amd.com/
Feed: https://rocm.blogs.amd.com/blog/atom.xml
Feed discovery status: known

AMD ROCm Blogs · hardware · 2026-06-01

Out-of-the-Box ROLL Support on AMD GPUs: Accelerating Reinforcement Learning at Scale

Score 13

Reinforcement learning (RL) is rapidly becoming a foundational technology for Large Language Models (LLMs)—powering key abilities such as reasoning and agentic behaviors. As RL workloads grow more complex and computationally intensive, the...

benchmark hardware agents

High signal Matched: performance, gpu, agentic

AMD ROCm Blogs · hardware · 2026-06-01

Performance Profiling on AMD GPUs - Part 4: Fortran OpenMP Offload Edition

Score 11

This blog, like the previous articles in the profiling guide series (Part 1, Part 2, and Part 3), is designed to help you systematically analyze and improve the performance of your Fortran OpenMP offload applications running on AMD GPUs. T...

benchmark

High signal Matched: performance

AMD ROCm Blogs · hardware · 2026-05-29

Enabling Speculative Speculative Decoding on MI300X

Score 29

Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...

inference speculative-decoding benchmark hardware model-release

High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model

AMD ROCm Blogs · hardware · 2026-05-29

Running Variational Quantum Eigensolver with Qiskit Aer on AMD Instinct

Score 13

Quantum computing offers a fundamentally different approach to computational problems by leveraging quantum mechanical properties such as superposition and entanglement. Unlike a classical bit, which is always 0 or 1, a qubit can exist in...

benchmark hardware

High signal Matched: benchmark, cost, gpu

AMD ROCm Blogs · hardware · 2026-05-27

Deep Dive Into 4-Wave Interleave FP8 GEMM

Score 17

Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves th...

kernel benchmark model-release quantization

High signal Matched: gemm, performance, fp8

AMD ROCm Blogs · hardware · 2026-05-25

AI Inference on AMD Ryzen™ AI Max Processor

Score 20

Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...

inference distributed hardware model-release cloud quantization evals

High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate

AMD ROCm Blogs · hardware · 2026-05-22

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Score 30

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...

inference serving kernel triton benchmark model-release cloud open-source

High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source

AMD ROCm Blogs · hardware · 2026-05-22

From Naive to Near-Peak: Building High-Performance GEMM Kernels with Gluon

Score 18

On a single MI355, our most-optimized FP16 GEMM kernel runs at 99% MFMA efficiency — the matrix engine sits idle for a handful of cycles per loop. Getting there took ten versions, a regression along the way, and a profiler open for the who...

kernel benchmark

High signal Matched: kernel, gemm, performance

AMD ROCm Blogs · hardware · 2026-05-20

ROCm 7.13: Expanding Hardware, Tools, and Reach

Score 14

AMD released ROCm Core 7.13, the AMD GPU Driver 31.30, and AMD GPU Virtualization 9.0. With these releases, ROCm software expands hardware support across enterprise datacenters. The platform introduces AMD’s latest Instinct accelerators, e...

benchmark hardware open-source

High signal Matched: performance, gpu, rocm, open-source

AMD ROCm Blogs · hardware · 2026-05-20

QuickReduce FP4 Quantization and Benchmarking on MI355

Score 12

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent,...

inference benchmark model-release quantization

High signal Matched: inference, latency, introducing, quantization