quantization - MLSys Blogs

vLLM Project · open-source · 2026-06-02

Accelerating vLLM-Omni Inference with AutoRound Quantization

Score 13

We are excited to announce that AutoRound — Intel's state-of-the-art post-training quantization (PTQ) algorithm — is now fully integrated into vLLM-Omni, enabling a streamlined quantize-once,...

inference training quantization

High signal Matched: inference, training, post-training, quantization

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...

inference serving benchmark hardware model-release research quantization evals

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

AMD ROCm Blogs · hardware · 2026-05-27

Deep Dive Into 4-Wave Interleave FP8 GEMM

Score 17

Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves th...

kernel benchmark model-release quantization

High signal Matched: gemm, performance, fp8

AMD ROCm Blogs · hardware · 2026-05-25

AI Inference on AMD Ryzen™ AI Max Processor

Score 20

Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...

inference distributed hardware model-release cloud quantization evals

High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate

AMD ROCm Blogs · hardware · 2026-05-20

QuickReduce FP4 Quantization and Benchmarking on MI355

Score 12

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent,...

inference benchmark model-release quantization

High signal Matched: inference, latency, introducing, quantization

LMCache · open-source · 2026-05-13

Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X

Score 20

A practitioner’s guide to KV-cache tiering on ROCm: what works, what doesn’t, and the regime where it actually matters. Key Summary: We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...

kv-cache moe hardware model-release quantization agents

High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

vLLM Project · open-source · 2026-05-11

A First Comprehensive Study of TurboQuant: Accuracy and Performance

Score 14

TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...

benchmark hardware quantization

High signal Matched: performance, gpu, quantization

NVIDIA Technical Blog · hardware · 2026-05-07

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer

Score 16

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By...

inference benchmark model-release training quantization

High signal Matched: inference, performance, model, training, post-training, quantization

Nota AI · korea · 2026-04-29

[NVIDIA Nemotron Hackathon] Grand Prize Among 20 Teams: Behind Two Sleepless Days

Score 32

Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...

inference moe benchmark model-release research korea training fine-tuning quantization evals agents

High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

vLLM Project · open-source · 2026-04-22

The State of FP8 KV-Cache and Attention Quantization in vLLM

Score 18

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

inference serving kv-cache hardware model-release quantization long-context

High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context

NVIDIA Technical Blog · hardware · 2026-04-20

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

Score 18

As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...

inference serving benchmark model-release training quantization

High signal Matched: generation, throughput, fp8, training

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...

inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

Nota AI · korea · 2026-03-20

GenAI Everywhere: The Future of Edge AI Optimization with the New NetsPresso®

Score 26

NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...

inference kv-cache moe benchmark model-release research korea quantization

High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

vLLM Project · open-source · 2026-02-13

DeepSeek-V3.2 on GB300: Performance Breakthrough

Score 22

DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

serving moe benchmark hardware quantization

High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization

Together AI · inference-infra · 2026-01-13

Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale

Score 24

Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...

inference benchmark hardware model-release quantization agents

High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-15

Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM

Score 10

Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...

model-release quantization agents

High signal Matched: model, quantization, agents

vLLM Project · open-source · 2025-12-09

Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor

Score 10

Achieve faster, more efficient LLM serving without sacrificing accuracy!

inference serving quantization

High signal Matched: serving, quantization

Together AI · inference-infra · 2025-12-01

Together AI delivers fastest inference for the top open-source models

Score 20

Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...

inference speculative-decoding hardware quantization evals open-source

High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source

SqueezeBits · korea · 2025-05-20

How to Quantize Transformer-based model for TensorRT Deployment

Score 12

This article describes experimental results from quantizing a Vision Transformer model and its variants with OwLite.

model-release quantization

High signal Matched: model, quantized

Nota AI · korea · 2025-05-08

SplitQuant: Layer Splitting for Low-Bit Neural Network Quantization for Edge AI Devices

Score 20

Jaewoo Song, Software Engineer, Nota AI. Summary: This study proposes an AI model preprocessing method for improved quantization accuracy on edge AI devices that do not support advanced quantization methods due to their limitat...

benchmark model-release research quantization

High signal Matched: performance, model, weights, research, quantization, int8, int4

SqueezeBits · korea · 2025-05-07

How to Quantize YOLO models with OwLite

Score 8

This article describes experimental results from quantizing YOLO models with OwLite.

quantization

High signal Matched: quantized

Hugging Face · open-source · 2025-04-29

Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs

Score 10

No feed summary available yet.

model-release quantization

High signal Matched: introducing, quantization

SqueezeBits · korea · 2025-04-11

OwLite: No More Compromising on AI Performance After Quantization

Score 16

Discover how OwLite simplifies AI model optimization with seamless integration and secure architecture.

benchmark model-release quantization

High signal Matched: performance, model, quantization

SqueezeBits · korea · 2025-01-13

[Intel Gaudi] #4. FP8 Quantization

Score 20

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware model-release quantization evals

High signal Matched: performance, accelerator, fp8, quantization, evaluate

SqueezeBits · korea · 2024-11-18

[vLLM vs TensorRT-LLM] #8. KV Cache Quantization

Score 14

This article provides a comparative analysis of the effects of KV cache quantization on vLLM and TensorRT-LLM frameworks.

kv-cache quantization

High signal Matched: kv cache, quantization

SqueezeBits · korea · 2024-11-11

[vLLM vs TensorRT-LLM] #7. Weight-Activation Quantization

Score 10

This article provides a comparative analysis of the effects of weight-activation quantization on vLLM and TensorRT-LLM frameworks.

quantization

High signal Matched: quantization

SqueezeBits · korea · 2024-11-01

[vLLM vs TensorRT-LLM] #6. Weight-Only Quantization

Score 10

This article provides a comparative analysis of the effects of weight-only quantization on vLLM and TensorRT-LLM frameworks.

quantization

High signal Matched: quantization

Modular · inference-infra · 2024-06-07

MAX 24.4 - Introducing quantization APIs and MAX on macOS

Score 10

model-release quantization

High signal Matched: introducing, quantization

Hugging Face · open-source · 2024-05-16

Unlocking Longer Generation with Key-Value Cache Quantization

Score 10

No feed summary available yet.

inference quantization

High signal Matched: generation, quantization

Hugging Face · open-source · 2025-05-21

Exploring Quantization Backends in Diffusers

Score 1

No feed summary available yet.

quantization

Watchlist Matched: quantization

Hugging Face · open-source · 2024-09-18

Fine-tuning LLMs to 1.58bit: extreme quantization made easy

Score 1

No feed summary available yet.

fine-tuning quantization

Watchlist Matched: fine-tuning, quantization

Modular · inference-infra · 2024-06-25

What's new in MAX 24.4? MAX on macOS, fast local Llama3, native quantization and GGUF support

Score 1

quantization

Watchlist Matched: quantization, gguf

Hugging Face · open-source · 2024-03-22

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

Score 1

No feed summary available yet.

quantization rag

Watchlist Matched: quantization, retrieval

Hugging Face · open-source · 2024-03-18

Quanto: a PyTorch quantization backend for Optimum

Score 1

No feed summary available yet.

quantization

Watchlist Matched: quantization

Hugging Face · open-source · 2023-09-12

Overview of natively supported quantization schemes in 🤗 Transformers

Score 1

No feed summary available yet.

quantization

Watchlist Matched: quantization

Hugging Face · open-source · 2023-07-27

Stable Diffusion XL on Mac with Advanced Core ML Quantization

Score 1

No feed summary available yet.

quantization

Watchlist Matched: quantization

Hugging Face · open-source · 2023-05-24

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

Score 1

No feed summary available yet.

fine-tuning quantization

Watchlist Matched: qlora, quantization