vLLM Project · open-source · 2026-06-02
Score 13
We are excited to announce that AutoRound — Intel's state-of-the-art post-training quantization (PTQ) algorithm — is now fully integrated into vLLM-Omni, enabling a streamlined quantize-once,...
High signal Matched: inference, training, post-training, quantization
Nota AI · korea · 2026-05-29
Score 23
Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...
High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard
AMD ROCm Blogs · hardware · 2026-05-27
Score 17
Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves th...
High signal Matched: gemm, performance, fp8
AMD ROCm Blogs · hardware · 2026-05-25
Score 20
Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...
High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate
AMD ROCm Blogs · hardware · 2026-05-20
Score 12
Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent,...
High signal Matched: inference, latency, introducing, quantization
LMCache · open-source · 2026-05-13
Score 20
A practitioner’s guide to KV-cache tiering on ROCm — what works, what doesn’t, and the regime where it actually matters. Key Summary We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation trac...
High signal Matched: lmcache, moe, mi300x, rocm, fp8, agentic
Nota AI · korea · 2026-05-11
Score 52
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
vLLM Project · open-source · 2026-05-11
Score 14
TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...
High signal Matched: performance, gpu, quantization
NVIDIA Technical Blog · hardware · 2026-05-07
Score 16
Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By...
High signal Matched: inference, performance, model, training, post-training, quantization
Nota AI · korea · 2026-04-29
Score 32
Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...
High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic
Nota AI · korea · 2026-04-22
Score 54
Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...
High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source
vLLM Project · open-source · 2026-04-22
Score 18
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...
High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context
NVIDIA Technical Blog · hardware · 2026-04-20
Score 18
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...
High signal Matched: generation, throughput, fp8, training
Nota AI · korea · 2026-04-08
Score 36
Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...
High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate
Nota AI · korea · 2026-03-31
Score 46
Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...
High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model
Nota AI · korea · 2026-03-20
Score 26
NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...
High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization
Nota AI · korea · 2026-03-13
Score 62
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
vLLM Project · open-source · 2026-02-13
Score 22
DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...
High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization
Together AI · inference-infra · 2026-01-13
Score 24
Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...
High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents
Nota AI · korea · 2025-12-19
Score 74
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
vLLM Project · open-source · 2025-12-15
Score 10
Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...
High signal Matched: model, quantization, agents
vLLM Project · open-source · 2025-12-09
Score 10
Achieve faster, more efficient LLM serving without sacrificing accuracy!
High signal Matched: serving, quantization
Together AI · inference-infra · 2025-12-01
Score 20
Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...
High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source
SqueezeBits · korea · 2025-05-20
Score 12
This article describes experimental results for the quantized Vision Transformer model and its variants with OwLite.
High signal Matched: model, quantized
Nota AI · korea · 2025-05-08
Score 20
Jaewoo Song, Software Engineer, Nota AI. Summary: This study proposes an AI model preprocessing method that improves quantization accuracy on edge AI devices that do not support advanced quantization methods due to their limitat...
High signal Matched: performance, model, weights, research, quantization, int8, int4
SqueezeBits · korea · 2025-05-07
Score 8
This article describes the experimental results of quantized YOLO models with OwLite.
High signal Matched: quantized
Hugging Face · open-source · 2025-04-29
Score 10
No feed summary available yet.
High signal Matched: introducing, quantization
SqueezeBits · korea · 2025-04-11
Score 16
Discover how OwLite simplifies AI model optimization with seamless integration and secure architecture.
High signal Matched: performance, model, quantization
SqueezeBits · korea · 2025-01-13
Score 20
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, fp8, quantization, evaluate
SqueezeBits · korea · 2024-11-18
Score 14
This article provides a comparative analysis of the effects of KV cache quantization on vLLM and TensorRT-LLM frameworks.
High signal Matched: kv cache, quantization
SqueezeBits · korea · 2024-11-11
Score 10
This article provides a comparative analysis of the effects of weight-activation quantization on vLLM and TensorRT-LLM frameworks.
High signal Matched: quantization
SqueezeBits · korea · 2024-11-01
Score 10
This article provides a comparative analysis of the effects of weight-only quantization on vLLM and TensorRT-LLM frameworks.
High signal Matched: quantization
Modular · inference-infra · 2024-06-07
Score 10
MAX 24.4 - Introducing quantization APIs and MAX on macOS
High signal Matched: introducing, quantization
Hugging Face · open-source · 2024-05-16
Score 10
No feed summary available yet.
High signal Matched: generation, quantization
Hugging Face · open-source · 2025-05-21
Score 1
No feed summary available yet.
Watchlist Matched: quantization
Hugging Face · open-source · 2024-09-18
Score 1
No feed summary available yet.
Watchlist Matched: fine-tuning, quantization
Modular · inference-infra · 2024-06-25
Score 1
What's new in MAX 24.4? MAX on macOS, fast local Llama3, native quantization and GGUF support
Watchlist Matched: quantization, gguf
Hugging Face · open-source · 2024-03-22
Score 1
No feed summary available yet.
Watchlist Matched: quantization, retrieval
Hugging Face · open-source · 2024-03-18
Score 1
No feed summary available yet.
Watchlist Matched: quantization
Hugging Face · open-source · 2023-09-12
Score 1
No feed summary available yet.
Watchlist Matched: quantization
Hugging Face · open-source · 2023-07-27
Score 1
No feed summary available yet.
Watchlist Matched: quantization
Hugging Face · open-source · 2023-05-24
Score 1
No feed summary available yet.
Watchlist Matched: qlora, quantization