SqueezeBits - MLSys Blogs

SqueezeBits · korea · 2026-04-14

Recap: 2nd vLLM Korea Meetup 2026

Score 12

Check out highlights from the 2nd vLLM Korea Meetup: open-source use cases and real-world production examples that showcase vLLM's technical maturity.

korea open-source

High signal Matched: korea, open-source

SqueezeBits · korea · 2026-03-11

Reliable & Scalable Synthetic Data for Physical AI (Part 2): Making Cosmos 3.1x Faster for Production

Score 12

Explore why Physical AI deployment needs synthetic data at scale with SqueezeBits' research, and discover how to overcome inference bottlenecks to accelerate RoBoost Agent.

inference research agents

High signal Matched: inference, research, agent

SqueezeBits · korea · 2026-02-25

Reliable & Scalable Synthetic Data for Physical AI (Part 1): Taming NVIDIA Cosmos with RoBoost Agent

Score 10

Scaling Physical AI requires reliable synthetic data. Learn how RoBoost Agent integrates NVIDIA Cosmos to transform world models into trustworthy data engines for robotics and autonomous driving.

agents

High signal Matched: agent

SqueezeBits · korea · 2026-01-07

Intel® Gaudi® Hands-on Workshop | A Recap of the Gaudi Workshop with SqueezeBits x Lablup

Score 12

A recap of the Intel® Gaudi® hands-on workshop co-hosted by SqueezeBits and Lablup, covering AI model compression, fine-tuning, and vLLM serving on Gaudi® hardware with Backend.AI.

inference serving model-release fine-tuning

High signal Matched: serving, model, fine-tuning

SqueezeBits · korea · 2025-12-24

Introducing Rebellions ATOM™-Max

Score 24

Introducing ATOM™-Max, Rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...

inference serving benchmark hardware model-release korea

High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions

SqueezeBits · korea · 2025-12-10

vLLM Hands-on Workshop with Rebellions & SqueezeBits: A Recap

Score 12

Rebellions and SqueezeBits Co-Host a vLLM Hands-on Workshop: Workshop Highlights, PyTorch Best Practices, Performance Optimization, and Developer First-Hand Tips!

benchmark korea

High signal Matched: performance, rebellions

SqueezeBits · korea · 2025-10-31

Winning both speed and quality: How Yetter deals with diffusion models

Score 16

Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable,...

inference benchmark model-release

High signal Matched: inference, generation, latency, model

SqueezeBits · korea · 2025-10-28

[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Score 20

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.

inference serving kernel benchmark hardware training

High signal Matched: inference, serving, gemm, performance, h100, training

SqueezeBits · korea · 2025-10-02

Yetter, the GenAI API service: AI Optimization, Out of the Box

Score 14

Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimized inference engine, it delivers reliable image, video, and future LLM services at a fraction of the cost.

inference benchmark api

High signal Matched: inference, cost, api

SqueezeBits · korea · 2025-09-16

Guided Decoding Performance on vLLM and SGLang

Score 16

The guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output based on your use case.

inference benchmark

High signal Matched: decoding, benchmark, performance

SqueezeBits · korea · 2025-08-26

Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode

Score 22

In this article, we show how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.

inference hardware

High signal Matched: inference, prefill, gpu, npu

SqueezeBits · korea · 2025-08-20

[Efficient AI Study] AI Model Compression Community Study and Meetup

Score 12

Efficient AI Study & Meetup recap: SqueezeBits' community study on AI model compression, featuring paper reviews, participant interviews, and networking from the offline meetup.

model-release research

High signal Matched: model, paper

SqueezeBits · korea · 2025-08-04

Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration

Score 10

Trimming large multilingual vocabularies in Small Language Models (SLMs) is a simple, low-risk way to boost efficiency. It significantly accelerates model inference while keeping accuracy almost unchanged.

inference model-release

High signal Matched: inference, model

SqueezeBits · korea · 2025-07-21

GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost

Score 20

LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses these issues through finer-grained, block-wise updates, significantly enhancing performance and expressivi...

benchmark fine-tuning

High signal Matched: performance, cost, fine-tuning, lora

SqueezeBits · korea · 2025-07-03

OwLite Meets Qualcomm Neural Network: Unlocking On-Device AI Performance

Score 10

At SqueezeBits, we have been empowering developers to efficiently deploy complex AI models while minimizing performance trade-offs with the OwLite toolkit. With OwLite v2.5, we're excited to announce official support for Qualcomm Neural Network...

benchmark

High signal Matched: performance

SqueezeBits · korea · 2025-07-01

Bringing NPUs into Production: Our Journey with Intel Gaudi

Score 8

SqueezeBits has partnered with Intel to make Gaudi NPUs more usable in practice. We optimized LLMs and diffusion models for Gaudi-2 and created Yetter, a generative AI API service.

api

High signal Matched: api

SqueezeBits · korea · 2025-06-10

[Japan IT Week Spring 2025] What We Saw on the Global AI Frontline in Tokyo

Score 8

SqueezeBits at Japan IT Week Spring 2025 in Tokyo: AI model compression demos, OwLite and Fits on Chips introductions, Japan market entry experiences, and team stories from the frontline.

model-release

High signal Matched: model

SqueezeBits · korea · 2025-05-20

How to Quantize Transformer-based model for TensorRT Deployment

Score 12

This article describes the experimental results of quantizing the Vision Transformer model and its variants with OwLite.

model-release quantization

High signal Matched: model, quantized

SqueezeBits · korea · 2025-05-07

How to Quantize YOLO models with OwLite

Score 8

This article describes the experimental results of quantizing YOLO models with OwLite.

quantization

High signal Matched: quantized

SqueezeBits · korea · 2025-04-11

OwLite: No More Compromising on AI Performance After Quantization

Score 16

Discover how OwLite simplifies AI model optimization with seamless integration and secure architecture.

benchmark model-release quantization

High signal Matched: performance, model, quantization

SqueezeBits · korea · 2025-04-02

[Intel Gaudi] #5. FLUX.1 on Gaudi-2

Score 8

This article discusses inference efficiency when running the FLUX.1 models on Intel Gaudi-2 hardware.

inference

High signal Matched: inference

SqueezeBits · korea · 2025-03-26

TensorRT-LLM Goes Open Source!

Score 12

With TensorRT-LLM now open source, we can finally take a deep dive into the secret sauce behind its impressive performance.

benchmark open-source

High signal Matched: performance, open source

SqueezeBits · korea · 2025-02-27

Fits on Chips: Saving LLM Costs Became Easier Than Ever

Score 10

This article introduces Fits on Chips, an LLMOps toolkit for performance evaluation.

benchmark research evals

High signal Matched: performance, evaluation

SqueezeBits · korea · 2025-02-17

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Score 14

A brief review of the research paper from our team, published at ICML 2024.

speculative-decoding research

High signal Matched: verification, paper, research

SqueezeBits · korea · 2025-02-10

The Missing Piece of TensorRT-LLM

Score 8

This article is about an open-source library for direct conversion of PyTorch models to TensorRT-LLM.

open-source

High signal Matched: open-source

SqueezeBits · korea · 2025-01-20

[vLLM vs TensorRT-LLM] #13. Vision-Language Models

Score 8

This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.

inference serving

High signal Matched: serving

SqueezeBits · korea · 2025-01-13

[Intel Gaudi] #4. FP8 Quantization

Score 20

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware model-release quantization evals

High signal Matched: performance, accelerator, fp8, quantization, evaluate

SqueezeBits · korea · 2025-01-06

[Intel Gaudi] #3. Performance Evaluation with SynapseAI v1.19

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware research evals

High signal Matched: performance, accelerator, evaluation, evaluate

SqueezeBits · korea · 2024-12-09

[vLLM vs TensorRT-LLM] #11. Speculative Decoding

Score 14

This article provides a comparative analysis of speculative decoding.

inference speculative-decoding

High signal Matched: decoding, speculative decoding

SqueezeBits · korea · 2024-12-05

[vLLM vs TensorRT-LLM] #10. Serving Multiple LoRAs at Once

Score 14

This article provides a comparative analysis of multi-LoRA serving capabilities of vLLM and TensorRT-LLM frameworks.

inference serving fine-tuning

High signal Matched: serving, lora

SqueezeBits · korea · 2024-12-03

[Intel Gaudi] #2. Graph Compiler and Overall Performance Evaluation

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware research evals

High signal Matched: performance, accelerator, evaluation, evaluate

SqueezeBits · korea · 2024-11-21

[Intel Gaudi] #1. Introduction

Score 12

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware evals

High signal Matched: performance, accelerator, evaluate

SqueezeBits · korea · 2024-11-18

[vLLM vs TensorRT-LLM] #8. KV Cache Quantization

Score 14

This article provides a comparative analysis of the effects of KV cache quantization on vLLM and TensorRT-LLM frameworks.

kv-cache quantization

High signal Matched: kv cache, quantization

SqueezeBits · korea · 2024-11-11

[vLLM vs TensorRT-LLM] #7. Weight-Activation Quantization

Score 10

This article provides a comparative analysis of the effects of weight-activation quantization on vLLM and TensorRT-LLM frameworks.

quantization

High signal Matched: quantization

SqueezeBits · korea · 2024-11-01

[vLLM vs TensorRT-LLM] #6. Weight-Only Quantization

Score 10

This article provides a comparative analysis of the effects of weight-only quantization on vLLM and TensorRT-LLM frameworks.

quantization

High signal Matched: quantization

SqueezeBits · korea · 2024-10-30

[vLLM vs TensorRT-LLM] #5. Dynamic Sequence Lengths

Score 8

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on performance with fixed and dynamic datasets.

benchmark

High signal Matched: performance

SqueezeBits · korea · 2024-10-18

[vLLM vs TensorRT-LLM] #3. Understanding Sampling Methods and Their Performance Impact

Score 10

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks with various sampling methods.

benchmark

High signal Matched: performance

SqueezeBits · korea · 2024-10-11

[vLLM vs TensorRT-LLM] #2. Towards Optimal Batching for LLM Serving

Score 10

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.

inference serving

High signal Matched: serving

SqueezeBits · korea · 2024-10-01

[vLLM vs TensorRT-LLM] #1. An Overall Evaluation

Score 22

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...

inference serving benchmark research evals

High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating

SqueezeBits · korea · 2024-06-26

How much can we save through compression?

Score 10

Estimating the cost savings from model compression.

benchmark model-release

High signal Matched: cost, model

SqueezeBits · korea · 2024-04-24

Accuracy Degradation in AI Compression: Myth or Truth?

Score 8

Clarifying common misunderstandings about AI model compression.

model-release

High signal Matched: model

SqueezeBits · korea · 2024-04-23

Are you getting everything out of your GPUs?

Score 12

The Blackwell GPU from GTC 2024 was astonishing. An analysis of NVIDIA GPU evolution and what it means for GPU users.

hardware

High signal Matched: gpu, blackwell

SqueezeBits · korea · 2024-04-19

Things to check if your business utilizes AI

Score 8

Do I need to COMPRESS my AI model? The short answer is “YES”, and here’s why.

model-release

High signal Matched: model

SqueezeBits · korea · 2024-04-15

AI Compression for Acceleration: 4 Key Methods.

Score 8

AI model compression for acceleration is essential. The question is HOW? Here are 4 key methodologies.

model-release

High signal Matched: model

SqueezeBits · korea · 2026-05-28

2026 Efficient AI Offline Meetup

Score 2

Wrap up 8 weeks of online study and see how SqueezeBits works to sustain and grow the AI compression community!

Watchlist Matched: none

SqueezeBits · korea · 2026-03-27

Our Experience Running a Booth at GTC 2026

Score 0

Sharing insights from GTC 2026, the largest AI industry conference for developers! If you’ve ever wondered what it’s like for an AI startup to run a booth at such a massive event, you won’t want to miss this!

Watchlist Matched: none

SqueezeBits · korea · 2025-04-02

Field Notes from the Global AI Market: Our Overseas Event Recap

Score 0

From Edge AI to NVIDIA GTC: SqueezeBits team members share firsthand stories from global AI events, including networking insights, technical trends, and conference experiences.

Watchlist Matched: none

SqueezeBits · korea · 2025-03-10

When Should I Use Fits on Chips?

Score 1

This article describes when to use the Fits on Chips toolkit, with specific use cases.

Watchlist Matched: none

SqueezeBits · korea · 2025-02-06

The Rise and Fall of ONNX (feat. PyTorch 2.0)

Score 1

This article explores the rise and fall of ONNX, from its early success as a unifying standard for AI frameworks to its gradual shift into a niche tool in the era of PyTorch 2.0.

Watchlist Matched: none

SqueezeBits · korea · 2024-12-23

[vLLM vs TensorRT-LLM] #12. Automatic Prefix Caching

Score 1

This article provides a comparative analysis of automatic prefix caching.

Watchlist Matched: none

SqueezeBits · korea · 2024-11-26

[vLLM vs TensorRT-LLM] #9. Parallelism Strategies

Score 1

This article provides a comparative analysis of different parallelism strategies on vLLM and TensorRT-LLM frameworks.

Watchlist Matched: none

SqueezeBits · korea · 2024-10-24

[vLLM vs TensorRT-LLM] #4. Which Scheduler Wins? 🔥

Score 1

This article provides a comparative analysis of schedulers in vLLM and TensorRT-LLM frameworks.

Watchlist Matched: none

SqueezeBits · korea · 2024-05-27

Experiencing AI Model Compression Firsthand: Our IT Exhibition Story

Score 2

SqueezeBits' IT exhibition recap: from AI model compression demos to hands-on OwLite experiences, booth visitor reactions, and more. Read our on-the-ground event story!

model-release

Watchlist Matched: model

SqueezeBits · korea · 2024-05-16

‘Breaking Down’ Tokenizers in LLMs

Score 1

An introduction to tokenizers and their implications in language models.

Watchlist Matched: none