MLSys Radar

open-source

Lambda · cloud · 2026-05-22

DeepSeek V4: the most expected open-source model ever released, and the quietest landing

Score 18

After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....

inference serving benchmark model-release open-source

High signal Matched: inference, serving, performance, cost, release, model, open-source

AMD ROCm Blogs · hardware · 2026-05-22

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Score 30

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...

inference serving kernel triton benchmark model-release cloud open-source

High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source

AMD ROCm Blogs · hardware · 2026-05-20

ROCm 7.13: Expanding Hardware, Tools, and Reach

Score 14

AMD released ROCm Core 7.13, the AMD GPU Driver 31.30, and AMD GPU Virtualization 9.0. With these releases, ROCm software expands hardware support across enterprise datacenters. The platform introduces AMD’s latest Instinct accelerators, e...

benchmark hardware open-source

High signal Matched: performance, gpu, rocm, open-source

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

LMCache · open-source · 2026-04-18

LMCache: A Journey

Score 12

GTC wrapped up a month ago. Our open-source KV cache management library, LMCache, was shown in Jensen Huang’s keynote and spotlighted by NVIDIA SVP Kevin Deierling, and I was invited to speak at the first-ever industry KV cache tutorial...

kv-cache open-source

High signal Matched: kv cache, lmcache, open-source

SqueezeBits · korea · 2026-04-14

Recap: 2nd vLLM Korea Meetup 2026

Score 12

Check out highlights from the 2nd vLLM Korea Meetup: open-source use cases and real-world production examples that showcase vLLM's technical maturity.

korea open-source

High signal Matched: korea, open-source

Together AI · inference-infra · 2026-03-31

Aurora

Score 12

1.25x over a well-trained static speculator. Aurora is an open-source RL framework that turns speculative decoding from a one-time offline setup into a self-improving system that learns from every request it serves.

inference speculative-decoding open-source

High signal Matched: decoding, speculative decoding, open-source

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Together AI · inference-infra · 2026-03-17

Mamba-3

Score 10

Meet Mamba-3: the SSM built for inference. Faster than Transformers at decode, stronger than Mamba-2, and open-source from day one.

inference open-source

High signal Matched: inference, open-source

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

Together AI · inference-infra · 2026-03-02

Introducing Together AI’s new look

Score 14

We've refreshed our visual identity — designed with Pentagram to express how Together AI connects open-source innovation, systems research, and builders to unlock new possibilities.

model-release research open-source

High signal Matched: introducing, research, open-source

Together AI · inference-infra · 2026-02-02

Fine-tuning open LLM judges to outperform GPT-5.2

Score 14

Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...

inference benchmark model-release fine-tuning evals open-source

High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss

Together AI · inference-infra · 2026-01-26

DSGym: A holistic framework for evaluating and training data science agents

Score 18

Introducing DSGym—a holistic evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...

inference benchmark model-release research training evals agents open-source

High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source

Together AI · inference-infra · 2025-12-01

Together AI delivers fastest inference for the top open-source models

Score 20

Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...

inference speculative-decoding hardware quantization evals open-source

High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source

SkyPilot · open-source · 2025-08-12

Self-host open-source LLM agent sandbox on your own cloud

Score 10

Your AI writes code. Now what? If you’re building AI agents in 2025, you probably wondered that as well. Your LLM generates some Python code that analyzes data, manipulates files, or calls APIs. But where does it run? Most people eit...

cloud agents open-source

High signal Matched: cloud, agent, agents, open-source

llm-d · open-source · 2025-05-20

llm-d Press Release

Score 20

Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.

inference distributed model-release cloud open-source

High signal Matched: inference, distributed, release, cloud, open source

Nota AI · korea · 2025-05-07

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Score 28

Jewon Lee | Ki-Ung Song | Seungmin Yang | Donguk Lim | Jaeyeon Kim | Wooksu Shin | Bo-Kyeong Kim | Tae-Ho Kim, EdgeFM Team, Nota AI. Yong Jae Lee, Ph.D., Associate Professor, UW-Madison. Summary: Our method, Trimmed-Llama, reduces t...

inference kv-cache benchmark model-release research training evals open-source

High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source

SqueezeBits · korea · 2025-03-26

TensorRT-LLM Goes Open Source!

Score 12

With TensorRT-LLM now open source, we can finally take a deep dive into the secret sauce behind its impressive performance.

benchmark open-source

High signal Matched: performance, open source

AIBrix · open-source · 2025-02-21

Introducing AIBrix: Cost-Effective and Scalable Control Plane for vLLM

Score 26

Open-source large language models (LLMs) such as LLaMA, DeepSeek, Qwen, and Mistral have surged in popularity, offering enterprises greater flexibility, cost savings, and control over their AI deployments. These models have empowered organ...

inference benchmark model-release agents open-source

High signal Matched: inference, generation, latency, cost, introducing, model, agents, open-source

AIBrix · open-source · 2024-11-13

Introducing AIBrix v0.1.0: Building the Future of Scalable, Cost-Effective AI Infrastructure for Large Models

Score 32

In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...

inference kv-cache benchmark hardware model-release cloud open-source

High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source

Nota AI · korea · 2024-08-02

Deploying an Efficient Vision-Language Model on Mobile Devices

Score 38

Jaeyeon Kim, Research Engineer, Nota AI; Geonmin Kim, Research Engineer, Nota AI; Hancheol Park, Team Lead of NetsPresso Application, Nota AI. Introduction: Recent large language models (LLMs) have demonstrated unprecedented performance...

inference benchmark model-release research cloud training fine-tuning evals open-source

High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source

Replicate · inference-infra · 2023-07-27

Run Llama 2 with an API

Score 8

Llama 2 is the first open source language model of the same caliber as OpenAI’s models. Learn how to run it in the cloud with one line of code.

model-release cloud api open-source

High signal Matched: model, cloud, api, open source

Lambda · cloud · 2026-04-30

Creating highly efficient agents: 450M tool-calling tokens distilled for post-training from top open-source models

Score 4

Harnesses If you've used Claude Code or Codex, you've used a harness. A harness is the infrastructure layer that wraps an AI coding agent and decides how it operates, what it can touch, and how you measure whether it worked. It's how most...

hardware training agents open-source

Watchlist Matched: gpu, training, post-training, agent, agents, open-source

Replicate · inference-infra · 2024-11-26

FLUX fine-tunes are now fast

Score 0

We've made running fine-tunes on Replicate much faster, and the optimizations are open-source.

open-source

Watchlist Matched: open-source

Replicate · inference-infra · 2024-10-10

FLUX is fast and it's open source

Score 0

FLUX is now much faster on Replicate, and we’ve made our optimizations open-source so you can see exactly how they work and build upon them.

open-source

Watchlist Matched: open-source, open source

Replicate · inference-infra · 2024-08-01

Run FLUX with an API

Score 6

FLUX.1 is a new text-to-image model from Black Forest Labs, the creators of Stable Diffusion, that exceeds the capabilities of previous open-source models.

model-release api open-source

Watchlist Matched: model, api, open-source

Replicate · inference-infra · 2024-05-24

Replicate Intelligence #1

Score 0

DIY Llama 3 implementation, open-source smart glasses, steering language models with dictionary learning

open-source

Watchlist Matched: open-source