MLSys Radar

model-release

FriendliAI · inference-infra · 2026-06-03

Model APIs

Score 15

No feed summary available yet.

model-release

Open

High signal Matched: model

AWS Machine Learning Blog · cloud · 2026-06-03

The art and science of hyperparameter optimization on Amazon Nova Forge

Score 11

Fine-tuning for domain-specific tasks means improving performance in one area without degrading the model’s general capabilities, and getting that balance right is harder than it looks. This post walks through how to navigate that balance,...

benchmark model-release training fine-tuning

Open

High signal Matched: performance, model, training, checkpointing, fine-tuning

Lambda · cloud · 2026-06-03

Introducing workspaces for Lambda Cloud

Score 17

Lambda workspaces help teams organize cloud resources, control access, and separate dev, staging, and production in shared GPU environments. A junior researcher kills a production training run. A contractor sees weights they shouldn't. If...

hardware model-release cloud training

Open

High signal Matched: gpu, introducing, weights, cloud, training

AWS Machine Learning Blog · cloud · 2026-06-02

Extending MCP support for Amazon Bedrock AgentCore Gateway

Score 11

While deploying Model Context Protocol (MCP) servers in production, enterprises need fine-grained access control across servers, observability into which teams use which tools, security guarantees against data exfiltration, and centralized...

model-release cloud agents

Open

High signal Matched: model, bedrock, mcp

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

inference serving distributed benchmark hardware model-release rag agents

Open

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

  Jaehoon Lee Technical Content Manager, Nota AI   When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device.T...

inference serving benchmark hardware model-release research quantization evals

Open

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

AMD ROCm Blogs · hardware · 2026-05-29

Enabling Speculative Speculative Decoding on MI300X

Score 29

Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...

inference speculative-decoding benchmark hardware model-release

Open

High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model

AMD ROCm Blogs · hardware · 2026-05-27

Deep Dive Into 4-Wave Interleave FP8 GEMM

Score 17

Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves th...

kernel benchmark model-release quantization

Open

High signal Matched: gemm, performance, fp8

AMD ROCm Blogs · hardware · 2026-05-25

AI Inference on AMD Ryzen™ AI Max Processor

Score 20

Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...

inference distributed hardware model-release cloud quantization evals

Open

High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate

Lambda · cloud · 2026-05-22

DeepSeek V4: the most expected open-source model ever released, and the quietest landing

Score 18

After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....

inference serving benchmark model-release open-source

Open

High signal Matched: inference, serving, performance, cost, release, model, open-source

SkyPilot · open-source · 2026-05-22

RL Doesn't Work on Slurm

Score 8

Online reinforcement learning for LLMs breaks Slurm's batch scheduling model. We'll discuss why, and what can be done about it.

model-release

Open

High signal Matched: model

AMD ROCm Blogs · hardware · 2026-05-22

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Score 30

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...

inference serving kernel triton benchmark model-release cloud open-source

Open

High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source

Lambda · cloud · 2026-05-20

Lambda’s NVIDIA HGX B200 on STAC-AI™ LANG6

Score 18

What the numbers mean for financial services Executive summary Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...

inference benchmark hardware model-release evals

Open

High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating

PyTorch Foundation · open-source · 2026-05-14

PyTorch 2.12 Release Blog

Score 12

We are excited to announce the release of PyTorch® 2.12 (release notes)! The PyTorch 2.12 release features the following changes: Batched linalg.eigh on CUDA is up to 100x faster due...

kernel cuda model-release

Open

High signal Matched: cuda, release

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

  Jaehoon Lee Technical Content Manager, Nota AI   NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment.When a user...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

Open

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28

.apr-fig { text-align: center; margin: 1.35em 0; line-height: 1.4; } .apr-fig--wide img { display: inline-block; width: 100%; max-width: 100%; height: auto; vertical-align: middle; } .apr-fig--wide-0-8 { max-width: 80%; margin-left: auto;...

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

Open

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

Together AI · inference-infra · 2026-05-08

Deploy and inference any model from HuggingFace

Score 20

Learn how to deploy any Hugging Face model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.

inference hardware model-release

Open

High signal Matched: inference, gpu, release, model

LMCache · open-source · 2026-05-05

Deepseek V4 explained, and why it matters to your wallet

Score 12

DeepSeek V4 — an open weight model that gives you the state-of-the-art intelligence, while potentially gives you much cheaper token price than its preceding model, DeepSeek V3.2. But how does DeepSeek v4 does that? Pre-requisite: attention...

kv-cache model-release

Open

High signal Matched: kv cache, lmcache, model

Nota AI · korea · 2026-04-29

[NVIDIA Nemotron Hackathon] Grand Prize Among 20 Teams: Behind Two Sleepless Days

Score 32

  Hancheol Park, Ph. D.AI Research Engineer, NetsPresso Tech, Nota AI Geonmin Kim, Ph. D.AI Research Engineer, NetsPresso Tech, Nota AI Geonho LeeEdge AI Engineer Intern, NetsPresso Tech, Nota AI Jaehoon Lee Technical Content Manager,...

inference moe benchmark model-release research korea training fine-tuning quantization evals agents

Open

High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic

LMCache · open-source · 2026-04-23

LMCache on Amazon SageMaker HyperPod: Accelerating LLM Inference with Managed Tiered KV Cache

Score 30

Overview Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context length...

inference kv-cache benchmark hardware model-release cloud

Open

High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

  Jaehoon Lee Technical Content Manager, Nota AI   Series Notice: NetsPresso® Technical Blog, Part 2In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

Open

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

  Jaehoon Lee Technical Content Manager, Nota AI   AI Model Optimization: Why Models Won't Run on HardwareThe Chip Is Ready, but the Model Won't DeployIf you have ever tried deploying an AI model onto your own chip, the following...

inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

Open

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

  Jaehoon Lee Technical Content Manager, Nota AI   In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

Open

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

vLLM Project · open-source · 2026-03-24

Model Runner V2: A Modular and Faster Core for vLLM

Score 12

We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...

model-release api

Open

High signal Matched: model, api

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

  Jaehoon Lee Technical Content Manager, Nota AI   GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

Open

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Nota AI · korea · 2026-03-20

GenAI Everywhere: The Future of Edge AI Optimization with the New NetsPresso®

Score 26

  NP Product Team, Nota AI   The role of Edge AI is rapidly expanding.Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...

inference kv-cache moe benchmark model-release research korea quantization

Open

High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

  Hancheol Park, Ph. D. AI Research Engineer, Nota AI Tairen PiaoAI Research Engineer, Nota AI Tae-Ho KimCTO & Co-Founder, Nota AI ✔️ Resource : The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

Open

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

--> Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference serving benchmark model-release research training evals long-context rag

Open

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

llm-d · open-source · 2026-03-13

Predicted-Latency Based Scheduling for LLMs

Score 18

A lightweight ML model trained online from live traffic replaces manually tuned heuristic weights with direct latency predictions, achieving 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic wo...

benchmark model-release

Open

High signal Matched: latency, ttft, model, weights

AIBrix · open-source · 2026-03-03

AIBrix v0.6.0 Release: Envoy Sidecar, Mixed LLM Workloads Routing, Routing Profiles, LoRA Delivery & New APIs

Score 28

🚀 AIBrix v0.6.0 Release Today we’re excited to announce AIBrix v0.6.0, a release that expands how you deploy and route inference traffic. Key highlights include: Envoy Sidecar Support – Run Envoy alongside the gateway-plugin without...

inference model-release fine-tuning rag api

Open

High signal Matched: inference, prefill, release, model, lora, rerank, api, openai-compatible

Together AI · inference-infra · 2026-03-02

Introducing Together AI’s new look

Score 14

We've refreshed our visual identity — designed with Pentagram to express how Together AI connects open-source innovation, systems research, and builders to unlock new possibilities.

model-release research open-source

Open

High signal Matched: introducing, research, open-source

Nota AI · korea · 2026-02-26

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Score 24

  Jewon Lee | Wooksu Shin | Seungmin Yang | Ki-Ung Song | Donguk Lim | Jaeyeon Kim | Tae-Ho Kim |  Bo-Kyeong KimEdgeFM Team, Nota AI ✔️ Resources for more information: GitHub, ArXiv, Project Page, Demo.✔️ Accepted at ICLR 2026. &...

inference speculative-decoding benchmark model-release research training evals

Open

High signal Matched: inference, generation, verification, benchmark, performance, latency, cost, model, arxiv, evaluation, training, post-training, benchmarks

Together AI · inference-infra · 2026-02-02

Fine-tuning open LLM judges to outperform GPT-5.2

Score 14

Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...

inference benchmark model-release fine-tuning evals open-source

Open

High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss

vLLM Project · open-source · 2026-01-31

Streaming Requests & Realtime API in vLLM

Score 12

Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...

inference model-release api

Open

High signal Matched: inference, model, api

Together AI · inference-infra · 2026-01-26

DSGym: A holistic framework for evaluating and training data science agents

Score 18

Introducing DSGym—a holisti evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...

inference benchmark model-release research training evals agents open-source

Open

High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source

Together AI · inference-infra · 2026-01-13

Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale

Score 24

Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...

inference benchmark hardware model-release quantization agents

Open

High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents

BAIR · research · 2026-01-10

Information-Driven Design of Imaging Systems

Score 12

An encoder (optical system) maps objects to noiseless images, which noise corrupts into measurements. Our information estimator uses only these noisy measurements and a noise model to quantify how well measurements distinguish objects. Man...

benchmark model-release research training evals

Open

High signal Matched: performance, model, paper, evaluation, training, evaluate

SqueezeBits · korea · 2025-12-24

Introducing rebellions ATOM™-MAX

Score 24

Introducing ATOM™-Max, rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...

inference serving benchmark hardware model-release korea

Open

High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

  Seungmin YangEdgeFM Lead, Nota AI On this page ▾ SummaryWith the introduction of NVFP4—a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture—LLM inference achieves markedly improved efficiency.Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

Open

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-13

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Score 24

- Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...

inference speculative-decoding benchmark model-release training

Open

High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training

Together AI · inference-infra · 2025-12-03

Introducing AutoJudge: Streamlined inference acceleration via automated dataset curation

Score 20

AutoJudge accelerates LLM inference by identifying which token mismatches actually matter. Using self-supervised learning to train a lightweight classifier, it accepts up to 40 draft tokens per cycle—delivering 1.5–2× speedups over standar...

inference speculative-decoding model-release

Open

High signal Matched: inference, decoding, speculative decoding, introducing

AIBrix · open-source · 2025-11-10

AIBrix v0.5.0 Release: Batch API, KVCache v1 Connector, and Enhanced P/D orchestration

Score 22

🚀 AIBrix v0.5.0 Release Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for hi...

inference benchmark model-release research evals api

Open

High signal Matched: prefill, latency, release, evaluation, api, openai-compatible

BAIR · research · 2025-09-01

What exactly does word2vec learn?

Score 14

What exactly does word2vec learn, and how? Answering this question amounts to understanding representation learning in a minimal yet interesting language modeling task. Despite the fact that word2vec is a well-known precursor to modern lan...

benchmark model-release research training

Open

High signal Matched: benchmark, performance, model, weights, paper, training

AIBrix · open-source · 2025-08-05

AIBrix v0.4.0 Release: P/D Disaggregation and Expert Parallelism Support, KVCache v1 Connector, KV Event Synchronization & Multi‑Engine Support

Score 20

AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...

inference serving benchmark hardware model-release cloud

Open

High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud

Nota AI · korea · 2025-07-10

Video Self-Distillation for Single-Image Encoders: Learning Temporal Priors from Unlabeled Video

Score 20

  Marcel Simon, Ph. D.ML Researcher, Nota AI GmbH Tae-Ho KimCTO & Co-Founder, Nota AI Seul-Ki Yeom, Ph. D.Research Lead, Nota AI GmbH   SummaryProposes a simple next-frame prediction task using unlabeled video to enhance sing...

inference benchmark model-release research training fine-tuning evals

Open

High signal Matched: inference, performance, model, paper, research, training, fine-tuning, benchmarks

Replicate · inference-infra · 2025-07-07

Compare AI video models

Score 8

It's hard keeping up with every new video model. In this post we'll help you pick the best one for your needs.

model-release

Open

High signal Matched: model

BAIR · research · 2025-07-01

Whole-Body Conditioned Egocentric Video Prediction

Score 10

.modal { display: none; position: fixed; z-index: 9999; padding-top: 50px; left: 0; top: 0; width: 100%; height: 100%; overflow: auto; background-color: rgba(0,0,0,0.9); } .modal-content { margin: auto; display: block; max-width: 90%; max-...

inference benchmark model-release research training evals agents

Open

High signal Matched: inference, generation, performance, model, paper, arxiv, evaluation, training, evaluate, agent, agents

Modal · inference-infra · 2025-06-09

Introducing: Modal 1.0

Score 10

We've released v1.0 of the Modal client, marking a new milestone of maturity and stability for our platform.

model-release

Open

High signal Matched: introducing

AIBrix · open-source · 2025-05-22

AIBrix v0.3.0 Release: KVCache Offloading, Prefix Cache, Fairness Routing, and Benchmarking Tools

Score 24

AIBrix is a composable, cloud-native AI infrastructure toolkit designed to power scalable and cost-effective large language model (LLM) inference. As production demands for memory-efficient and latency-aware LLM services continue to grow,...

inference kv-cache benchmark model-release cloud

Open

High signal Matched: inference, prefix cache, latency, cost, release, model, cloud

llm-d · open-source · 2025-05-20

llm-d Press Release

Score 20

Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.

inference distributed model-release cloud open-source

Open

High signal Matched: inference, distributed, release, cloud, open source

Nota AI · korea · 2025-05-08

SplitQuant: Layer Splitting for Low-Bit Neural Network Quantization for Edge AI Devices

Score 20

  Jaewoo SongSoftware Engineer, Nota AI   SummaryThis study proposes an AI model preprocessing method for improved quantization accuracies on edge AI devices which do not support advanced quantization methods due to their limitat...

benchmark model-release research quantization

Open

High signal Matched: performance, model, weights, research, quantization, int8, int4

Nota AI · korea · 2025-05-07

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features</span#x3E;

Score 28

&nbsp; Jewon Lee | Ki-Ung Song | Seungmin Yang | Donguk Lim | Jaeyeon Kim | Wooksu Shin | Bo-Kyeong Kim | Tae-Ho KimEdgeFM Team, Nota AI Yong Jae Lee, Ph. D.Associate Professor, UW-Madison &nbsp; SummaryOur method, Trimmed-Llama, reduces t...

inference kv-cache benchmark model-release research training evals open-source

Open

High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source

BAIR · research · 2025-04-11

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Score 10

Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications. However, as LLMs have improved, so have the attacks against them. Prompt injection attack is listed as the #1 threat by OWASP to LLM-integrated ap...

benchmark model-release research training fine-tuning evals rag api frontier-model

Open

High signal Matched: cost, model, evaluation, training, dpo, fine-tuning, retrieval, api, sota

BAIR · research · 2025-04-08

Repurposing Protein Folding Models for Generation with Latent Diffusion

Score 20

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment...

inference benchmark model-release research training rag

Open

High signal Matched: inference, generation, cost, model, weights, research, training, retrieval

Nota AI · korea · 2025-04-08

UniForm: A Reuse Attention Mechanism for Efficient Transformers on Resource-Constrained Edge Devices

Score 24

&nbsp; Seul-Ki Yeom, Ph. D. Research Lead, Nota AI GmbH Tae-Ho KimCTO &amp; Co-Founder, Nota AI &nbsp; SummaryDelivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems.Introduces a novel "Reus...

inference kernel benchmark model-release research evals

Open

High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks

AIBrix · open-source · 2025-03-10

DeepSeek-R1 671B multi-host Deployment in AIBrix

Score 20

This blog post introduces deploying DeepSeek R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...

inference distributed benchmark model-release training long-context

Open

High signal Matched: inference, distributed, benchmark, model, weights, training, context length

AIBrix · open-source · 2025-02-21

Introducing AIBrix: Cost-Effective and Scalable Control Plane for vLLM

Score 26

Open-source large language models (LLMs) like LLaMA, Deepseek, Qwen and Mistral etc have surged in popularity, offering enterprises greater flexibility, cost savings, and control over their AI deployments. These models have empowered organ...

inference benchmark model-release agents open-source

Open

High signal Matched: inference, generation, latency, cost, introducing, model, agents, open-source

AIBrix · open-source · 2025-02-19

AIBrix v0.2.0 Release: Distributed KV Cache, Orchestration and Heterogeneous GPU Support

Score 42

We&rsquo;re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...

inference serving distributed kv-cache benchmark hardware model-release agents

Open

High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent

AIBrix · open-source · 2024-11-13

Introducing AIBrix v0.1.0: Building the Future of Scalable, Cost-Effective AI Infrastructure for Large Models

Score 32

In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...

inference kv-cache benchmark hardware model-release cloud open-source

Open

High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source

Replicate · inference-infra · 2024-10-03

FLUX1.1 [pro] is here

Score 10

Black Forest Labs continue to push boundaries with their latest release of FLUX.1 image generation model.

inference model-release

Open

High signal Matched: generation, release, model

Nota AI · korea · 2024-08-02

Deploying an Efficient Vision-Language Model on Mobile Devices

Score 38

&nbsp; Jaeyeon KimResearch Engineer, Nota AI Geonmin KimResearch Engineer, Nota AI Hancheol ParkTeam Lead of NetsPresso Application, Nota AI &nbsp; IntroductionRecent large language models (LLMs) have demonstrated unprecedented performance...

inference benchmark model-release research cloud training fine-tuning evals open-source

Open

High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source

Replicate · inference-infra · 2024-06-12

Run Stable Diffusion 3 with an API

Score 8

Stable Diffusion 3 is the latest text-to-image model from Stability, with improved image quality, typography, prompt understanding, and resource efficiency. Learn how to run it in the cloud with one line of code.

model-release cloud api

Open

High signal Matched: model, cloud, api

Modal · inference-infra · 2024-02-27

Introducing: WebSockets on Modal

Score 10

Modal now supports WebSocket connections, enabling real-time, bidirectional data transfer between client and server.

model-release

Open

High signal Matched: introducing

Replicate · inference-infra · 2023-07-27

Run Llama 2 with an API

Score 8

Llama 2 is the first open source language model of the same caliber as OpenAI’s models. Learn how to run it in the cloud with one line of code.

model-release cloud api open-source

Open

High signal Matched: model, cloud, api, open source

Hugging Face · open-source · 2022-12-20

Model Cards

Score 10

No feed summary available yet.

model-release

Open

High signal Matched: model

BAIR · research · 2025-11-01

RL without TD learning

Score 4

In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: divide and conquer. Unlike traditional methods, this algorithm is not based on temporal difference (TD) learning (which has scalabilit...

benchmark model-release research training

Open

Watchlist Matched: benchmark, performance, model, paper, training

LY Corporation Tech Blog · korea · 2025-10-20

End to End Testing on PRs

Score 4

At LY Corporation we're constantly working to improve our pre-release test process and reduce the ri...

model-release

Open

Watchlist Matched: release

BAIR · research · 2025-03-25

Scaling Up Reinforcement Learning for Traffic Smoothing: A 100-AV Highway Deployment

Score 6

Training Diffusion Models with Reinforcement Learning We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and...

serving kernel benchmark model-release research training agents

Open

Watchlist Matched: throughput, kernel, performance, model, paper, training, agent, agents

Replicate · inference-infra · 2025-03-05

Wan2.1 parameter sweep

Score 6

We've been playing with Alibaba's WAN2.1 text-to-video model lately. What happens when you tweak those mysterious parameters? Let's find out.

model-release

Open

Watchlist Matched: model

Replicate · inference-infra · 2024-08-01

Run FLUX with an API

Score 6

FLUX.1 is a new text-to-image model from Black Forest Labs, the creators of Stable Diffusion, that exceeds the capabilities of previous open-source models.

model-release api open-source

Open

Watchlist Matched: model, api, open-source

Replicate · inference-infra · 2023-08-22

Painting with words: a history of text-to-image AI

Score 6

With the recent release of Stable Diffusion XL fine-tuning on Replicate, and today being the 1-year anniversary of Stable Diffusion, now feels like the perfect opportunity to take a step back and reflect on how text-to-image AI has improve...

model-release fine-tuning

Open

Watchlist Matched: release, fine-tuning