benchmark

inference benchmark cloud agents

High signal Matched: performance, model, training, checkpointing, fine-tuning

AWS Machine Learning Blog · cloud · 2026-06-02

OpenAI models and Codex on Amazon Bedrock are now generally available

Score 13

GPT-5.5, GPT-5.4, and Codex are now generally available on Amazon Bedrock. Deploy them in production applications and agents today, on Bedrock’s high performance inference engine. 

inference serving distributed benchmark hardware model-release rag agents

High signal Matched: inference, performance, bedrock, agents

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

benchmark hardware agents

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

AMD ROCm Blogs · hardware · 2026-06-01

Out-of-the-Box ROLL Support on AMD GPUs: Accelerating Reinforcement Learning at Scale

Score 13

Reinforcement learning (RL) is rapidly becoming a foundational technology for Large Language Models (LLMs)—powering key abilities such as reasoning and agentic behaviors. As RL workloads grow more complex and computationally intensive, the...

High signal Matched: performance, gpu, agentic

AMD ROCm Blogs · hardware · 2026-06-01

Performance Profiling on AMD GPUs - Part 4: Fortran OpenMP Offload Edition

Score 11

This blog, like the previous articles in the profiling guide series (Part 1, Part 2, and Part 3), is designed to help you systematically analyze and improve the performance of your Fortran OpenMP offload applications running on AMD GPUs. T...

inference serving benchmark hardware model-release research quantization evals

High signal Matched: performance

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

AWS Machine Learning Blog · cloud · 2026-05-29

Build a test suite that grows with your agent with dataset management in Amazon Bedrock AgentCore

Score 13

Datasets in AgentCore is in public preview. Agent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchm...

benchmark research cloud evals agents

inference speculative-decoding benchmark hardware model-release

High signal Matched: benchmark, evaluation, bedrock, agent

AMD ROCm Blogs · hardware · 2026-05-29

Enabling Speculative Speculative Decoding on MI300X

Score 29

Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...

High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model

AMD ROCm Blogs · hardware · 2026-05-29

Running Variational Quantum Eigensolver with Qiskit Aer on AMD Instinct

Score 13

Quantum computing offers a fundamentally different approach to computational problems by leveraging quantum mechanical properties such as superposition and entanglement. Unlike a classical bit, which is always 0 or 1, a qubit can exist in...

inference benchmark hardware model-release agents

High signal Matched: benchmark, cost, gpu

PyTorch Foundation · open-source · 2026-05-28

Up to 580tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed

Score 17

TL;DR: The TokenSpeed inference engine achieved a record-breaking 580 tps running the Qwen3.5-397B-A17B model on GPUs. This extreme performance for agentic workloads is driven by systematic elimination of memory copies,...

inference benchmark agents

High signal Matched: inference, performance, gpu, model, agentic

vLLM Project · open-source · 2026-05-28

Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor

Score 15

As organizations increasingly adopt AI-powered development tools, the need for high-performance agentic models that deliver both accuracy and operational efficiency has become critical. Laguna...

kernel benchmark model-release quantization

High signal Matched: inference, performance, agentic

AMD ROCm Blogs · hardware · 2026-05-27

Deep Dive Into 4-Wave Interleave FP8 GEMM

Score 17

Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves th...

High signal Matched: gemm, performance, fp8

NVIDIA Technical Blog · hardware · 2026-05-26

Extract More Kernel Performance with NVIDIA CompileIQ Auto-Tuning

Score 17

NVIDIA CompileIQ tackles one of the hardest problems in performance engineering: finding the compiler options that unlock the best performance for a specific...

kernel benchmark

kernel cuda benchmark hardware

High signal Matched: kernel, performance

NVIDIA Technical Blog · hardware · 2026-05-26

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

Score 21

Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...

kernel cuda benchmark hardware model-release

High signal Matched: cuda, performance, gpu

NVIDIA Technical Blog · hardware · 2026-05-26

NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates

Score 21

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...

inference serving benchmark model-release open-source

High signal Matched: cuda, performance, gpu, launch

Lambda · cloud · 2026-05-22

DeepSeek V4: the most expected open-source model ever released, and the quietest landing

Score 18

After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....

inference serving kernel triton benchmark model-release cloud open-source

High signal Matched: inference, serving, performance, cost, release, model, open-source

AMD ROCm Blogs · hardware · 2026-05-22

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Score 30

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...

High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source

AMD ROCm Blogs · hardware · 2026-05-22

From Naive to Near-Peak: Building High-Performance GEMM Kernels with Gluon

Score 18

On a single MI355, our most-optimized FP16 GEMM kernel runs at 99% MFMA efficiency — the matrix engine sits idle for a handful of cycles per loop. Getting there took ten versions, a regression along the way, and a profiler open for the who...

kernel benchmark

High signal Matched: kernel, gemm, performance

Lambda · cloud · 2026-05-21

Lambda Bare Metal Instances: full hardware control with API-driven operations

Score 8

The unit of AI compute has shifted from single hosts to rack-scale systems that integrate NVIDIA GPUs, CPUs, scale-up networking fabrics, and liquid cooling, such as the NVIDIA GB300 NVL72 and NVIDIA Vera Rubin NVL72. Teams at the frontier...

inference serving benchmark cloud training api

High signal Matched: serving, performance, cloud, training, api

NVIDIA Technical Blog · hardware · 2026-05-21

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling

Score 16

As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on...

inference benchmark hardware model-release evals

High signal Matched: performance, gb200

Lambda · cloud · 2026-05-20

Lambda’s NVIDIA HGX B200 on STAC-AI™ LANG6

Score 18

What the numbers mean for financial services Executive summary Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...

benchmark hardware open-source

High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating

AMD ROCm Blogs · hardware · 2026-05-20

ROCm 7.13: Expanding Hardware, Tools, and Reach

Score 14

AMD released ROCm Core 7.13, the AMD GPU Driver 31.30, and AMD GPU Virtualization 9.0. With these releases, ROCm software expands hardware support across enterprise datacenters. The platform introduces AMD’s latest Instinct accelerators, e...

inference benchmark model-release quantization

High signal Matched: performance, gpu, rocm, open-source

AMD ROCm Blogs · hardware · 2026-05-20

QuickReduce FP4 Quantization and Benchmarking on MI355

Score 12

Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent,...

benchmark model-release research evals agents

High signal Matched: inference, latency, introducing, quantization

NVIDIA Technical Blog · hardware · 2026-05-19

Mastering Agentic Techniques: AI Agent Evaluation

Score 16

Evaluating an AI model and evaluating an AI agent are related—but they answer fundamentally different questions. A model benchmark tests the capability of a...

inference benchmark evals agents

High signal Matched: benchmark, model, evaluation, evaluating, agent, agentic

Together AI · inference-infra · 2026-05-19

Benchmarking inference at scale: coding agents

Score 16

Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.

inference serving benchmark model-release research api

High signal Matched: inference, ttft, cost, benchmarks, agents

Together AI · inference-infra · 2026-05-15

Together AI and Pearl Research Labs Team Up to Reduce the Cost of AI Inference

Score 24

Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.

benchmark research open-source

High signal Matched: inference, endpoint, cost, launch, research

Microsoft Research · big-tech · 2026-05-14

mimalloc: A new, high-performance, scalable memory allocator for the modern era

Score 8

mimalloc is an open-source, modern, scalable memory allocator that is a drop-in replacement for malloc and free. It is relatively small (~12K lines), with clear internal data structures, and is easy to build and integrate into other projec...

benchmark model-release research

High signal Matched: performance, research, open-source

Microsoft Research · big-tech · 2026-05-14

GridSFM: A new, small foundation model for the electric grid

Score 12

Introducing GridSFM, a small foundation model that can predict AC optimal power flow in milliseconds, boosting efficiency and unlocking cost savings. Learn how GridSFM gives grid operators direct visibility into congestion, stability, and...

inference serving kv-cache moe benchmark

High signal Matched: cost, introducing, model, research

vLLM Project · open-source · 2026-05-14

Elastic Expert Parallelism in vLLM

Score 16

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

benchmark model-release evals

High signal Matched: serving, throughput, kv cache, moe

AI2 · research · 2026-05-13

Introducing AIMIP: The AI weather and climate model intercomparison project

Score 14

AIMIP is a new open benchmark and dataset for evaluating AI climate models, showing they can match or beat conventional models on some historical climate metrics while still struggling to generalize reliably to long-term warming trends and...

High signal Matched: benchmark, introducing, model, evaluating

Cloudflare Blog · cloud · 2026-05-12

When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

Score 8

We investigated a bug where CUBIC's congestion window became pinned at its minimum floor, causing performance to plummet. The fix involved correctly measuring idle periods to distinguish RTT wait times from actual application idleness.

kernel benchmark

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: kernel, performance

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

benchmark hardware quantization

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

vLLM Project · open-source · 2026-05-11

A First Comprehensive Study of TurboQuant: Accuracy and Performance

Score 14

TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

High signal Matched: performance, gpu, quantization

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28


moe benchmark model-release training

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

AI2 · research · 2026-05-08

EMO: Pretraining mixture of experts for emergent modularity

Score 12

EMO is a new mixture-of-experts model trained so modular expert groups emerge from data, enabling users to select small task-specific expert subsets while preserving near full-model performance.

inference benchmark model-release training quantization

High signal Matched: mixture of experts, performance, model, pretraining

NVIDIA Technical Blog · hardware · 2026-05-07

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer

Score 16

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By...

distributed benchmark hardware training

High signal Matched: inference, performance, model, training, post-training, quantization

NVIDIA Technical Blog · hardware · 2026-05-07

Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus

Score 20

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down,...

inference serving distributed kv-cache benchmark agents

High signal Matched: distributed, nccl, performance, gpu, training

vLLM Project · open-source · 2026-05-06

Serving Agentic Workloads at Scale with vLLM x Mooncake

Score 18

TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...

High signal Matched: serving, throughput, distributed, kv cache, agentic

Modal · inference-infra · 2026-05-04

Boosting multimodal inference performance by >10% with a single Python dictionary

Score 16

If we've said it once, we've said it once per millisecond: never block the GPU.

serving benchmark model-release

High signal Matched: inference, performance, gpu

Cloudflare Blog · cloud · 2026-05-01

Introducing Dynamic Workflows: durable execution that follows the tenant

Score 10

Dynamic Workflows is a library that lets you route durable execution to tenant-provided code on the fly. Built on Dynamic Workers, it enables platforms to serve millions of unique workflows at near-zero idle cost.

High signal Matched: serve, cost, introducing

NVIDIA Technical Blog · hardware · 2026-04-30

Speed Up Unreal Engine NNE Inference with NVIDIA TensorRT for RTX Runtime

Score 14

Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches...

High signal Matched: inference, performance

Nota AI · korea · 2026-04-29

[NVIDIA Nemotron Hackathon] Grand Prize Among 20 Teams: Behind Two Sleepless Days

Score 32

Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...

inference moe benchmark model-release research korea training fine-tuning quantization evals agents

inference kv-cache benchmark hardware model-release cloud

High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic

LMCache · open-source · 2026-04-23

LMCache on Amazon SageMaker HyperPod: Accelerating LLM Inference with Managed Tiered KV Cache

Score 30

Overview Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context length...

High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

inference serving benchmark model-release training quantization

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

NVIDIA Technical Blog · hardware · 2026-04-20

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

Score 18

As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...

benchmark model-release research training evals

High signal Matched: generation, throughput, fp8, training

BAIR · research · 2026-04-20

Gradient-based Planning for World Models at Longer Horizons

Score 16


High signal Matched: performance, model, paper, arxiv, evaluation, training

Together AI · inference-infra · 2026-04-15

Parcae: Doing more with fewer parameters using stable looped models

Score 14

Parcae is a stable looped language model that matches the quality of a Transformer twice its size — a 770M model reaching 1.3B-level performance. We introduce the first scaling laws for looping and show that increasing recurrence, not just...

kernel cuda benchmark hardware

High signal Matched: performance, model

NVIDIA Technical Blog · hardware · 2026-04-14

NVIDIA NVbandwidth: Your Essential Tool for Measuring GPU Interconnect and Memory Performance

Score 18

When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is data transfer performance. This applies to...

inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

High signal Matched: cuda, performance, gpu

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

vLLM Project · open-source · 2026-04-07

Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation

Score 22

TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...

High signal Matched: inference, prefill, itl, gpu, mi300x

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

serving benchmark hardware model-release

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

NVIDIA Technical Blog · hardware · 2026-04-02

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

Score 14

In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

High signal Matched: throughput, gpu, model

NVIDIA Technical Blog · hardware · 2026-04-02

Achieving Single-Digit Microsecond Latency Inference for Capital Markets

Score 8

In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use...

High signal Matched: inference, latency

Modular · inference-infra · 2026-04-02

Day Zero Launch: Fastest Performance for Gemma 4 on NVIDIA and AMD

Score 14


High signal Matched: performance, launch

NVIDIA Technical Blog · hardware · 2026-04-01

NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design

Score 14

Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak...

serving benchmark

High signal Matched: throughput, cost

NVIDIA Technical Blog · hardware · 2026-04-01

Accelerate Token Production in AI Factories Using Unified Services and Real-Time AI

Score 12

In today’s AI factory environment, performance is not theoretical. It is economic, competitive, and existential. A 1% drop in usable GPU time can mean...

inference serving kv-cache benchmark hardware model-release research training fine-tuning quantization agents frontier-model

High signal Matched: performance, gpu

Nota AI · korea · 2026-03-31

The Real Reason TurboQuant Shook the Market: AI Optimization Has Gone Mainstream

Score 46

Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...

High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model

Together AI · inference-infra · 2026-03-26

Plan, divide, and conquer: How weak models excel at long context tasks

Score 10

As context windows grow, LLM performance degrades in unexpected ways. We show how a "Divide & Conquer" framework — breaking long documents into parallel chunks with a planner, workers, and manager — lets smaller models like Llama-3-70B and...

benchmark long-context

serving benchmark hardware model-release

High signal Matched: performance, long context

NVIDIA Technical Blog · hardware · 2026-03-25

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

Score 18

In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

High signal Matched: throughput, gpu, model

NVIDIA Technical Blog · hardware · 2026-03-25

Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per Watt

Score 12

In the AI era, power is the ultimate constraint, and every AI factory operates within a hard limit. This makes performance per watt—the rate at which power is...

High signal Matched: performance

NVIDIA Technical Blog · hardware · 2026-03-23

NVIDIA IGX Thor Powers Industrial, Medical, and Robotics Edge AI Applications

Score 10

Industrial and medical systems are rapidly increasing the use of high-performance AI to improve worker productivity, human-machine interaction, and downtime...

High signal Matched: performance

Nota AI · korea · 2026-03-23

[GTC 2026 Recap] The Trillion-Dollar Inference Race Begins: How Nota AI Fills the Gap

Score 42

Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...

inference serving kernel cuda kv-cache benchmark hardware model-release research cloud training long-context agents open-source

inference kv-cache moe benchmark model-release research korea quantization

High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source

Nota AI · korea · 2026-03-20

GenAI Everywhere: The Future of Edge AI Optimization with the New NetsPresso®

Score 26

NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...

serving benchmark model-release training fine-tuning

High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization

Together AI · inference-infra · 2026-03-18

Together AI expands fine-tuning service with tool calling, reasoning, and vision support

Score 14

Together AI expands fine-tuning with native support for tool call, reasoning, and vision-language models, plus 100B+ model training, up to 6× higher throughput, and job cost and ETA estimates.

High signal Matched: throughput, cost, model, training, fine-tuning

Hugging Face · open-source · 2026-03-17

Holotron-12B - High Throughput Computer Use Agent

Score 10

No feed summary available yet.

serving benchmark agents

High signal Matched: throughput, agent, computer use

NVIDIA Technical Blog · hardware · 2026-03-16

NVIDIA Vera CPU Delivers High Performance, Bandwidth, and Efficiency for AI Factories

Score 12

AI is evolving, and reasoning models are increasing token demand, placing new requirements on every layer of AI infrastructure. More than ever, compute must...

High signal Matched: performance

NVIDIA Technical Blog · hardware · 2026-03-16

Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

Score 20

NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, latency, accelerator

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving benchmark model-release research training evals long-context rag

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

llm-d · open-source · 2026-03-13

Predicted-Latency Based Scheduling for LLMs

Score 18

A lightweight ML model trained online from live traffic replaces manually tuned heuristic weights with direct latency predictions, achieving 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic wo...

High signal Matched: latency, ttft, model, weights

Together AI · inference-infra · 2026-03-12

Build real-time voice agents on Together AI

Score 10

Build real-time voice agents on Together AI with co-located STT, LLM, and TTS infrastructure, native Deepgram and Cartesia support, and end-to-end latency under 500ms.

serving kernel benchmark hardware

High signal Matched: latency, agents

Together AI · inference-infra · 2026-03-05

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Score 20

As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...

High signal Matched: throughput, kernel, flashattention, gpu

Modular · inference-infra · 2026-03-05

Structured Mojo Kernels Part 1 - Peak Performance, Half the Code

Score 10

inference serving benchmark long-context

High signal Matched: performance

Together AI · inference-infra · 2026-03-04

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Score 20

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...

High signal Matched: inference, serving, prefill, throughput, long-context

vLLM Project · open-source · 2026-02-27

Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm

Score 20

For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.

inference speculative-decoding benchmark model-release research training evals

High signal Matched: inference, performance, rocm

Nota AI · korea · 2026-02-26

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Score 24

High signal Matched: inference, generation, verification, benchmark, performance, latency, cost, model, arxiv, evaluation, training, post-training, benchmarks

SkyPilot · open-source · 2026-02-21

SkyPilot Admin Policies: Enforce GPU Governance Without Slowing Down AI Teams

Score 12

SkyPilot Admin Policies let you enforce cost controls, security rules, and compliance requirements automatically — without slowing down your engineering team.

inference benchmark training

High signal Matched: cost, gpu

Together AI · inference-infra · 2026-02-19

Consistency diffusion language models: Up to 14x faster inference without sacrificing quality

Score 14

Standard diffusion language models can't use KV caching and need too many refinement steps to be practical. CDLM fixes both with a post-training recipe that enables exact block-wise KV caching and trajectory-consistent step reduction — del...

serving moe benchmark hardware quantization

High signal Matched: inference, latency, training, post-training

vLLM Project · open-source · 2026-02-13

DeepSeek-V3.2 on GB300: Performance Breakthrough

Score 22

DeepSeek-V3.2 (NVFP4 + TP2) has run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization

Google Research · big-tech · 2026-02-11

Scheduling in a changing world: Maximizing throughput with time-varying capacity

Score 8

Algorithms & Theory

serving benchmark

inference serving kv-cache benchmark hardware

High signal Matched: throughput

llm-d · open-source · 2026-02-10

Native KV Cache Offloading to Any Filesystem with llm-d

Score 20

llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.

inference benchmark fine-tuning

High signal Matched: inference, throughput, kv cache, ttft, gpu

llm-d · open-source · 2026-02-04

llm-d 0.5: Sustaining Performance at Scale

Score 16

llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.

inference serving benchmark hardware

High signal Matched: inference, performance, lora

vLLM Project · open-source · 2026-02-03

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Score 24

Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...

inference benchmark model-release fine-tuning evals open-source

High signal Matched: serving, throughput, performance, h200, gb200, blackwell

Together AI · inference-infra · 2026-02-02

Fine-tuning open LLM judges to outperform GPT-5.2

Score 14

Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...

High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss

Together AI · inference-infra · 2026-02-02

Together Evaluations now supports comparing top commercial APIs vs. open source models

Score 12

Together Evaluations now supports OpenAI, Anthropic, and Google models for cross-provider benchmarking. Compare open-source, fine-tuned, and proprietary models side-by-side to make data-driven decisions on quality, cost, and performance—al...

benchmark open-source

benchmark hardware model-release open-source

High signal Matched: performance, cost, open-source, open source

vLLM Project · open-source · 2026-02-01

GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier

Score 18

TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...

inference benchmark model-release research training evals agents open-source

High signal Matched: performance, blackwell, model, open-source, oss

Together AI · inference-infra · 2026-01-26

DSGym: A holistic framework for evaluating and training data science agents

Score 18

Introducing DSGym, a holistic evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...

inference serving benchmark hardware

High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source

Together AI · inference-infra · 2026-01-22

Optimizing inference speed and costs: Lessons learned from large-scale deployments

Score 22

Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.

inference benchmark hardware model-release quantization agents

High signal Matched: inference, throughput, latency, cost, gpu

Together AI · inference-infra · 2026-01-13

Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale

Score 24

Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...

benchmark model-release research training evals

High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents

BAIR · research · 2026-01-10

Information-Driven Design of Imaging Systems

Score 12

An encoder (optical system) maps objects to noiseless images, which noise corrupts into measurements. Our information estimator uses only these noisy measurements and a noise model to quantify how well measurements distinguish objects. Man...

benchmark model-release evals open-source

High signal Matched: performance, model, paper, evaluation, training, evaluate

Together AI · inference-infra · 2026-01-08

How to choose the right open model for production

Score 20

Learn how to choose the right open-source model for production by evaluating model quality, benchmarking performance, and deploying open models that balance cost, speed, and accuracy.

inference serving kv-cache benchmark

High signal Matched: performance, cost, model, open model, evaluating, open-source

vLLM Project · open-source · 2026-01-08

Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput

Score 18

In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...

inference serving benchmark hardware model-release korea

High signal Matched: inference, throughput, kv cache

SqueezeBits · korea · 2025-12-24

Introducing rebellions ATOM™-MAX

Score 24

Introducing ATOM™-Max, Rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...

High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions

Together AI · inference-infra · 2025-12-23

MiniMax Speech 2.6 Turbo now available natively on Together AI

Score 10

MiniMax Speech 2.6 Turbo: State-of-the-art multilingual TTS with human-level emotional awareness, sub-250ms latency, and 40+ languages—now on Together AI.

High signal Matched: latency

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

vLLM Project · open-source · 2025-12-19

vLLM-Omni Diffusion Cache Acceleration

Score 10

We are thrilled to announce a major performance update for vLLM-Omni.

High signal Matched: performance

Together AI · inference-infra · 2025-12-17

Research POV: Yes, AGI Can Happen – A Computational Perspective

Score 14

Dan Fu, our VP of Kernels, has published a new post challenging the idea that AI is hitting a hardware wall. He argues that we are vastly underutilizing current chips and that better software-hardware co-design will unlock the next order o...

benchmark research

High signal Matched: performance, research

vLLM Project · open-source · 2025-12-16

AMD × vLLM Semantic Router: Building the System Intelligence Together

Score 14

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...

moe benchmark

inference serving moe benchmark model-release

High signal Matched: router, performance

vLLM Project · open-source · 2025-12-13

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Score 26

Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...

inference speculative-decoding benchmark model-release training

High signal Matched: serving, prefill, router, performance, model

vLLM Project · open-source · 2025-12-13

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Score 24

Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...

High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training

SqueezeBits · korea · 2025-12-10

vLLM Hands-on Workshop with Rebellions & SqueezeBits: A Recap

Score 12

Rebellions and SqueezeBits Co-Host a vLLM Hands-on Workshop: Workshop Highlights, PyTorch Best Practices, Performance Optimization, and Developer First-Hand Tips!

benchmark korea

High signal Matched: performance, rebellions

Google Research · big-tech · 2025-12-04

From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence

Score 8

Machine Intelligence

High signal Matched: benchmark

Modal · inference-infra · 2025-12-02

Modal + Mistral 3: 10x faster cold starts with GPU snapshotting

Score 12

We've partnered with Mistral to bring you Day 0 support for Mistral 3, with GPU-snapshot-optimized performance.

inference kv-cache speculative-decoding moe benchmark hardware frontier-model

High signal Matched: performance, gpu

llm-d · open-source · 2025-12-02

llm-d 0.4: Achieve SOTA Performance Across Accelerators

Score 30

llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.

inference serving benchmark

High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota

AIBrix · open-source · 2025-11-26

PrisKV: A Colocated Tiered KVCache Store for LLM Serving

Score 22

In recent years, large language models (LLMs) such as GPT, DeepSeek, Doubao and Qwen have advanced rapidly and are reshaping a wide range of industries. As the Scaling Law continues to be validated and pushed to its limits, LLM capabilitie...

benchmark hardware model-release

High signal Matched: inference, serving, generation, throughput, performance, latency, cost

Modal · inference-infra · 2025-11-19

How Reducto improved enterprise-scale document processing latency by 3x

Score 14

Learn how Reducto used GPU memory snapshotting and flexible autoscaling to build fast multi-model pipelines.

High signal Matched: latency, gpu, model

Modal · inference-infra · 2025-11-13

How Decagon shipped real-time voice AI on Modal

Score 10

How Decagon and Modal made real-time voice AI possible, combining fine-tuned small models with a re-engineered inference runtime for sub-second latency.

inference benchmark model-release research evals api

High signal Matched: inference, latency

AIBrix · open-source · 2025-11-10

AIBrix v0.5.0 Release: Batch API, KVCache v1 Connector, and Enhanced P/D orchestration

Score 22

Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for hi...

High signal Matched: prefill, latency, release, evaluation, api, openai-compatible

Modal · inference-infra · 2025-11-04

One-second voice-to-voice latency with Modal, Pipecat, and open models

Score 12

How we built a real-time voice bot on Modal's distributed serverless platform.

distributed benchmark

inference benchmark agents open-source

High signal Matched: distributed, latency

Together AI · inference-infra · 2025-11-04

Announcing the fastest inference for realtime voice AI agents

Score 14

Together AI launches the fastest voice AI stack: streaming Whisper STT, serverless open-source TTS (Orpheus & Kokoro), and Voxtral transcription. Sub-second latency for production voice agents.

High signal Matched: inference, latency, agents, open-source

Together AI · inference-infra · 2025-11-04

How to evaluate and benchmark Large Language Models (LLMs)

Score 12

Understanding how to evaluate and benchmark Large Language Models (LLMs). Test, compare, and understand LLMs.

benchmark evals

inference benchmark model-release

High signal Matched: benchmark, evaluate

SqueezeBits · korea · 2025-10-31

Winning both speed and quality: How Yetter deals with diffusion models

Score 16

Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable,...

High signal Matched: inference, generation, latency, model

Modal · inference-infra · 2025-10-29

Modal + Datalab: Deploy high-throughput document intelligence in <5 minutes

Score 10

We've collaborated with Datalab, the creators of Marker and Surya, to make it faster than ever to deploy document intelligence workflows.

serving benchmark

inference serving kernel benchmark hardware training

High signal Matched: throughput

SqueezeBits · korea · 2025-10-28

[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Score 20

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.

High signal Matched: inference, serving, gemm, performance, h100, training

Together AI · inference-infra · 2025-10-22

Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study

Score 12

ReasonIF finds frontier LRMs fail to follow reasoning instructions >75% of the time; introduces a benchmark across languages, formatting, and length.

High signal Matched: benchmark

Modular · inference-infra · 2025-10-17

Achieving State-of-the-Art Performance on AMD MI355 — in Just 14 Days

Score 10

inference moe benchmark hardware

High signal Matched: performance

Together AI · inference-infra · 2025-10-10

AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators

Score 20

LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.

High signal Matched: inference, deepseek-v3, performance, accelerator

llm-d · open-source · 2025-10-10

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

Score 20

llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.

High signal Matched: inference, latency, tokens/sec, gpu, tpu

SqueezeBits · korea · 2025-10-02

Yetter, the GenAI API service: AI Optimization, Out of the Box

Score 14

Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimization inference engine, it delivers reliable image, video, and future LLM services at a fraction of the cost.

inference benchmark api

inference serving distributed benchmark evals

High signal Matched: inference, cost, api

llm-d · open-source · 2025-09-24

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

Score 18

See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.

High signal Matched: inference, throughput, distributed, benchmarks

SqueezeBits · korea · 2025-09-16

Guided Decoding Performance on vLLM and SGLang

Score 16

The guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output based on your use case.

inference benchmark model-release api

High signal Matched: decoding, benchmark, performance

Together AI · inference-infra · 2025-09-15

Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase

Score 18

Our new Batch Inference API makes large-scale AI workloads simpler, faster, and cheaper. With a streamlined UI, universal model support, and 3000× higher rate limits—now up to 30B tokens—you can process massive datasets at half the cost of...

benchmark hardware frontier-model

High signal Matched: inference, cost, model, api

Modular · inference-infra · 2025-09-12

Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance

Score 14

High signal Matched: performance, blackwell, sota

Modal · inference-infra · 2025-09-09

Introducing Notebooks

Score 12

A collaborative environment for high-performance interactive computing on GPUs.

inference serving benchmark

High signal Matched: performance, introducing

llm-d · open-source · 2025-09-03

Intelligent Inference Scheduling with llm-d

Score 16

Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.

benchmark model-release research training

High signal Matched: inference, throughput, latency

BAIR · research · 2025-09-01

What exactly does word2vec learn?

Score 14

What exactly does word2vec learn, and how? Answering this question amounts to understanding representation learning in a minimal yet interesting language modeling task. Despite the fact that word2vec is a well-known precursor to modern lan...

High signal Matched: benchmark, performance, model, weights, paper, training

Modal · inference-infra · 2025-08-28

How Zencastr transcribed hundreds of years worth of audio in just a few days

Score 8

Zencastr scaled up to 1,500 concurrent GPUs on Modal to process hundreds of years of podcast audio in just a few days. Today they run transcription, speaker detection, and audio enrichment for millions of podcast episodes on Modal, giving...

benchmark fine-tuning open-source

High signal Matched: cost

Together AI · inference-infra · 2025-08-19

Transform OpenAI gpt-oss Models into Domain Experts with Together AI Fine-Tuning

Score 10

Customize OpenAI’s gpt-oss-20B/120B with Together AI’s fine-tuning: train, optimize, and instantly deploy domain experts with enterprise reliability and cost efficiency.

inference serving benchmark hardware model-release cloud

High signal Matched: cost, fine-tuning, oss

AIBrix · open-source · 2025-08-05

AIBrix v0.4.0 Release: P/D Disaggregation and Expert Parallelism Support, KVCache v1 Connector, KV Event Synchronization & Multi‑Engine Support

Score 20

AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...

High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud

Hugging Face · open-source · 2025-08-01

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

Score 10

No feed summary available yet.

benchmark model-release open-source

High signal Matched: benchmark

Together AI · inference-infra · 2025-07-28

Together Evaluations: Benchmark Models for Your Tasks

Score 16

Together Evaluations is a flexible framework for benchmarking LLMs using strong open-source models as judges. Skip manual labeling and rigid metrics—get fast, customizable insights into model quality for your specific tasks.

High signal Matched: benchmark, model, open-source

SqueezeBits · korea · 2025-07-21

GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost

Score 20

LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses these issues through finer-grained, block-wise updates, significantly enhancing performance and expressivi...

benchmark fine-tuning

distributed benchmark hardware cloud

High signal Matched: performance, cost, fine-tuning, lora

SkyPilot · open-source · 2025-07-16

The Evolution of AI Job Orchestration. Part 2: The AI-Native Control Plane & Orchestration that Finally Works for ML

Score 16

This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...

benchmark model-release agents open-source

High signal Matched: infiniband, performance, cost, gpu, cloud

Together AI · inference-infra · 2025-07-14

Kimi K2: Leading Open-Source Model Now Available on Together AI

Score 16

Run Kimi K2 (1T params) on Together AI—frontier open model for agentic reasoning and coding, serverless deployment, 99.9% SLA, lower cost and instant scaling.

inference benchmark model-release research training fine-tuning evals

High signal Matched: cost, model, open model, agentic, open-source

Nota AI · korea · 2025-07-10

Video Self-Distillation for Single-Image Encoders: Learning Temporal Priors from Unlabeled Video

Score 20

Marcel Simon, Ph.D., ML Researcher, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI; Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH. Summary: Proposes a simple next-frame prediction task using unlabeled video to enhance sing...

High signal Matched: inference, performance, model, paper, research, training, fine-tuning, benchmarks

Together AI · inference-infra · 2025-07-10

Together AI Launches Speech-to-Text: High-Performance Whisper APIs

Score 12

No feed summary available yet.

High signal Matched: performance

SkyPilot · open-source · 2025-07-08

The Evolution of AI Job Orchestration. Part 1: Running AI jobs on GPU Neoclouds

Score 12

If you’re an infrastructure or MLOps engineer at a large company, you know the drill. The ML team comes to you with requirements that change weekly. They need GPUs yesterday, but the budget was set six months ago. They want to use th...

High signal Matched: cost, gpu

SqueezeBits · korea · 2025-07-03

OwLite Meets Qualcomm Neural Network: Unlocking On-Device AI Performance

Score 10

At SqueezeBits, we have been empowering developers to efficiently deploy complex AI models while minimizing performance trade-offs with the OwLite toolkit. With OwLite v2.5, we're excited to announce official support for Qualcomm Neural Network...

High signal Matched: performance

SkyPilot · open-source · 2025-07-02

Managing Networks in the Chaotic Cloud and Kubernetes World

Score 12

Configure high-performance networking across different cloud providers and managed infrastructure with SkyPilot's unified network tier abstraction.

inference benchmark model-release research training evals agents

High signal Matched: performance, cloud

BAIR · research · 2025-07-01

Whole-Body Conditioned Egocentric Video Prediction

Score 10

High signal Matched: inference, generation, performance, model, paper, arxiv, evaluation, training, evaluate, agent, agents

Modal · inference-infra · 2025-06-18

Run FLUX.1-dev three times faster

Score 8

Price, performance, and control: pick three.

High signal Matched: performance

Hugging Face · open-source · 2025-06-12

How Long Prompts Block Other Requests - Optimizing LLM Performance

Score 10

No feed summary available yet.

benchmark model-release api

High signal Matched: performance

Together AI · inference-infra · 2025-06-11

Introducing the Together AI Batch API: Process Thousands of LLM Requests at 50% Lower Cost

Score 16

No feed summary available yet.

High signal Matched: cost, introducing, api

Modular · inference-infra · 2025-06-10

Modular + AMD: Unleashing AI performance on AMD GPUs

Score 10

inference kv-cache benchmark model-release cloud

High signal Matched: performance

AIBrix · open-source · 2025-05-22

AIBrix v0.3.0 Release: KVCache Offloading, Prefix Cache, Fairness Routing, and Benchmarking Tools

Score 24

AIBrix is a composable, cloud-native AI infrastructure toolkit designed to power scalable and cost-effective large language model (LLM) inference. As production demands for memory-efficient and latency-aware LLM services continue to grow,...

High signal Matched: inference, prefix cache, latency, cost, release, model, cloud

Hugging Face · open-source · 2025-05-21

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Score 10

No feed summary available yet.

inference serving distributed benchmark model-release frontier-model

High signal Matched: performance

llm-d · open-source · 2025-05-20

Announcing the llm-d community!

Score 20

Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.

High signal Matched: inference, serving, distributed, performance, introducing, sota

Replicate · inference-infra · 2025-05-16

NVIDIA H100 GPUs are here

Score 12

NVIDIA H100 GPUs are here, with better performance and lower cost.

benchmark model-release research quantization

High signal Matched: performance, cost, h100

Nota AI · korea · 2025-05-08

SplitQuant: Layer Splitting for Low-Bit Neural Network Quantization for Edge AI Devices

Score 20

Jaewoo Song, Software Engineer, Nota AI. Summary: This study proposes an AI model preprocessing method for improved quantization accuracy on edge AI devices that do not support advanced quantization methods due to their limitat...

inference kv-cache benchmark model-release research training evals open-source

High signal Matched: performance, model, weights, research, quantization, int8, int4

Nota AI · korea · 2025-05-07

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Score 28

High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source

Hugging Face · open-source · 2025-04-16

Prefill and Decode for Concurrent Requests - Optimizing LLM Performance

Score 14

No feed summary available yet.

benchmark model-release research training fine-tuning evals rag api frontier-model

High signal Matched: prefill, performance

BAIR · research · 2025-04-11

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Score 10

Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications. However, as LLMs have improved, so have the attacks against them. Prompt injection attack is listed as the #1 threat by OWASP to LLM-integrated ap...

benchmark model-release quantization

High signal Matched: cost, model, evaluation, training, dpo, fine-tuning, retrieval, api, sota

SqueezeBits · korea · 2025-04-11

OwLite: No More Compromising on AI Performance After Quantization

Score 16

Discover how OwLite simplifies AI model optimization with seamless integration and secure architecture.

inference benchmark model-release research training rag

High signal Matched: performance, model, quantization

BAIR · research · 2025-04-08

Repurposing Protein Folding Models for Generation with Latent Diffusion

Score 20

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment...

inference kernel benchmark model-release research evals

High signal Matched: inference, generation, cost, model, weights, research, training, retrieval

Nota AI · korea · 2025-04-08

UniForm: A Reuse Attention Mechanism for Efficient Transformers on Resource-Constrained Edge Devices

Score 24

Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...

benchmark model-release cloud training

High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks

SkyPilot · open-source · 2025-04-08

High-Performance Model Checkpointing on the Cloud

Score 18

Techniques to speed up checkpointing by 9.6x and how to easily achieve them in SkyPilot

High signal Matched: performance, model, cloud, checkpointing

Hugging Face · open-source · 2025-04-02

Efficient Request Queueing – Optimizing LLM Performance

Score 10

No feed summary available yet.

High signal Matched: performance

SqueezeBits · korea · 2025-03-26

TensorRT-LLM Goes Open Source!

Score 12

With TensorRT-LLM now open source, we can finally take a deep dive into the secret sauce behind its impressive performance.

benchmark open-source

inference distributed benchmark model-release training long-context

High signal Matched: performance, open source

AIBrix · open-source · 2025-03-10

DeepSeek-R1 671B multi-host Deployment in AIBrix

Score 20

This blog post introduces deploying DeepSeek-R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through a step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...

High signal Matched: inference, distributed, benchmark, model, weights, training, context length

SkyPilot · open-source · 2025-03-05

Abusing SQLite to Handle Concurrency

Score 8

SkyPilot uses the venerable SQLite for state management. SQLite can handle millions of QPS and terabytes of data. However, our efforts to scale our Managed Jobs feature ran up against the one downfall of SQLite: many concurrent writers. S...

High signal Matched: qps

SqueezeBits · korea · 2025-02-27

Fits on Chips: Saving LLM Costs Became Easier Than Ever

Score 10

This article introduces Fits on Chips, an LLMOps toolkit for performance evaluation.

benchmark research evals

inference benchmark model-release research training fine-tuning

High signal Matched: performance, evaluation

Nota AI · korea · 2025-02-25

A Study on Detecting LLM-Generated Multilingual Content

Score 18

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI; Jaeyeon Kim, AI Research Engineer, Nota AI. Summary: In this study, we propose a method for determining whether given multilingual...

inference benchmark model-release agents open-source

High signal Matched: generation, performance, model, paper, research, training, fine-tuning

AIBrix · open-source · 2025-02-21

Introducing AIBrix: Cost-Effective and Scalable Control Plane for vLLM

Score 26

Open-source large language models (LLMs) such as LLaMA, DeepSeek, Qwen, and Mistral have surged in popularity, offering enterprises greater flexibility, cost savings, and control over their AI deployments. These models have empowered organ...

inference serving distributed kv-cache benchmark hardware model-release agents

High signal Matched: inference, generation, latency, cost, introducing, model, agents, open-source

AIBrix · open-source · 2025-02-19

AIBrix v0.2.0 Release: Distributed KV Cache, Orchestration and Heterogeneous GPU Support

Score 42

We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...

benchmark research training evals

High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent

Nota AI · korea · 2025-02-10

Where do LLMs Encode the Knowledge to Assess the Ambiguity?

Score 16

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI. Summary: In this study, we present a method for detecting ambiguous samples in natural language understanding (NLU) tasks using...

High signal Matched: performance, paper, research, evaluation, training, evaluate

Hugging Face · open-source · 2025-02-04

DABStep: Data Agent Benchmark for Multi-step Reasoning

Score 10

No feed summary available yet.

benchmark hardware model-release quantization evals

High signal Matched: benchmark, agent

SqueezeBits · korea · 2025-01-13

[Intel Gaudi] #4. FP8 Quantization

Score 20

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

High signal Matched: performance, accelerator, fp8, quantization, evaluate

Hugging Face · open-source · 2025-01-09

CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard

Score 10

No feed summary available yet.

benchmark evals

benchmark hardware research evals

High signal Matched: performance, leaderboard

SqueezeBits · korea · 2025-01-06

[Intel Gaudi] #3. Performance Evaluation with SynapseAI v1.19

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

serving benchmark hardware frontier-model

High signal Matched: performance, accelerator, evaluation, evaluate

Modular · inference-infra · 2024-12-17

MAX GPU: State of the Art Throughput on a New GenAI platform

Score 14

High signal Matched: throughput, gpu, state of the art

Hugging Face · open-source · 2024-12-17

Benchmarking Language Model Performance on 5th Gen Xeon at GCP

Score 14

No feed summary available yet.

High signal Matched: performance, model

Hugging Face · open-source · 2024-12-04

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Score 14

No feed summary available yet.

benchmark research evals

High signal Matched: benchmark, evaluation, leaderboard

Hugging Face · open-source · 2024-12-03

Investing in Performance: Fine-tune small models with LLM insights - a CFM case study

Score 10

No feed summary available yet.

benchmark hardware research evals

High signal Matched: performance

SqueezeBits · korea · 2024-12-03

[Intel Gaudi] #2. Graph Compiler and Overall Performance Evaluation

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

High signal Matched: performance, accelerator, evaluation, evaluate

SqueezeBits · korea · 2024-11-21

[Intel Gaudi] #1. Introduction

Score 12

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware evals

High signal Matched: performance, accelerator, evaluate

Replicate · inference-infra · 2024-11-15

NVIDIA L40S GPUs are here

Score 8

NVIDIA L40S GPUs are here, with better performance and lower cost.

inference kv-cache benchmark hardware model-release cloud open-source

High signal Matched: performance, cost

AIBrix · open-source · 2024-11-13

Introducing AIBrix v0.1.0: Building the Future of Scalable, Cost-Effective AI Infrastructure for Large Models

Score 32

In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...

High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source

SqueezeBits · korea · 2024-10-30

[vLLM vs TensorRT-LLM] #5. Dynamic Sequence Lengths

Score 8

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on performance with fixed and dynamic datasets.

High signal Matched: performance

SqueezeBits · korea · 2024-10-18

[vLLM vs TensorRT-LLM] #3. Understanding Sampling Methods and Their Performance Impact

Score 10

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks with various sampling methods.

inference serving benchmark research evals

High signal Matched: performance

SqueezeBits · korea · 2024-10-01

[vLLM vs TensorRT-LLM] #1. An Overall Evaluation

Score 22

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...

inference serving benchmark model-release

High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating

Modal · inference-infra · 2024-09-16

Boost your throughput with dynamic batching

Score 14

Learn how we used our new dynamic batching feature to improve throughput and reduce inference costs for the Whisper model with a single line of code!

High signal Matched: inference, throughput, model

Modular · inference-infra · 2024-09-13

MAX 24.5 - With SOTA CPU Performance for Llama 3.1

Score 10

benchmark frontier-model

High signal Matched: performance, sota

Modal · inference-infra · 2024-09-10

Building a cost-effective analytics stack with Modal, dlt, and dbt

Score 10

A step-by-step guide to building a scalable analytics stack using Modal, dlt, and dbt for efficient data loading, transformation, and deployment.

inference benchmark model-release research cloud training fine-tuning evals open-source

High signal Matched: cost

Nota AI · korea · 2024-08-02

Deploying an Efficient Vision-Language Model on Mobile Devices

Score 38

Jaeyeon Kim, Research Engineer, Nota AI; Geonmin Kim, Research Engineer, Nota AI; Hancheol Park, Team Lead of NetsPresso Application, Nota AI. Introduction: Recent large language models (LLMs) have demonstrated unprecedented performance...

High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source

Modal · inference-infra · 2024-07-09

Product updates: Datadog integration, lower function latency & more

Score 10

Welcome to another round of Modal Product Updates! Here's what's new this month.

High signal Matched: latency

Hugging Face · open-source · 2024-07-01

Our Transformers Code Agent beats the GAIA benchmark 🏅

Score 10

No feed summary available yet.

High signal Matched: benchmark, agent

SqueezeBits · korea · 2024-06-26

How much can we save through compression?

Score 10

Estimating the cost savings from model compression.

benchmark model-release research evals

High signal Matched: cost, model

Nota AI · korea · 2024-06-13

Cluster Self-Refinement for Enhanced Online Multi-Camera People Tracking

Score 8

Jeongho Kim, Research Engineer, Nota AI. Summary: Online multi-camera system for efficient individual tracking. Accurate ID management with Cluster Self-Refinement (CSR). Improved performance with enhanced pose estimation. Intro...

High signal Matched: performance, model, paper, research, evaluation, leaderboard

Hugging Face · open-source · 2024-05-09

Building Cost-Efficient Enterprise RAG applications with Intel Gaudi 2 and Intel Xeon

Score 10

No feed summary available yet.

benchmark rag

High signal Matched: cost, rag

Hugging Face · open-source · 2024-05-03

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Score 10

No feed summary available yet.

benchmark evals

High signal Matched: performance, leaderboard

Modular · inference-infra · 2024-04-10

Row-major vs. Column-major Matrices: A Performance Analysis in Mojo and NumPy

Score 10

inference serving benchmark model-release cloud

High signal Matched: performance

SkyPilot · open-source · 2024-02-20

Introducing SkyServe: 50% Cheaper AI Serving on Any Cloud with High Availability

Score 20

SkyServe: A simple, cost-efficient, multi-region/cloud library for serving GenAI models.

inference serving moe benchmark hardware model-release

High signal Matched: serving, cost, introducing, cloud

SkyPilot · open-source · 2023-12-21

Scaling Mixtral LLM Serving with High GPU Availability and Cost Efficiency

Score 24

A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.

benchmark model-release api open-source

High signal Matched: serving, mixtral, cost, gpu, model

Replicate · inference-infra · 2023-11-10

Using open-source models for faster and cheaper text embeddings

Score 10

An interactive example showing how to embed text using a state-of-the-art embedding model that beats OpenAI's embeddings API on price and performance.

High signal Matched: performance, model, api, open-source

Hugging Face · open-source · 2023-11-07

Comparing the Performance of LLMs: A Deep Dive into Roberta, Llama 2, and Mistral for Disaster Tweets Analysis with Lora

Score 10

No feed summary available yet.

benchmark fine-tuning

High signal Matched: performance, lora

Hugging Face · open-source · 2023-10-03

Chat Templates: An End to the Silent Performance Killer

Score 10

No feed summary available yet.

High signal Matched: performance

SkyPilot · open-source · 2023-09-27

Scaling AI Robotics on the Cloud

Score 12

Covariant runs AI on the cloud using SkyPilot, delivering models 4x faster cost-effectively.

High signal Matched: cost, cloud

Hugging Face · open-source · 2023-09-26

Llama 2 on Amazon SageMaker a Benchmark

Score 14

No feed summary available yet.

High signal Matched: benchmark, sagemaker

Hugging Face · open-source · 2023-09-01

Fetch Cuts ML Processing Latency by 50% Using Amazon SageMaker & Hugging Face

Score 14

No feed summary available yet.

High signal Matched: latency, sagemaker

Hugging Face · open-source · 2023-05-11

Assisted Generation: a new direction toward low-latency text generation

Score 14

No feed summary available yet.

benchmark model-release cloud

High signal Matched: generation, latency

SkyPilot · open-source · 2022-11-16

SkyPilot: ML and Data Science on any cloud with massive cost savings

Score 16

Introducing SkyPilot.

High signal Matched: cost, introducing, cloud

Hugging Face · open-source · 2022-10-19

MTEB: Massive Text Embedding Benchmark

Score 10

No feed summary available yet.

High signal Matched: benchmark

Hugging Face · open-source · 2022-01-13

Case Study: Millisecond Latency using Hugging Face Infinity and modern CPUs

Score 10

No feed summary available yet.

High signal Matched: latency

CoreWeave · cloud · 2026-06-03

Score 3

No feed summary available yet.

Watchlist Matched: performance

AWS Machine Learning Blog · cloud · 2026-06-02

Amazon Quick integration with time-series databases for market intelligence using MCP

Score 7

In this post, we walk through a practical implementation using KDB-X MCP server integration with Amazon Quick, demonstrating how traders and analysts can ask questions using conversational language and receive actionable insights from data...

benchmark research agents

Watchlist Matched: performance, mcp

Microsoft Research · big-tech · 2026-05-22

MagenticLite, MagenticBrain, Fara1.5: An agentic experience optimized for small models

Score 6

MagenticLite is an agentic system for small models that works across the browser and local file system in a single workflow. It combines specialized models and orchestration to support efficient agentic performance on everyday tasks.

Watchlist Matched: performance, research, agentic

Microsoft Research · big-tech · 2026-05-21

Vega: Zero-knowledge proofs for digital identity in the age of AI

Score 6

Vega turns a full credential into a single proof, sharing only what is needed and nothing more, with performance that works in real apps.

benchmark research

Watchlist Matched: performance, research

AI2 · research · 2026-05-19

OlmoEarth v1.1: A more efficient family of models

Score 6

OlmoEarth v1.1 is a more efficient family of remote-sensing models that cuts compute costs by up to 3x while maintaining similar performance, making large-scale satellite mapping faster and cheaper to run.

Watchlist Matched: performance

Cloudflare Blog · cloud · 2026-05-13

Browser Run: now running on Cloudflare Containers, it’s faster and more scalable

Score 4

We’ve enabled higher usage limits, faster performance, better reliability, and increased shipping velocity for our Browser Run product by rebuilding on top of Cloudflare’s Containers. Here’s how.

benchmark model-release training

Watchlist Matched: performance

Lambda · cloud · 2026-05-04

Most AI teams treat compute as a commodity. It's not.

Score 6

Consider two teams provisioning 8,192 GPUs for a large training run. Same model, same dataset, same budget. Team A lands on a facility purpose-built for AI with sufficient power density, carefully engineered liquid cooling, a high-performa...

Watchlist Matched: performance, model, training

AI2 · research · 2026-02-25

PreScience: Forecasting the future of science end-to-end

Score 0

PreScience is a new benchmark that evaluates whether AI can forecast how science unfolds end-to-end, from team formation through eventual impact.

Watchlist Matched: benchmark

LY Corporation Tech Blog · korea · 2026-01-14

Safety is a given, cost savings are a bonus: why AI services need dedicated guardrails

Score 6

Introduction: what are guardrails? Various mechanisms for making AI safer to use are commonly ref...

benchmark model-release research training

Watchlist Matched: cost

BAIR · research · 2025-11-01

RL without TD learning

Score 4

In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: divide and conquer. Unlike traditional methods, this algorithm is not based on temporal difference (TD) learning (which has scalabilit...

Watchlist Matched: benchmark, performance, model, paper, training

LY Corporation Tech Blog · korea · 2025-09-19

Improving code quality - Session 47: Breach of non-performance

Score 6

The original article was published on October 24, 2024. Hello, I'm Munetoshi Ishikawa, a mobile clien...

serving kernel benchmark model-release research training agents

Watchlist Matched: performance

BAIR · research · 2025-03-25

Scaling Up Reinforcement Learning for Traffic Smoothing: A 100-AV Highway Deployment

Score 6

We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and...