Moreh · korea · 2026-06-03
Moreh vLLM Performance Evaluation: DeepSeek V3/R1 671B on AMD Instinct MI300X GPUs
No feed summary available yet.
High signal Matched: performance, mi300x, evaluation
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: performance, mi300x, evaluation
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: performance, mi300x, evaluation
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: performance, benchmarks
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: benchmark, performance
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: hicache, benchmark
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: performance
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: benchmark
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: performance
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: performance
Perplexity Research · model-lab · 2026-06-03
No feed summary available yet.
High signal Matched: performance
Gcore · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: performance
BentoML · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, serve, performance, model
OpenAI · model-lab · 2026-06-03
No feed summary available yet.
High signal Matched: cost
BentoML · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: performance
Nebius · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: performance, cloud, training
Runpod · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: latency
FriendliAI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, performance
FuriosaAI · hardware · 2026-06-03
No feed summary available yet.
High signal Matched: throughput, furiosa, sdk
Baseten · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: performance, model
LightSeek Foundation · research · 2026-06-03
No feed summary available yet.
High signal Matched: inference, kernel, performance, agentic
FriendliAI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: cost
Cohere · model-lab · 2026-06-03
No feed summary available yet.
High signal Matched: performance, agentic
Databricks AI · big-tech · 2026-06-03
No feed summary available yet.
High signal Matched: performance
AWS Machine Learning Blog · cloud · 2026-06-03
Fine-tuning for domain-specific tasks means improving performance in one area without degrading the model’s general capabilities, and getting that balance right is harder than it looks. This post walks through how to navigate that balance,...
High signal Matched: performance, model, training, checkpointing, fine-tuning
AWS Machine Learning Blog · cloud · 2026-06-02
GPT-5.5, GPT-5.4, and Codex are now generally available on Amazon Bedrock. Deploy them in production applications and agents today, on Bedrock’s high performance inference engine.
High signal Matched: inference, performance, bedrock, agents
Lambda · cloud · 2026-06-01
When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...
High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic
AMD ROCm Blogs · hardware · 2026-06-01
Reinforcement learning (RL) is rapidly becoming a foundational technology for Large Language Models (LLMs)—powering key abilities such as reasoning and agentic behaviors. As RL workloads grow more complex and computationally intensive, the...
High signal Matched: performance, gpu, agentic
AMD ROCm Blogs · hardware · 2026-06-01
This blog, like the previous articles in the profiling guide series (Part 1, Part 2, and Part 3), is designed to help you systematically analyze and improve the performance of your Fortran OpenMP offload applications running on AMD GPUs. T...
High signal Matched: performance
Nota AI · korea · 2026-05-29
Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...
High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard
AWS Machine Learning Blog · cloud · 2026-05-29
Datasets in AgentCore is in public preview. Agent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchm...
High signal Matched: benchmark, evaluation, bedrock, agent
AMD ROCm Blogs · hardware · 2026-05-29
Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...
High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model
AMD ROCm Blogs · hardware · 2026-05-29
Quantum computing offers a fundamentally different approach to computational problems by leveraging quantum mechanical properties such as superposition and entanglement. Unlike a classical bit, which is always 0 or 1, a qubit can exist in...
High signal Matched: benchmark, cost, gpu
PyTorch Foundation · open-source · 2026-05-28
TL;DR: The TokenSpeed inference engine achieved a record-breaking 580 tps running the Qwen3.5-397B-A17B model on GPUs. This extreme performance for agentic workloads is driven by systematic elimination of memory copies,...
High signal Matched: inference, performance, gpu, model, agentic
vLLM Project · open-source · 2026-05-28
As organizations increasingly adopt AI-powered development tools, the need for high-performance agentic models that deliver both accuracy and operational efficiency has become critical. Laguna...
High signal Matched: inference, performance, agentic
AMD ROCm Blogs · hardware · 2026-05-27
Our previous two posts in this GEMM optimization series covered Matrix Core instructions and 8-wave ping-pong FP8 GEMM design. Here we discuss another algorithm design introduced by HipKittens - 4-wave interleave, which further improves th...
High signal Matched: gemm, performance, fp8
NVIDIA Technical Blog · hardware · 2026-05-26
NVIDIA CompileIQ tackles one of the hardest problems in performance engineering: finding the compiler options that unlock the best performance for a specific...
High signal Matched: kernel, performance
NVIDIA Technical Blog · hardware · 2026-05-26
Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...
High signal Matched: cuda, performance, gpu
NVIDIA Technical Blog · hardware · 2026-05-26
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...
High signal Matched: cuda, performance, gpu, launch
Lambda · cloud · 2026-05-22
After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....
High signal Matched: inference, serving, performance, cost, release, model, open-source
AMD ROCm Blogs · hardware · 2026-05-22
Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...
High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source
AMD ROCm Blogs · hardware · 2026-05-22
On a single MI355, our most-optimized FP16 GEMM kernel runs at 99% MFMA efficiency — the matrix engine sits idle for a handful of cycles per loop. Getting there took ten versions, a regression along the way, and a profiler open for the who...
High signal Matched: kernel, gemm, performance
Lambda · cloud · 2026-05-21
The unit of AI compute has shifted from single hosts to rack-scale systems that integrate NVIDIA GPUs, CPUs, scale-up networking fabrics, and liquid cooling, such as the NVIDIA GB300 NVL72 and NVIDIA Vera Rubin NVL72. Teams at the frontier...
High signal Matched: serving, performance, cloud, training, api
NVIDIA Technical Blog · hardware · 2026-05-21
As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on...
High signal Matched: performance, gb200
Lambda · cloud · 2026-05-20
What the numbers mean for financial services. Executive summary: Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...
High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating
AMD ROCm Blogs · hardware · 2026-05-20
AMD released ROCm Core 7.13, the AMD GPU Driver 31.30, and AMD GPU Virtualization 9.0. With these releases, ROCm software expands hardware support across enterprise datacenters. The platform introduces AMD’s latest Instinct accelerators, e...
High signal Matched: performance, gpu, rocm, open-source
AMD ROCm Blogs · hardware · 2026-05-20
Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent,...
High signal Matched: inference, latency, introducing, quantization
NVIDIA Technical Blog · hardware · 2026-05-19
Evaluating an AI model and evaluating an AI agent are related—but they answer fundamentally different questions. A model benchmark tests the capability of a...
High signal Matched: benchmark, model, evaluation, evaluating, agent, agentic
Together AI · inference-infra · 2026-05-19
Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
High signal Matched: inference, ttft, cost, benchmarks, agents
Together AI · inference-infra · 2026-05-15
Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.
High signal Matched: inference, endpoint, cost, launch, research
Microsoft Research · big-tech · 2026-05-14
mimalloc is an open-source, modern, scalable memory allocator that is a drop-in replacement for malloc and free. It is relatively small (~12K lines), with clear internal data structures, and is easy to build and integrate into other projec...
High signal Matched: performance, research, open-source
Microsoft Research · big-tech · 2026-05-14
Introducing GridSFM, a small foundation model that can predict AC optimal power flow in milliseconds, boosting efficiency and unlocking cost savings. Learn how GridSFM gives grid operators direct visibility into congestion, stability, and...
High signal Matched: cost, introducing, model, research
vLLM Project · open-source · 2026-05-14
Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...
High signal Matched: serving, throughput, kv cache, moe
AI2 · research · 2026-05-13
AIMIP is a new open benchmark and dataset for evaluating AI climate models, showing they can match or beat conventional models on some historical climate metrics while still struggling to generalize reliably to long-term warming trends and...
High signal Matched: benchmark, introducing, model, evaluating
Cloudflare Blog · cloud · 2026-05-12
We investigated a bug where CUBIC's congestion window became pinned at its minimum floor, causing performance to plummet. The fix involved correctly measuring idle periods to distinguish RTT wait times from actual application idleness.
High signal Matched: kernel, performance
Nota AI · korea · 2026-05-11
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
vLLM Project · open-source · 2026-05-11
TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...
High signal Matched: performance, gpu, quantization
BAIR · research · 2026-05-08
No feed summary available yet.
High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model
AI2 · research · 2026-05-08
EMO is a new mixture-of-experts model trained so modular expert groups emerge from data, enabling users to select small task-specific expert subsets while preserving near full-model performance.
High signal Matched: mixture of experts, performance, model, pretraining
NVIDIA Technical Blog · hardware · 2026-05-07
Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By...
High signal Matched: inference, performance, model, training, post-training, quantization
NVIDIA Technical Blog · hardware · 2026-05-07
Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down,...
High signal Matched: distributed, nccl, performance, gpu, training
vLLM Project · open-source · 2026-05-06
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...
High signal Matched: serving, throughput, distributed, kv cache, agentic
Modal · inference-infra · 2026-05-04
If we've said it once, we've said it once per millisecond: never block the GPU.
High signal Matched: inference, performance, gpu
Cloudflare Blog · cloud · 2026-05-01
Dynamic Workflows is a library that lets you route durable execution to tenant-provided code on the fly. Built on Dynamic Workers, it enables platforms to serve millions of unique workflows at near-zero idle cost.
High signal Matched: serve, cost, introducing
NVIDIA Technical Blog · hardware · 2026-04-30
Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches...
High signal Matched: inference, performance
Nota AI · korea · 2026-04-29
Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...
High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic
LMCache · open-source · 2026-04-23
Overview Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context length...
High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker
Nota AI · korea · 2026-04-22
Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...
High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source
NVIDIA Technical Blog · hardware · 2026-04-20
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...
High signal Matched: generation, throughput, fp8, training
BAIR · research · 2026-04-20
No feed summary available yet.
High signal Matched: performance, model, paper, arxiv, evaluation, training
Together AI · inference-infra · 2026-04-15
Parcae is a stable looped language model that matches the quality of a Transformer twice its size — a 770M model reaching 1.3B-level performance. We introduce the first scaling laws for looping and show that increasing recurrence, not just...
High signal Matched: performance, model
NVIDIA Technical Blog · hardware · 2026-04-14
When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is data transfer performance. This applies to...
High signal Matched: cuda, performance, gpu
Nota AI · korea · 2026-04-08
Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...
High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate
vLLM Project · open-source · 2026-04-07
TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...
High signal Matched: inference, prefill, itl, gpu, mi300x
LMCache · open-source · 2026-04-04
Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...
High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents
NVIDIA Technical Blog · hardware · 2026-04-02
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...
High signal Matched: throughput, gpu, model
NVIDIA Technical Blog · hardware · 2026-04-02
In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use...
High signal Matched: inference, latency
Modular · inference-infra · 2026-04-02
Day Zero Launch: Fastest Performance for Gemma 4 on NVIDIA and AMD
High signal Matched: performance, launch
NVIDIA Technical Blog · hardware · 2026-04-01
Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak...
High signal Matched: throughput, cost
NVIDIA Technical Blog · hardware · 2026-04-01
In today’s AI factory environment, performance is not theoretical. It is economic, competitive, and existential. A 1% drop in usable GPU time can mean...
High signal Matched: performance, gpu
Nota AI · korea · 2026-03-31
Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...
High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model
Together AI · inference-infra · 2026-03-26
As context windows grow, LLM performance degrades in unexpected ways. We show how a "Divide & Conquer" framework — breaking long documents into parallel chunks with a planner, workers, and manager — lets smaller models like Llama-3-70B and...
High signal Matched: performance, long context
NVIDIA Technical Blog · hardware · 2026-03-25
In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...
High signal Matched: throughput, gpu, model
NVIDIA Technical Blog · hardware · 2026-03-25
In the AI era, power is the ultimate constraint, and every AI factory operates within a hard limit. This makes performance per watt—the rate at which power is...
High signal Matched: performance
NVIDIA Technical Blog · hardware · 2026-03-23
Industrial and medical systems are rapidly increasing the use of high-performance AI to improve worker productivity, human-machine interaction, and downtime...
High signal Matched: performance
Nota AI · korea · 2026-03-23
Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...
High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source
Nota AI · korea · 2026-03-20
NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...
High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization
Together AI · inference-infra · 2026-03-18
Together AI expands fine-tuning with native support for tool call, reasoning, and vision-language models, plus 100B+ model training, up to 6× higher throughput, and job cost and ETA estimates.
High signal Matched: throughput, cost, model, training, fine-tuning
Hugging Face · open-source · 2026-03-17
No feed summary available yet.
High signal Matched: throughput, agent, computer use
NVIDIA Technical Blog · hardware · 2026-03-16
AI is evolving, and reasoning models are increasing token demand, placing new requirements on every layer of AI infrastructure. More than ever, compute must...
High signal Matched: performance
NVIDIA Technical Blog · hardware · 2026-03-16
NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of...
High signal Matched: inference, latency, accelerator
Nota AI · korea · 2026-03-13
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
BAIR · research · 2026-03-13
Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...
High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag
llm-d · open-source · 2026-03-13
A lightweight ML model trained online from live traffic replaces manually tuned heuristic weights with direct latency predictions, achieving 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic wo...
High signal Matched: latency, ttft, model, weights
Together AI · inference-infra · 2026-03-12
Build real-time voice agents on Together AI with co-located STT, LLM, and TTS infrastructure, native Deepgram and Cartesia support, and end-to-end latency under 500ms.
High signal Matched: latency, agents
Together AI · inference-infra · 2026-03-05
As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to soft...
High signal Matched: throughput, kernel, flashattention, gpu
Modular · inference-infra · 2026-03-05
Structured Mojo Kernels Part 1 - Peak Performance, Half the Code
High signal Matched: performance
Together AI · inference-infra · 2026-03-04
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...
High signal Matched: inference, serving, prefill, throughput, long-context
vLLM Project · open-source · 2026-02-27
For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.
High signal Matched: inference, performance, rocm
Nota AI · korea · 2026-02-26
Jewon Lee | Wooksu Shin | Seungmin Yang | Ki-Ung Song | Donguk Lim | Jaeyeon Kim | Tae-Ho Kim | Bo-Kyeong Kim, EdgeFM Team, Nota AI. ✔️ Resources for more information: GitHub, ArXiv, Project Page, Demo. ✔️ Accepted at ICLR 2026. ...
High signal Matched: inference, generation, verification, benchmark, performance, latency, cost, model, arxiv, evaluation, training, post-training, benchmarks
SkyPilot · open-source · 2026-02-21
SkyPilot Admin Policies let you enforce cost controls, security rules, and compliance requirements automatically — without slowing down your engineering team.
High signal Matched: cost, gpu
Together AI · inference-infra · 2026-02-19
Standard diffusion language models can't use KV caching and need too many refinement steps to be practical. CDLM fixes both with a post-training recipe that enables exact block-wise KV caching and trajectory-consistent step reduction — del...
High signal Matched: inference, latency, training, post-training
vLLM Project · open-source · 2026-02-13
DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...
High signal Matched: throughput, deepseek-v3, performance, gpu, blackwell, quantization
Google Research · big-tech · 2026-02-11
Algorithms & Theory
High signal Matched: throughput
llm-d · open-source · 2026-02-10
llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.
High signal Matched: inference, throughput, kv cache, ttft, gpu
llm-d · open-source · 2026-02-04
llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.
High signal Matched: inference, performance, lora
vLLM Project · open-source · 2026-02-03
Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...
High signal Matched: serving, throughput, performance, h200, gb200, blackwell
Together AI · inference-infra · 2026-02-02
Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...
High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss
Together AI · inference-infra · 2026-02-02
Together Evaluations now supports OpenAI, Anthropic, and Google models for cross-provider benchmarking. Compare open-source, fine-tuned, and proprietary models side-by-side to make data-driven decisions on quality, cost, and performance—al...
High signal Matched: performance, cost, open-source, open source
vLLM Project · open-source · 2026-02-01
TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...
High signal Matched: performance, blackwell, model, open-source, oss
Together AI · inference-infra · 2026-01-26
Introducing DSGym—a holistic evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...
High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source
Together AI · inference-infra · 2026-01-22
Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.
High signal Matched: inference, throughput, latency, cost, gpu
Together AI · inference-infra · 2026-01-13
Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...
High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents
BAIR · research · 2026-01-10
An encoder (optical system) maps objects to noiseless images, which noise corrupts into measurements. Our information estimator uses only these noisy measurements and a noise model to quantify how well measurements distinguish objects. Man...
High signal Matched: performance, model, paper, evaluation, training, evaluate
Together AI · inference-infra · 2026-01-08
Learn how to choose the right open-source model for production by evaluating model quality, benchmarking performance, and deploying open models that balance cost, speed, and accuracy.
High signal Matched: performance, cost, model, open model, evaluating, open-source
vLLM Project · open-source · 2026-01-08
In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...
High signal Matched: inference, throughput, kv cache
SqueezeBits · korea · 2025-12-24
Introducing ATOM™-Max, Rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...
High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions
Together AI · inference-infra · 2025-12-23
MiniMax Speech 2.6 Turbo: State-of-the-art multilingual TTS with human-level emotional awareness, sub-250ms latency, and 40+ languages—now on Together AI.
High signal Matched: latency
Nota AI · korea · 2025-12-19
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
vLLM Project · open-source · 2025-12-19
We are thrilled to announce a major performance update for vLLM-Omni.
High signal Matched: performance
Together AI · inference-infra · 2025-12-17
Dan Fu, our VP of Kernels, has published a new post challenging the idea that AI is hitting a hardware wall. He argues that we are vastly underutilizing current chips and that better software-hardware co-design will unlock the next order o...
High signal Matched: performance, research
vLLM Project · open-source · 2025-12-16
Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...
High signal Matched: router, performance
vLLM Project · open-source · 2025-12-13
Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...
High signal Matched: serving, prefill, router, performance, model
vLLM Project · open-source · 2025-12-13
Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...
High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training
SqueezeBits · korea · 2025-12-10
Rebellions and SqueezeBits Co-Host a vLLM Hands-on Workshop: Workshop Highlights, PyTorch Best Practices, Performance Optimization, and Developer First-Hand Tips!
High signal Matched: performance, rebellions
Google Research · big-tech · 2025-12-04
Machine Intelligence
High signal Matched: benchmark
Modal · inference-infra · 2025-12-02
We've partnered with Mistral to bring you Day 0 support for Mistral 3, with GPU-snapshot-optimized performance.
High signal Matched: performance, gpu
llm-d · open-source · 2025-12-02
llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.
High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota
AIBrix · open-source · 2025-11-26
In recent years, large language models (LLMs) such as GPT, DeepSeek, Doubao and Qwen have advanced rapidly and are reshaping a wide range of industries. As the Scaling Law continues to be validated and pushed to its limits, LLM capabilitie...
High signal Matched: inference, serving, generation, throughput, performance, latency, cost
Modal · inference-infra · 2025-11-19
Learn how Reducto used GPU memory snapshotting and flexible autoscaling to build fast multi-model pipelines.
High signal Matched: latency, gpu, model
Modal · inference-infra · 2025-11-13
How Decagon and Modal made real-time voice AI possible, combining fine-tuned small models with a re-engineered inference runtime for sub-second latency.
High signal Matched: inference, latency
AIBrix · open-source · 2025-11-10
🚀 AIBrix v0.5.0 Release. Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for hi...
High signal Matched: prefill, latency, release, evaluation, api, openai-compatible
Modal · inference-infra · 2025-11-04
How we built a real-time voice bot on Modal's distributed serverless platform.
High signal Matched: distributed, latency
Together AI · inference-infra · 2025-11-04
Together AI launches the fastest voice AI stack: streaming Whisper STT, serverless open-source TTS (Orpheus & Kokoro), and Voxtral transcription. Sub-second latency for production voice agents.
High signal Matched: inference, latency, agents, open-source
Together AI · inference-infra · 2025-11-04
Understanding how to evaluate and benchmark Large Language Models (LLMs). Test, compare, and understand LLMs.
High signal Matched: benchmark, evaluate
SqueezeBits · korea · 2025-10-31
Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable,...
High signal Matched: inference, generation, latency, model
Modal · inference-infra · 2025-10-29
We've collaborated with Datalab, the creators of Marker and Surya, to make it faster than ever to deploy document intelligence workflows.
High signal Matched: throughput
SqueezeBits · korea · 2025-10-28
Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.
High signal Matched: inference, serving, gemm, performance, h100, training
Together AI · inference-infra · 2025-10-22
ReasonIF finds frontier LRMs fail to follow reasoning instructions >75% of the time; introduces a benchmark across languages, formatting, and length.
High signal Matched: benchmark
Modular · inference-infra · 2025-10-17
Achieving State-of-the-Art Performance on AMD MI355 — in Just 14 Days
High signal Matched: performance
Together AI · inference-infra · 2025-10-10
LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.
High signal Matched: inference, deepseek-v3, performance, accelerator
llm-d · open-source · 2025-10-10
llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.
High signal Matched: inference, latency, tokens/sec, gpu, tpu
SqueezeBits · korea · 2025-10-02
Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimization inference engine, it delivers reliable image, video, and future LLM services at a fraction of the cost.
High signal Matched: inference, cost, api
llm-d · open-source · 2025-09-24
See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.
High signal Matched: inference, throughput, distributed, benchmarks
SqueezeBits · korea · 2025-09-16
The guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output based on your use case.
High signal Matched: decoding, benchmark, performance
Together AI · inference-infra · 2025-09-15
Our new Batch Inference API makes large-scale AI workloads simpler, faster, and cheaper. With a streamlined UI, universal model support, and 3000× higher rate limits—now up to 30B tokens—you can process massive datasets at half the cost of...
High signal Matched: inference, cost, model, api
Modular · inference-infra · 2025-09-12
Matrix Multiplication on Blackwell: Part 3 - The Optimizations Behind 85% of SOTA Performance
High signal Matched: performance, blackwell, sota
Modal · inference-infra · 2025-09-09
A collaborative environment for high-performance interactive computing on GPUs.
High signal Matched: performance, introducing
llm-d · open-source · 2025-09-03
Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.
High signal Matched: inference, throughput, latency
BAIR · research · 2025-09-01
What exactly does word2vec learn, and how? Answering this question amounts to understanding representation learning in a minimal yet interesting language modeling task. Despite the fact that word2vec is a well-known precursor to modern lan...
High signal Matched: benchmark, performance, model, weights, paper, training
Modal · inference-infra · 2025-08-28
Zencastr scaled up to 1,500 concurrent GPUs on Modal to process hundreds of years of podcast audio in just a few days. Today they run transcription, speaker detection, and audio enrichment for millions of podcast episodes on Modal, giving...
High signal Matched: cost
Together AI · inference-infra · 2025-08-19
Customize OpenAI’s gpt-oss-20B/120B with Together AI’s fine-tuning: train, optimize, and instantly deploy domain experts with enterprise reliability and cost efficiency.
High signal Matched: cost, fine-tuning, oss
AIBrix · open-source · 2025-08-05
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...
High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud
Hugging Face · open-source · 2025-08-01
No feed summary available yet.
High signal Matched: benchmark
Together AI · inference-infra · 2025-07-28
Together Evaluations is a flexible framework for benchmarking LLMs using strong open-source models as judges. Skip manual labeling and rigid metrics—get fast, customizable insights into model quality for your specific tasks.
High signal Matched: benchmark, model, open-source
SqueezeBits · korea · 2025-07-21
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses these issues through finer-grained, block-wise updates, significantly enhancing performance and expressivi...
High signal Matched: performance, cost, fine-tuning, lora
SkyPilot · open-source · 2025-07-16
This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...
High signal Matched: infiniband, performance, cost, gpu, cloud
Together AI · inference-infra · 2025-07-14
Run Kimi K2 (1T params) on Together AI—frontier open model for agentic reasoning and coding, serverless deployment, 99.9% SLA, lower cost and instant scaling.
High signal Matched: cost, model, open model, agentic, open-source
Nota AI · korea · 2025-07-10
Marcel Simon, Ph.D., ML Researcher, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI; Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH. Summary: Proposes a simple next-frame prediction task using unlabeled video to enhance sing...
High signal Matched: inference, performance, model, paper, research, training, fine-tuning, benchmarks
Together AI · inference-infra · 2025-07-10
No feed summary available yet.
High signal Matched: performance
SkyPilot · open-source · 2025-07-08
If you’re an infrastructure or MLOps engineer at a large company, you know the drill. The ML team comes to you with requirements that change weekly. They need GPUs yesterday, but the budget was set six months ago. They want to use th...
High signal Matched: cost, gpu
SqueezeBits · korea · 2025-07-03
At SqueezeBits we have been empowering developers to efficiently deploy complex AI models while minimizing performance trade-offs with OwLite toolkit. With OwLite v2.5, we're excited to announce official support for Qualcomm Neural Network...
High signal Matched: performance
SkyPilot · open-source · 2025-07-02
Configure high-performance networking across cloud providers and managed infrastructure with SkyPilot's unified network tier abstraction.
High signal Matched: performance, cloud
BAIR · research · 2025-07-01
No feed summary available yet.
High signal Matched: inference, generation, performance, model, paper, arxiv, evaluation, training, evaluate, agent, agents
Modal · inference-infra · 2025-06-18
Price, performance, and control: pick three.
High signal Matched: performance
Hugging Face · open-source · 2025-06-12
No feed summary available yet.
High signal Matched: performance
Together AI · inference-infra · 2025-06-11
No feed summary available yet.
High signal Matched: cost, introducing, api
Modular · inference-infra · 2025-06-10
Modular + AMD: Unleashing AI performance on AMD GPUs
High signal Matched: performance
AIBrix · open-source · 2025-05-22
AIBrix is a composable, cloud-native AI infrastructure toolkit designed to power scalable and cost-effective large language model (LLM) inference. As production demands for memory-efficient and latency-aware LLM services continue to grow,...
High signal Matched: inference, prefix cache, latency, cost, release, model, cloud
Hugging Face · open-source · 2025-05-21
No feed summary available yet.
High signal Matched: performance
llm-d · open-source · 2025-05-20
Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.
High signal Matched: inference, serving, distributed, performance, introducing, sota
Replicate · inference-infra · 2025-05-16
NVIDIA H100 GPUs are here, with better performance and lower cost.
High signal Matched: performance, cost, h100
Nota AI · korea · 2025-05-08
Jaewoo Song, Software Engineer, Nota AI. Summary: This study proposes an AI model preprocessing method for improved quantization accuracy on edge AI devices which do not support advanced quantization methods due to their limitat...
High signal Matched: performance, model, weights, research, quantization, int8, int4
Nota AI · korea · 2025-05-07
Jewon Lee | Ki-Ung Song | Seungmin Yang | Donguk Lim | Jaeyeon Kim | Wooksu Shin | Bo-Kyeong Kim | Tae-Ho Kim, EdgeFM Team, Nota AI; Yong Jae Lee, Ph.D., Associate Professor, UW-Madison. Summary: Our method, Trimmed-Llama, reduces t...
High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source
Hugging Face · open-source · 2025-04-16
No feed summary available yet.
High signal Matched: prefill, performance
BAIR · research · 2025-04-11
Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications. However, as LLMs have improved, so have the attacks against them. Prompt injection attack is listed as the #1 threat by OWASP to LLM-integrated ap...
High signal Matched: cost, model, evaluation, training, dpo, fine-tuning, retrieval, api, sota
SqueezeBits · korea · 2025-04-11
Discover how OwLite simplifies AI model optimization with seamless integration and secure architecture.
High signal Matched: performance, model, quantization
BAIR · research · 2025-04-08
PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment...
High signal Matched: inference, generation, cost, model, weights, research, training, retrieval
Nota AI · korea · 2025-04-08
Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...
High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks
SkyPilot · open-source · 2025-04-08
Techniques to speed up checkpointing by 9.6x and how to easily achieve them in SkyPilot
High signal Matched: performance, model, cloud, checkpointing
Hugging Face · open-source · 2025-04-02
No feed summary available yet.
High signal Matched: performance
SqueezeBits · korea · 2025-03-26
With TensorRT-LLM now open source, we can finally take a deep dive into the secret sauce behind its impressive performance.
High signal Matched: performance, open source
AIBrix · open-source · 2025-03-10
This blog post introduces how to deploy DeepSeek R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through a step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...
High signal Matched: inference, distributed, benchmark, model, weights, training, context length
SkyPilot · open-source · 2025-03-05
SkyPilot uses the venerable SQLite for state management. SQLite can handle millions of QPS, and terabytes of data. However, our efforts to scale our Managed Jobs feature ran up against the one downfall of SQLite: many concurrent writers. S...
High signal Matched: qps
SqueezeBits · korea · 2025-02-27
This article introduces Fits on Chips, an LLMOps toolkit for performance evaluation.
High signal Matched: performance, evaluation
Nota AI · korea · 2025-02-25
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI; Jaeyeon Kim, AI Research Engineer, Nota AI. Summary: In this study, we propose a method for determining whether given multilingual...
High signal Matched: generation, performance, model, paper, research, training, fine-tuning
AIBrix · open-source · 2025-02-21
Open-source large language models (LLMs) such as LLaMA, DeepSeek, Qwen, and Mistral have surged in popularity, offering enterprises greater flexibility, cost savings, and control over their AI deployments. These models have empowered organ...
High signal Matched: inference, generation, latency, cost, introducing, model, agents, open-source
AIBrix · open-source · 2025-02-19
We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...
High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent
Nota AI · korea · 2025-02-10
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI. Summary: In this study, we present a method for detecting ambiguous samples in natural language understanding (NLU) tasks using...
High signal Matched: performance, paper, research, evaluation, training, evaluate
Hugging Face · open-source · 2025-02-04
No feed summary available yet.
High signal Matched: benchmark, agent
SqueezeBits · korea · 2025-01-13
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, fp8, quantization, evaluate
Hugging Face · open-source · 2025-01-09
No feed summary available yet.
High signal Matched: performance, leaderboard
SqueezeBits · korea · 2025-01-06
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluation, evaluate
Modular · inference-infra · 2024-12-17
MAX GPU: State of the Art Throughput on a New GenAI platform
High signal Matched: throughput, gpu, state of the art
Hugging Face · open-source · 2024-12-17
No feed summary available yet.
High signal Matched: performance, model
Hugging Face · open-source · 2024-12-04
No feed summary available yet.
High signal Matched: benchmark, evaluation, leaderboard
Hugging Face · open-source · 2024-12-03
No feed summary available yet.
High signal Matched: performance
SqueezeBits · korea · 2024-12-03
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluation, evaluate
SqueezeBits · korea · 2024-11-21
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluate
Replicate · inference-infra · 2024-11-15
NVIDIA L40S GPUs are here, with better performance and lower cost.
High signal Matched: performance, cost
AIBrix · open-source · 2024-11-13
In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...
High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source
SqueezeBits · korea · 2024-10-30
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on performance with fixed and dynamic datasets.
High signal Matched: performance
SqueezeBits · korea · 2024-10-18
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks with various sampling methods.
High signal Matched: performance
SqueezeBits · korea · 2024-10-01
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...
High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating
Modal · inference-infra · 2024-09-16
Learn how we used our new dynamic batching feature to improve throughput and reduce inference costs for the Whisper model with a single line of code!
High signal Matched: inference, throughput, model
Modular · inference-infra · 2024-09-13
MAX 24.5 - With SOTA CPU Performance for Llama 3.1
High signal Matched: performance, sota
Modal · inference-infra · 2024-09-10
A step-by-step guide to building a scalable analytics stack using Modal, dlt, and dbt for efficient data loading, transformation, and deployment.
High signal Matched: cost
Nota AI · korea · 2024-08-02
Jaeyeon Kim, Research Engineer, Nota AI; Geonmin Kim, Research Engineer, Nota AI; Hancheol Park, Team Lead of NetsPresso Application, Nota AI. Introduction: Recent large language models (LLMs) have demonstrated unprecedented performance...
High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source
Modal · inference-infra · 2024-07-09
Welcome to another round of Modal Product Updates! Here's what's new this month.
High signal Matched: latency
Hugging Face · open-source · 2024-07-01
No feed summary available yet.
High signal Matched: benchmark, agent
SqueezeBits · korea · 2024-06-26
Estimating the cost savings from model compression.
High signal Matched: cost, model
Nota AI · korea · 2024-06-13
Jeongho Kim, Research Engineer, Nota AI. Summary: Online multi-camera system for efficient individual tracking. Accurate ID management with Cluster Self-Refinement (CSR). Improved performance with enhanced pose estimation. Intro...
High signal Matched: performance, model, paper, research, evaluation, leaderboard
Hugging Face · open-source · 2024-05-09
No feed summary available yet.
High signal Matched: cost, rag
Hugging Face · open-source · 2024-05-03
No feed summary available yet.
High signal Matched: performance, leaderboard
Modular · inference-infra · 2024-04-10
Row-major vs. Column-major Matrices: A Performance Analysis in Mojo and NumPy
High signal Matched: performance
SkyPilot · open-source · 2024-02-20
SkyServe: A simple, cost-efficient, multi-region/cloud library for serving GenAI models.
High signal Matched: serving, cost, introducing, cloud
SkyPilot · open-source · 2023-12-21
A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.
High signal Matched: serving, mixtral, cost, gpu, model
Replicate · inference-infra · 2023-11-10
An interactive example showing how to embed text using a state-of-the-art embedding model that beats OpenAI's embeddings API on price and performance.
High signal Matched: performance, model, api, open-source
Hugging Face · open-source · 2023-11-07
No feed summary available yet.
High signal Matched: performance, lora
Hugging Face · open-source · 2023-10-03
No feed summary available yet.
High signal Matched: performance
SkyPilot · open-source · 2023-09-27
Covariant runs AI on the cloud using SkyPilot, delivering models 4x faster and cost-effectively.
High signal Matched: cost, cloud
Hugging Face · open-source · 2023-09-26
No feed summary available yet.
High signal Matched: benchmark, sagemaker
Hugging Face · open-source · 2023-09-01
No feed summary available yet.
High signal Matched: latency, sagemaker
Hugging Face · open-source · 2023-05-11
No feed summary available yet.
High signal Matched: generation, latency
SkyPilot · open-source · 2022-11-16
Introducing SkyPilot.
High signal Matched: cost, introducing, cloud
Hugging Face · open-source · 2022-10-19
No feed summary available yet.
High signal Matched: benchmark
Hugging Face · open-source · 2022-01-13
No feed summary available yet.
High signal Matched: latency
CoreWeave · cloud · 2026-06-03
No feed summary available yet.
Watchlist Matched: performance
AWS Machine Learning Blog · cloud · 2026-06-02
In this post, we walk through a practical implementation using KDB-X MCP server integration with Amazon Quick, demonstrating how traders and analysts can ask questions using conversational language and receive actionable insights from data...
Watchlist Matched: performance, mcp
Microsoft Research · big-tech · 2026-05-22
MagenticLite is an agentic system for small models that works across the browser and local file system in a single workflow. It combines specialized models and orchestration to support efficient agentic performance on everyday tasks. The p...
Watchlist Matched: performance, research, agentic
Microsoft Research · big-tech · 2026-05-21
Vega turns a full credential into a single proof, sharing only what is needed and nothing more, with performance that works in real apps. The post Vega: Zero-knowledge proofs for digital identity in the age of AI appeared first on Microsof...
Watchlist Matched: performance, research
AI2 · research · 2026-05-19
OlmoEarth v1.1 is a more efficient family of remote-sensing models that cuts compute costs by up to 3x while maintaining similar performance, making large-scale satellite mapping faster and cheaper to run.
Watchlist Matched: performance
Cloudflare Blog · cloud · 2026-05-13
We’ve enabled higher usage limits, faster performance, better reliability, and increased shipping velocity for our Browser Run product by rebuilding on top of Cloudflare’s Containers. Here’s how.
Watchlist Matched: performance
Lambda · cloud · 2026-05-04
Consider two teams provisioning 8,192 GPUs for a large training run. Same model, same dataset, same budget. Team A lands on a facility purpose-built for AI with sufficient power density, carefully engineered liquid cooling, a high-performa...
Watchlist Matched: performance, model, training
AI2 · research · 2026-02-25
PreScience is a new benchmark that evaluates whether AI can forecast how science unfolds end-to-end, from team formation through eventual impact.
Watchlist Matched: benchmark
LY Corporation Tech Blog · korea · 2026-01-14
Introduction: what are guardrails? Various mechanisms for making AI safer to use are commonly ref...
Watchlist Matched: cost
BAIR · research · 2025-11-01
In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: divide and conquer. Unlike traditional methods, this algorithm is not based on temporal difference (TD) learning (which has scalabilit...
Watchlist Matched: benchmark, performance, model, paper, training
LY Corporation Tech Blog · korea · 2025-09-19
The original article was published on October 24, 2024. Hello, I'm Munetoshi Ishikawa, a mobile clien...
Watchlist Matched: performance
BAIR · research · 2025-03-25
We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and...
Watchlist Matched: throughput, kernel, performance, model, paper, training, agent, agents