Moreh · korea · 2026-06-03
Moreh vLLM Performance Evaluation: DeepSeek V3/R1 671B on AMD Instinct MI300X GPUs
No feed summary available yet.
High signal Matched: performance, mi300x, evaluation
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: performance, mi300x, evaluation
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: performance, mi300x, evaluation
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: performance, benchmarks
Stanford CRFM · research · 2026-06-03
No feed summary available yet.
High signal Matched: model, evaluation
vLLM Project · open-source · 2026-06-01
A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evalua...
High signal Matched: serving, model, evaluation
Nota AI · korea · 2026-05-29
Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...
High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard
AWS Machine Learning Blog · cloud · 2026-05-29
This post combines learnings from LangChain’s work on evaluating deep agents and Anthropic’s guide to demystifying evals for AI agents into a practical guide. In this post, you will learn how to: 1) apply five evaluation patterns for deep...
High signal Matched: evaluation, bedrock, evals, evaluating, agent, agents
AWS Machine Learning Blog · cloud · 2026-05-29
Datasets in AgentCore is in public preview. Agent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchm...
High signal Matched: benchmark, evaluation, bedrock, agent
AMD ROCm Blogs · hardware · 2026-05-25
Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...
High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate
Lambda · cloud · 2026-05-20
What the numbers mean for financial services. Executive summary: Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...
High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating
NVIDIA Technical Blog · hardware · 2026-05-19
Evaluating an AI model and evaluating an AI agent are related—but they answer fundamentally different questions. A model benchmark tests the capability of a...
High signal Matched: benchmark, model, evaluation, evaluating, agent, agentic
Together AI · inference-infra · 2026-05-19
Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
High signal Matched: inference, ttft, cost, benchmarks, agents
Microsoft Research · big-tech · 2026-05-16
Our recent paper, “LLMs Corrupt Your Documents When You Delegate”, has generated discussion about the reliability of AI systems in delegated workflows. We appreciate the interest in this work and want to clarify several important points ab...
High signal Matched: paper, research, evaluation
AI2 · research · 2026-05-13
AIMIP is a new open benchmark and dataset for evaluating AI climate models, showing they can match or beat conventional models on some historical climate metrics while still struggling to generalize reliably to long-term warming trends and...
High signal Matched: benchmark, introducing, model, evaluating
Nota AI · korea · 2026-05-11
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
BAIR · research · 2026-05-08
No feed summary available yet.
High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model
SkyPilot · open-source · 2026-05-01
We ran hundreds of benchmarks to tune storage systems for distributed training so you don’t have to.
High signal Matched: distributed, training, distributed training, benchmarks
Nota AI · korea · 2026-04-29
Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...
High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic
Nota AI · korea · 2026-04-22
Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...
High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source
BAIR · research · 2026-04-20
No feed summary available yet.
High signal Matched: performance, model, paper, arxiv, evaluation, training
Nota AI · korea · 2026-04-08
Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...
High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate
Nota AI · korea · 2026-03-13
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
BAIR · research · 2026-03-13
Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...
High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag
Nota AI · korea · 2026-02-26
Jewon Lee | Wooksu Shin | Seungmin Yang | Ki-Ung Song | Donguk Lim | Jaeyeon Kim | Tae-Ho Kim | Bo-Kyeong Kim, EdgeFM Team, Nota AI. ✔️ Resources for more information: GitHub, ArXiv, Project Page, Demo. ✔️ Accepted at ICLR 2026. &...
High signal Matched: inference, generation, verification, benchmark, performance, latency, cost, model, arxiv, evaluation, training, post-training, benchmarks
Together AI · inference-infra · 2026-02-23
State-of-the-art speech models like Whisper and Deepgram score near-human on benchmarks — then fail 39% of the time on street names. New research from Together AI exposes the gap and a fix.
High signal Matched: research, benchmarks
Together AI · inference-infra · 2026-02-02
Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...
High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss
Hugging Face · open-source · 2026-01-27
No feed summary available yet.
High signal Matched: evaluation
Together AI · inference-infra · 2026-01-26
Introducing DSGym, a holistic evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...
High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source
BAIR · research · 2026-01-10
An encoder (optical system) maps objects to noiseless images, which noise corrupts into measurements. Our information estimator uses only these noisy measurements and a noise model to quantify how well measurements distinguish objects. Man...
High signal Matched: performance, model, paper, evaluation, training, evaluate
Together AI · inference-infra · 2026-01-08
Learn how to choose the right open-source model for production by evaluating model quality, benchmarking performance, and deploying open models that balance cost, speed, and accuracy.
High signal Matched: performance, cost, model, open model, evaluating, open-source
Nota AI · korea · 2025-12-19
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
Hugging Face · open-source · 2025-12-17
No feed summary available yet.
High signal Matched: evaluation
Together AI · inference-infra · 2025-12-01
Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...
High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source
AIBrix · open-source · 2025-11-10
🚀 AIBrix v0.5.0 Release. Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for hi...
High signal Matched: prefill, latency, release, evaluation, api, openai-compatible
Together AI · inference-infra · 2025-11-04
Understanding how to evaluate and benchmark Large Language Models (LLMs). Test, compare, and understand LLMs.
High signal Matched: benchmark, evaluate
Hugging Face · open-source · 2025-10-01
No feed summary available yet.
High signal Matched: introducing, evaluation, retrieval
llm-d · open-source · 2025-09-24
See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.
High signal Matched: inference, throughput, distributed, benchmarks
Together AI · inference-infra · 2025-08-27
Access DeepSeek-V3.1 on Together AI: MIT-licensed hybrid model with thinking/non-thinking modes, 66% SWE-bench Verified, serverless deployment, 99.9% SLA.
High signal Matched: deepseek-v3, model, swe-bench
Together AI · inference-infra · 2025-07-25
Unlock agentic coding with Qwen3-Coder on Together AI: 256K context, SWE-bench rivaling Claude Sonnet 4, zero-setup instant deployment.
High signal Matched: model, swe-bench, agentic
Nota AI · korea · 2025-07-10
Marcel Simon, Ph.D., ML Researcher, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI; Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH. Summary: Proposes a simple next-frame prediction task using unlabeled video to enhance sing...
High signal Matched: inference, performance, model, paper, research, training, fine-tuning, benchmarks
Hugging Face · open-source · 2025-07-04
No feed summary available yet.
High signal Matched: evaluation, training
Modal · inference-infra · 2025-07-02
There's only one playbook for improving generative applications. Read about it here.
High signal Matched: inference, evals
BAIR · research · 2025-07-01
No feed summary available yet.
High signal Matched: inference, generation, performance, model, paper, arxiv, evaluation, training, evaluate, agent, agents
Hugging Face · open-source · 2025-06-06
No feed summary available yet.
High signal Matched: evaluation, agents
Nota AI · korea · 2025-05-07
Jewon Lee | Ki-Ung Song | Seungmin Yang | Donguk Lim | Jaeyeon Kim | Wooksu Shin | Bo-Kyeong Kim | Tae-Ho Kim, EdgeFM Team, Nota AI; Yong Jae Lee, Ph.D., Associate Professor, UW-Madison. Summary: Our method, Trimmed-Llama, reduces t...
High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source
Hugging Face · open-source · 2025-04-16
No feed summary available yet.
High signal Matched: introducing, evaluating, long-context
BAIR · research · 2025-04-11
Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications. However, as LLMs have improved, so have the attacks against them. Prompt injection attack is listed as the #1 threat by OWASP to LLM-integrated ap...
High signal Matched: cost, model, evaluation, training, dpo, fine-tuning, retrieval, api, sota
Nota AI · korea · 2025-04-08
Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...
High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks
SqueezeBits · korea · 2025-02-27
This article introduces Fits on Chips, an LLMOps toolkit for performance evaluation.
High signal Matched: performance, evaluation
Nota AI · korea · 2025-02-10
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI. Summary: In this study, we present a method for detecting ambiguous samples in natural language understanding (NLU) tasks using...
High signal Matched: performance, paper, research, evaluation, training, evaluate
SqueezeBits · korea · 2025-01-13
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, fp8, quantization, evaluate
Hugging Face · open-source · 2025-01-09
No feed summary available yet.
High signal Matched: performance, leaderboard
SqueezeBits · korea · 2025-01-06
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluation, evaluate
Hugging Face · open-source · 2024-12-04
No feed summary available yet.
High signal Matched: benchmark, evaluation, leaderboard
SqueezeBits · korea · 2024-12-03
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluation, evaluate
SqueezeBits · korea · 2024-11-21
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
High signal Matched: performance, accelerator, evaluate
Hugging Face · open-source · 2024-11-20
No feed summary available yet.
High signal Matched: introducing, leaderboard
Hugging Face · open-source · 2024-11-04
No feed summary available yet.
High signal Matched: evaluation, fine-tuning
Hugging Face · open-source · 2024-10-04
No feed summary available yet.
High signal Matched: introducing, leaderboard
SqueezeBits · korea · 2024-10-01
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...
High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating
Modal · inference-infra · 2024-08-05
Scale up smaller open models with search and evaluation to match frontier capabilities.
High signal Matched: evaluation
Nota AI · korea · 2024-08-02
Jaeyeon Kim, Research Engineer, Nota AI; Geonmin Kim, Research Engineer, Nota AI; Hancheol Park, Team Lead of NetsPresso Application, Nota AI. Introduction: Recent large language models (LLMs) have demonstrated unprecedented performance...
High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source
Hugging Face · open-source · 2024-07-25
No feed summary available yet.
High signal Matched: evaluation, fine-tuning
Nota AI · korea · 2024-06-13
Jeongho Kim, Research Engineer, Nota AI. Summary: Online multi-camera system for efficient individual tracking. Accurate ID management with Cluster Self-Refinement (CSR). Improved performance with enhanced pose estimation. Intro...
High signal Matched: performance, model, paper, research, evaluation, leaderboard
Hugging Face · open-source · 2024-05-24
No feed summary available yet.
High signal Matched: evaluation
Hugging Face · open-source · 2024-05-14
No feed summary available yet.
High signal Matched: introducing, leaderboard
Hugging Face · open-source · 2024-05-05
No feed summary available yet.
High signal Matched: introducing, leaderboard
Hugging Face · open-source · 2024-05-03
No feed summary available yet.
High signal Matched: performance, leaderboard
Hugging Face · open-source · 2024-04-23
No feed summary available yet.
High signal Matched: introducing, leaderboard
Hugging Face · open-source · 2024-04-16
No feed summary available yet.
High signal Matched: introducing, evaluation, leaderboard
Hugging Face · open-source · 2024-02-23
No feed summary available yet.
High signal Matched: introducing, leaderboard
Hugging Face · open-source · 2024-02-20
No feed summary available yet.
High signal Matched: introducing, evaluation, korean, leaderboard
Hugging Face · open-source · 2024-01-31
No feed summary available yet.
High signal Matched: introducing, leaderboard
Hugging Face · open-source · 2022-10-24
No feed summary available yet.
High signal Matched: model, evaluate, evaluating
Hugging Face · open-source · 2022-06-28
No feed summary available yet.
High signal Matched: evaluation
Mistral AI · model-lab · 2026-06-03
No feed summary available yet.
Watchlist Matched: evaluate
Anthropic · model-lab · 2026-06-03
No feed summary available yet.
Watchlist Matched: eval
Anthropic · model-lab · 2026-06-03
No feed summary available yet.
Watchlist Matched: evals
Stanford CRFM · research · 2026-06-03
No feed summary available yet.
Watchlist Matched: evaluating
Stanford CRFM · research · 2026-06-03
No feed summary available yet.
Watchlist Matched: leaderboard
vLLM Project · open-source · 2026-05-11
How vLLM built the leading deployments of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B.
Watchlist Matched: leaderboard
AI2 · research · 2026-05-11
Artificial Analysis uses Ai2’s open IFBench eval because it captures a stubborn, real-world capability many benchmarks miss: whether models can reliably follow complex, multi-part user instructions.
Watchlist Matched: eval, benchmarks
Hugging Face · open-source · 2026-05-06
No feed summary available yet.
Watchlist Matched: leaderboard
Together AI · inference-infra · 2026-04-30
Together AI and Adaption partner to bring Together Fine-Tuning natively into Adaptive Data, helping teams optimize datasets, run fine-tuning, evaluate results, and deploy stronger open models.
Watchlist Matched: fine-tuning, evaluate
Hugging Face · open-source · 2026-04-21
No feed summary available yet.
Watchlist Matched: leaderboard
AI2 · research · 2026-04-13
Two benchmarks developed at Ai2 – ScienceWorld and DiscoveryWorld – reveal that even incredibly strong AI science agents struggle with problems human scientists solve routinely.
Watchlist Matched: evaluating, benchmarks, agents
Google Research · big-tech · 2026-04-03
Generative AI
Watchlist Matched: evaluating
Google Research · big-tech · 2026-04-01
Algorithms & Theory
Watchlist Matched: benchmarks
Hugging Face · open-source · 2026-03-24
No feed summary available yet.
Watchlist Matched: evaluating, agents
Hugging Face · open-source · 2026-02-12
No feed summary available yet.
Watchlist Matched: evaluating, agents
Hugging Face · open-source · 2026-02-04
No feed summary available yet.
Watchlist Matched: evals
Hugging Face · open-source · 2026-01-21
No feed summary available yet.
Watchlist Matched: benchmarks, agent
Hugging Face · open-source · 2025-11-21
No feed summary available yet.
Watchlist Matched: leaderboard
Together AI · inference-infra · 2025-10-28
Test AI agents in the real world with Collinear TraitMix and Together Evals: dynamic persona simulations, multi-turn dialogs, and LLM-as-judge scoring.
Watchlist Matched: evals, agent, agents
Google Research · big-tech · 2025-08-26
Generative AI
Watchlist Matched: evaluating
Hugging Face · open-source · 2025-07-17
No feed summary available yet.
Watchlist Matched: evaluating, agents
Together AI · inference-infra · 2025-06-12
Build a data scientist agent using Together’s open-source models and Code Interpreter—easy to implement, solid benchmarks, and full code on GitHub.
Watchlist Matched: benchmarks, agent, open-source
Hugging Face · open-source · 2025-02-28
No feed summary available yet.
Watchlist Matched: evaluate, agent
Hugging Face · open-source · 2025-02-14
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2025-02-10
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2024-12-20
No feed summary available yet.
Watchlist Matched: evaluating
Modular · inference-infra · 2024-12-19
Evaluating Llama Guard with MAX 24.6 and Hugging Face
Watchlist Matched: evaluating
Replicate · inference-infra · 2024-06-28
Google's Gemma2 models, language model leaderboard, tips for Stable Diffusion 3
Watchlist Matched: model, leaderboard
Hugging Face · open-source · 2024-06-06
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2024-04-19
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2024-02-02
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2024-01-29
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2024-01-26
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2024-01-12
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2023-12-01
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2023-09-18
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2023-06-23
No feed summary available yet.
Watchlist Matched: leaderboard
Hugging Face · open-source · 2022-10-03
No feed summary available yet.
Watchlist Matched: evaluate