evals

inference serving benchmark hardware model-release research quantization evals

High signal Matched: serving, model, evaluation

Nota AI · korea · 2026-05-29

Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

Score 23

Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...

High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard

AWS Machine Learning Blog · cloud · 2026-05-29

Evaluating Deep Agents using LangSmith on AWS

Score 9

This post combines learnings from LangChain’s work on evaluating deep agents and Anthropic’s guide to demystifying evals for AI agents into a practical guide. In this post, you will learn how to: 1) apply five evaluation patterns for deep...

research cloud evals agents

High signal Matched: evaluation, bedrock, evals, evaluating, agent, agents

AWS Machine Learning Blog · cloud · 2026-05-29

Build a test suite that grows with your agent with dataset management in Amazon Bedrock AgentCore

Score 13

Datasets in AgentCore is in public preview. Agent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchm...

benchmark research cloud evals agents

inference distributed hardware model-release cloud quantization evals

High signal Matched: benchmark, evaluation, bedrock, agent

AMD ROCm Blogs · hardware · 2026-05-25

AI Inference on AMD Ryzen™ AI Max Processor

Score 20

Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...

inference benchmark hardware model-release evals

High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate

Lambda · cloud · 2026-05-20

Lambda’s NVIDIA HGX B200 on STAC-AI™ LANG6

Score 18

What the numbers mean for financial services. Executive summary: Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...

benchmark model-release research evals agents

High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating

NVIDIA Technical Blog · hardware · 2026-05-19

Mastering Agentic Techniques: AI Agent Evaluation

Score 16

Evaluating an AI model and evaluating an AI agent are related—but they answer fundamentally different questions. A model benchmark tests the capability of a...

inference benchmark evals agents

High signal Matched: benchmark, model, evaluation, evaluating, agent, agentic

Together AI · inference-infra · 2026-05-19

Benchmarking inference at scale: coding agents

Score 16

Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.

High signal Matched: inference, ttft, cost, benchmarks, agents

Microsoft Research · big-tech · 2026-05-16

Further Notes on Our Recent Research on AI Delegation and Long-Horizon Reliability

Score 10

Our recent paper, “LLMs Corrupt Your Documents When You Delegate”, has generated discussion about the reliability of AI systems in delegated workflows. We appreciate the interest in this work and want to clarify several important points ab...

benchmark model-release evals

High signal Matched: paper, research, evaluation

AI2 · research · 2026-05-13

Introducing AIMIP: The AI weather and climate model intercomparison project

Score 14

AIMIP is a new open benchmark and dataset for evaluating AI climate models, showing they can match or beat conventional models on some historical climate metrics while still struggling to generalize reliably to long-term warming trends and...

inference serving kernel speculative-decoding moe benchmark hardware model-release research quantization evals agents api

High signal Matched: benchmark, introducing, model, evaluating

Nota AI · korea · 2026-05-11

[NetsPresso® x AI Agents] Easier to Use, Even More Powerful

Score 52

Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...

inference serving kv-cache speculative-decoding benchmark model-release research training fine-tuning evals long-context agents frontier-model

High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api

BAIR · research · 2026-05-08

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

Score 28


distributed training evals

High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model

SkyPilot · open-source · 2026-05-01

Cache Me If You Can: Tuning Object Stores for AI

Score 8

We ran hundreds of benchmarks to tune storage systems for distributed training so you don’t have to.

High signal Matched: distributed, training, distributed training, benchmarks

Nota AI · korea · 2026-04-29

[NVIDIA Nemotron Hackathon] Grand Prize Among 20 Teams: Behind Two Sleepless Days

Score 32

Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...

inference moe benchmark model-release research korea training fine-tuning quantization evals agents

High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic

Nota AI · korea · 2026-04-22

[Deep Dive: NetsPresso®] From Quantization to Graph Optimization: A Step-by-Step Model Deployment Pipeline

Score 54

Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...

inference kernel cuda benchmark hardware model-release research korea training quantization evals api open-source

benchmark model-release research training evals

High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source

BAIR · research · 2026-04-20

Gradient-based Planning for World Models at Longer Horizons

Score 16


inference distributed kv-cache speculative-decoding benchmark hardware model-release research quantization evals

High signal Matched: performance, model, paper, arxiv, evaluation, training

Nota AI · korea · 2026-04-08

[Overview: NetsPresso®] A Platform That Handles Everything from Model Optimization to Target Deployment

Score 36

Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...

inference serving moe benchmark hardware model-release research korea training quantization evals long-context open-source

High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate

Nota AI · korea · 2026-03-13

NotaMoEQuantization: An MoE-Specific Quantization Method for Solar-Open-100B

Score 62

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...

inference serving benchmark model-release research training evals long-context rag

High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference speculative-decoding benchmark model-release research training evals

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

Nota AI · korea · 2026-02-26

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Score 24

High signal Matched: inference, generation, verification, benchmark, performance, latency, cost, model, arxiv, evaluation, training, post-training, benchmarks

Together AI · inference-infra · 2026-02-23

How speech models fail where it matters the most and what to do about it

Score 10

State-of-the-art speech models like Whisper and Deepgram score near-human on benchmarks — then fail 39% of the time on street names. New research from Together AI exposes the gap and a fix.

inference benchmark model-release fine-tuning evals open-source

High signal Matched: research, benchmarks

Together AI · inference-infra · 2026-02-02

Fine-tuning open LLM judges to outperform GPT-5.2

Score 14

Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...

High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss

Hugging Face · open-source · 2026-01-27

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs

Score 10

No feed summary available yet.

inference benchmark model-release research training evals agents open-source

High signal Matched: evaluation

Together AI · inference-infra · 2026-01-26

DSGym: A holistic framework for evaluating and training data science agents

Score 18

Introducing DSGym, a holistic evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...

benchmark model-release research training evals

High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source

BAIR · research · 2026-01-10

Information-Driven Design of Imaging Systems

Score 12

An encoder (optical system) maps objects to noiseless images, which noise corrupts into measurements. Our information estimator uses only these noisy measurements and a noise model to quantify how well measurements distinguish objects. Man...

benchmark model-release evals open-source

High signal Matched: performance, model, paper, evaluation, training, evaluate

Together AI · inference-infra · 2026-01-08

How to choose the right open model for production

Score 20

Learn how to choose the right open-source model for production by evaluating model quality, benchmarking performance, and deploying open models that balance cost, speed, and accuracy.

High signal Matched: performance, cost, model, open model, evaluating, open-source

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

Hugging Face · open-source · 2025-12-17

The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator

Score 10

No feed summary available yet.

inference speculative-decoding hardware quantization evals open-source

High signal Matched: evaluation

Together AI · inference-infra · 2025-12-01

Together AI delivers fastest inference for the top open-source models

Score 20

Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...

inference benchmark model-release research evals api

High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source

AIBrix · open-source · 2025-11-10

AIBrix v0.5.0 Release: Batch API, KVCache v1 Connector, and Enhanced P/D orchestration

Score 22

🚀 AIBrix v0.5.0 Release. Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for hi...

High signal Matched: prefill, latency, release, evaluation, api, openai-compatible

Together AI · inference-infra · 2025-11-04

How to evaluate and benchmark Large Language Models (LLMs)

Score 12

Understanding how to evaluate and benchmark Large Language Models (LLMs). Test, compare, and understand LLMs.

benchmark evals

model-release research evals rag

High signal Matched: benchmark, evaluate

Hugging Face · open-source · 2025-10-01

Introducing RTEB: A New Standard for Retrieval Evaluation

Score 14

No feed summary available yet.

inference serving distributed benchmark evals

High signal Matched: introducing, evaluation, retrieval

llm-d · open-source · 2025-09-24

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

Score 18

See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.

High signal Matched: inference, throughput, distributed, benchmarks

Together AI · inference-infra · 2025-08-27

DeepSeek-V3.1: Hybrid Thinking Model Now Available on Together AI

Score 16

Access DeepSeek-V3.1 on Together AI: MIT-licensed hybrid model with thinking/non-thinking modes, 66% SWE-bench Verified, serverless deployment, 99.9% SLA.

moe model-release evals

model-release evals agents

High signal Matched: deepseek-v3, model, swe-bench

Together AI · inference-infra · 2025-07-25

Qwen3-Coder: The Most Capable Agentic Coding Model Now Available on Together AI

Score 12

Unlock agentic coding with Qwen3-Coder on Together AI: 256K context, SWE-bench rivaling Claude Sonnet 4, zero-setup instant deployment.

inference benchmark model-release research training fine-tuning evals

High signal Matched: model, swe-bench, agentic

Nota AI · korea · 2025-07-10

Video Self-Distillation for Single-Image Encoders: Learning Temporal Priors from Unlabeled Video

Score 20

Marcel Simon, Ph.D., ML Researcher, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI; Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH. Summary: Proposes a simple next-frame prediction task using unlabeled video to enhance sing...

High signal Matched: inference, performance, model, paper, research, training, fine-tuning, benchmarks

Hugging Face · open-source · 2025-07-04

Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models

Score 10

No feed summary available yet.

research training evals

High signal Matched: evaluation, training

Modal · inference-infra · 2025-07-02

How we used evals and inference-time compute scaling to generate beautiful QR codes that actually work

Score 10

There's only one playbook for improving generative applications. Read about it here.

inference evals

inference benchmark model-release research training evals agents

High signal Matched: inference, evals

BAIR · research · 2025-07-01

Whole-Body Conditioned Egocentric Video Prediction

Score 10


High signal Matched: inference, generation, performance, model, paper, arxiv, evaluation, training, evaluate, agent, agents

Hugging Face · open-source · 2025-06-06

ScreenSuite - The most comprehensive evaluation suite for GUI Agents!

Score 10

No feed summary available yet.

research evals agents

inference kv-cache benchmark model-release research training evals open-source

High signal Matched: evaluation, agents

Nota AI · korea · 2025-05-07

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Score 28

model-release evals long-context

High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source

Hugging Face · open-source · 2025-04-16

Introducing HELMET: Holistically Evaluating Long-context Language Models

Score 10

No feed summary available yet.

benchmark model-release research training fine-tuning evals rag api frontier-model

High signal Matched: introducing, evaluating, long-context

BAIR · research · 2025-04-11

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Score 10

Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications. However, as LLMs have improved, so have the attacks against them. Prompt injection attack is listed as the #1 threat by OWASP to LLM-integrated ap...

inference kernel benchmark model-release research evals

High signal Matched: cost, model, evaluation, training, dpo, fine-tuning, retrieval, api, sota

Nota AI · korea · 2025-04-08

UniForm: A Reuse Attention Mechanism for Efficient Transformers on Resource-Constrained Edge Devices

Score 24

Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...

High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks

SqueezeBits · korea · 2025-02-27

Fits on Chips: Saving LLM Costs Became Easier Than Ever

Score 10

This article introduces Fits on Chips, an LLMOps toolkit for performance evaluation.

benchmark research evals

benchmark research training evals

High signal Matched: performance, evaluation

Nota AI · korea · 2025-02-10

Where do LLMs Encode the Knowledge to Assess the Ambiguity?

Score 16

Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI. Summary: In this study, we present a method for detecting ambiguous samples in natural language understanding (NLU) tasks using...

benchmark hardware model-release quantization evals

High signal Matched: performance, paper, research, evaluation, training, evaluate

SqueezeBits · korea · 2025-01-13

[Intel Gaudi] #4. FP8 Quantization

Score 20

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

High signal Matched: performance, accelerator, fp8, quantization, evaluate

Hugging Face · open-source · 2025-01-09

CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard

Score 10

No feed summary available yet.

benchmark evals

benchmark hardware research evals

High signal Matched: performance, leaderboard

SqueezeBits · korea · 2025-01-06

[Intel Gaudi] #3. Performance Evaluation with SynapseAI v1.19

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

High signal Matched: performance, accelerator, evaluation, evaluate

Hugging Face · open-source · 2024-12-04

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Score 14

No feed summary available yet.

benchmark research evals

benchmark hardware research evals

High signal Matched: benchmark, evaluation, leaderboard

SqueezeBits · korea · 2024-12-03

[Intel Gaudi] #2. Graph Compiler and Overall Performance Evaluation

Score 18

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

High signal Matched: performance, accelerator, evaluation, evaluate

SqueezeBits · korea · 2024-11-21

[Intel Gaudi] #1. Introduction

Score 12

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.

benchmark hardware evals

High signal Matched: performance, accelerator, evaluate

Hugging Face · open-source · 2024-11-20

Introducing the Open Leaderboard for Japanese LLMs!

Score 10

No feed summary available yet.

research fine-tuning evals

High signal Matched: introducing, leaderboard

Hugging Face · open-source · 2024-11-04

Argilla 2.4: Easily Build Fine-Tuning and Evaluation Datasets on the Hub — No Code Required

Score 10

No feed summary available yet.

High signal Matched: evaluation, fine-tuning

Hugging Face · open-source · 2024-10-04

Introducing the Open FinLLM Leaderboard

Score 10

No feed summary available yet.

inference serving benchmark research evals

High signal Matched: introducing, leaderboard

SqueezeBits · korea · 2024-10-01

[vLLM vs TensorRT-LLM] #1. An Overall Evaluation

Score 22

This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...

High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating

Modal · inference-infra · 2024-08-05

Beat GPT-4o at Python by searching with 100 dumb LLaMAs

Score 8

Scale up smaller open models with search and evaluation to match frontier capabilities.

inference benchmark model-release research cloud training fine-tuning evals open-source

High signal Matched: evaluation

Nota AI · korea · 2024-08-02

Deploying an Efficient Vision-Language Model on Mobile Devices

Score 38

Jaeyeon Kim, Research Engineer, Nota AI; Geonmin Kim, Research Engineer, Nota AI; Hancheol Park, Team Lead of NetsPresso Application, Nota AI. Introduction: Recent large language models (LLMs) have demonstrated unprecedented performance...

research fine-tuning evals

High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source

Hugging Face · open-source · 2024-07-25

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

Score 10

No feed summary available yet.

benchmark model-release research evals

High signal Matched: evaluation, fine-tuning

Nota AI · korea · 2024-06-13

Cluster Self-Refinement for Enhanced Online Multi-Camera People Tracking

Score 8

Jeongho Kim, Research Engineer, Nota AI. Summary: Online multi-camera system for efficient individual tracking. Accurate ID management with Cluster Self-Refinement (CSR). Improved performance with enhanced pose estimation. Intro...

High signal Matched: performance, model, paper, research, evaluation, leaderboard

Hugging Face · open-source · 2024-05-24

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

Score 10

No feed summary available yet.

High signal Matched: evaluation

Hugging Face · open-source · 2024-05-14

Introducing the Open Arabic LLM Leaderboard

Score 10

No feed summary available yet.

High signal Matched: introducing, leaderboard

Hugging Face · open-source · 2024-05-05

Introducing the Open Leaderboard for Hebrew LLMs!

Score 10

No feed summary available yet.

High signal Matched: introducing, leaderboard

Hugging Face · open-source · 2024-05-03

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Score 10

No feed summary available yet.

benchmark evals

High signal Matched: performance, leaderboard

Hugging Face · open-source · 2024-04-23

Introducing the Open Chain of Thought Leaderboard

Score 10

No feed summary available yet.

model-release research evals

High signal Matched: introducing, leaderboard

Hugging Face · open-source · 2024-04-16

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Score 14

No feed summary available yet.

High signal Matched: introducing, evaluation, leaderboard

Hugging Face · open-source · 2024-02-23

Introducing the Red-Teaming Resistance Leaderboard

Score 10

No feed summary available yet.

model-release research korea evals

High signal Matched: introducing, leaderboard

Hugging Face · open-source · 2024-02-20

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

Score 18

No feed summary available yet.

High signal Matched: introducing, evaluation, korean, leaderboard

Hugging Face · open-source · 2024-01-31

Introducing the Enterprise Scenarios Leaderboard: a Leaderboard for Real World Use Cases

Score 10

No feed summary available yet.

High signal Matched: introducing, leaderboard

Hugging Face · open-source · 2022-10-24

Evaluating Language Model Bias with 🤗 Evaluate

Score 10

No feed summary available yet.

High signal Matched: model, evaluate, evaluating

Hugging Face · open-source · 2022-06-28

Announcing Evaluation on the Hub

Score 10

No feed summary available yet.

High signal Matched: evaluation

Mistral AI · model-lab · 2026-06-03

Forge Train, align, and evaluate custom AI models.

Score 5

No feed summary available yet.

Watchlist Matched: evaluate

Anthropic · model-lab · 2026-06-03

Eval awareness in Claude Opus 4.6’s BrowseComp performance

Score 5

No feed summary available yet.

Watchlist Matched: eval

Anthropic · model-lab · 2026-06-03

Demystifying evals for AI agents

Score 5

No feed summary available yet.

Watchlist Matched: evals

Stanford CRFM · research · 2026-06-03

HELM Capabilities: Evaluating LMs Capability by Capability

Score 2

No feed summary available yet.

Watchlist Matched: evaluating

Stanford CRFM · research · 2026-06-03

ThaiExam Leaderboard in HELM

Score 2

No feed summary available yet.

Watchlist Matched: leaderboard

vLLM Project · open-source · 2026-05-11

vLLM Tops the Artificial Analysis Leaderboard

Score 3

How vLLM built the leading deployments of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B.

Watchlist Matched: leaderboard

AI2 · research · 2026-05-11

Why Artificial Analysis uses Ai2's IFBench instruction-following eval

Score 0

Artificial Analysis uses Ai2’s open IFBench eval because it captures a stubborn, real-world capability many benchmarks miss: whether models can reliably follow complex, multi-part user instructions.

Watchlist Matched: eval, benchmarks

Hugging Face · open-source · 2026-05-06

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Together AI · inference-infra · 2026-04-30

Announcing Together AI and Adaption Partnership

Score 3

Together AI and Adaption partner to bring Together Fine-Tuning natively into Adaptive Data, helping teams optimize datasets, run fine-tuning, evaluate results, and deploy stronger open models.

fine-tuning evals

Watchlist Matched: fine-tuning, evaluate

Hugging Face · open-source · 2026-04-21

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

AI2 · research · 2026-04-13

Evaluating agents for scientific discovery

Score 0

Two benchmarks developed at Ai2 – ScienceWorld and DiscoveryWorld – reveal that even incredibly strong AI science agents struggle with problems human scientists solve routinely.

Watchlist Matched: evaluating, benchmarks, agents

Google Research · big-tech · 2026-04-03

Evaluating alignment of behavioral dispositions in LLMs

Score 0

Generative AI

Watchlist Matched: evaluating

Google Research · big-tech · 2026-04-01

Building better AI benchmarks: How many raters are enough?

Score 0

Algorithms & Theory

Watchlist Matched: benchmarks

Hugging Face · open-source · 2026-03-24

A New Framework for Evaluating Voice Agents (EVA)

Score 1

No feed summary available yet.

Watchlist Matched: evaluating, agents

Hugging Face · open-source · 2026-02-12

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

Score 1

No feed summary available yet.

Watchlist Matched: evaluating, agents

Hugging Face · open-source · 2026-02-04

Community Evals: Because we're done trusting black-box leaderboards over the community

Score 1

No feed summary available yet.

Watchlist Matched: evals

Hugging Face · open-source · 2026-01-21

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

Score 1

No feed summary available yet.

Watchlist Matched: benchmarks, agent

Hugging Face · open-source · 2025-11-21

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Together AI · inference-infra · 2025-10-28

Dynamic AI agent testing for the real world with Collinear Simulations and Together Evals

Score 3

Test AI agents in the real world with Collinear TraitMix and Together Evals: dynamic persona simulations, multi-turn dialogs, and LLM-as-judge scoring.

Watchlist Matched: evals, agent, agents

Google Research · big-tech · 2025-08-26

A scalable framework for evaluating health language models

Score 0

Generative AI

Watchlist Matched: evaluating

Hugging Face · open-source · 2025-07-17

Back to The Future: Evaluating AI Agents on Predicting Future Events

Score 0

No feed summary available yet.

Watchlist Matched: evaluating, agents

Together AI · inference-infra · 2025-06-12

From Zero to One: Building An Autonomous and Open Data Scientist Agent from Scratch

Score 3

Build a data scientist agent using Together’s open-source models and Code Interpreter—easy to implement, solid benchmarks, and full code on GitHub.

evals agents open-source

Watchlist Matched: benchmarks, agent, open-source

Hugging Face · open-source · 2025-02-28

Trace & Evaluate your Agent with Arize Phoenix

Score 1

No feed summary available yet.

Watchlist Matched: evaluate, agent

Hugging Face · open-source · 2025-02-14

Fixing Open LLM Leaderboard with Math-Verify

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2025-02-10

The Open Arabic LLM Leaderboard 2

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2024-12-20

Evaluating Audio Reasoning with Big Bench Audio

Score 1

No feed summary available yet.

Watchlist Matched: evaluating

Modular · inference-infra · 2024-12-19

Evaluating Llama Guard with MAX 24.6 and Hugging Face

Score 1


Watchlist Matched: evaluating

Replicate · inference-infra · 2024-06-28

Replicate Intelligence #6

Score 6

Google's Gemma2 models, language model leaderboard, tips for Stable Diffusion 3

Watchlist Matched: model, leaderboard

Hugging Face · open-source · 2024-06-06

Launching the Artificial Analysis Text to Image Leaderboard & Arena

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2024-04-19

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2024-02-02

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2024-01-29

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2024-01-26

An Introduction to AI Secure LLM Safety Leaderboard

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2024-01-12

A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2023-12-01

Open LLM Leaderboard: DROP deep dive

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2023-09-18

Object Detection Leaderboard

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2023-06-23

What's going on with the Open LLM Leaderboard?

Score 1

No feed summary available yet.

Watchlist Matched: leaderboard

Hugging Face · open-source · 2022-10-03

Very Large Language Models and How to Evaluate Them

Score 1

No feed summary available yet.