MLSys Radar

rag

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

inference serving distributed benchmark hardware model-release rag agents

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference serving benchmark model-release research training evals long-context rag

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

AIBrix · open-source · 2026-03-03

AIBrix v0.6.0 Release: Envoy Sidecar, Mixed LLM Workloads Routing, Routing Profiles, LoRA Delivery & New APIs

Score 28

🚀 AIBrix v0.6.0 Release. Today we’re excited to announce AIBrix v0.6.0, a release that expands how you deploy and route inference traffic. Key highlights include: Envoy Sidecar Support: run Envoy alongside the gateway-plugin without...

inference model-release fine-tuning rag api

High signal Matched: inference, prefill, release, model, lora, rerank, api, openai-compatible

Rebellions · hardware · 2025-12-29

LLM/RAG-Based AI Chatbot for Recommending Goods Classification Codes for the Mongolian Customs Administration

Score 10

Summary. Challenge: The Customs Administration processes a vast volume of import and export declarations every year, and the task of accurately classifying each item under the appropriate HS code (Harmonized System Code)...

korea rag

High signal Matched: rebellions, rag

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

BAIR · research · 2025-04-11

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Score 10

Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications. However, as LLMs have improved, so have the attacks against them. Prompt injection is listed by OWASP as the #1 threat to LLM-integrated ap...

benchmark model-release research training fine-tuning evals rag api frontier-model

High signal Matched: cost, model, evaluation, training, dpo, fine-tuning, retrieval, api, sota

BAIR · research · 2025-04-08

Repurposing Protein Folding Models for Generation with Latent Diffusion

Score 20

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment...

inference benchmark model-release research training rag

High signal Matched: inference, generation, cost, model, weights, research, training, retrieval

SkyPilot · open-source · 2025-02-26

Using DeepSeek R1 for RAG: Do's and Don'ts

Score 10

DeepSeek R1 showed great reasoning capability when it was first released. In this blog post, we detail what we learned from using DeepSeek R1 to build a Retrieval-Augmented Generation (RAG) system tailored for legal documents. We choose l...

inference research rag

High signal Matched: generation, research, rag, retrieval-augmented generation, retrieval