rag - MLSys Blogs

Lambda · cloud · 2026-06-01

Unbox one of NVIDIA's first co-packaged optics switches with us. See why we bet on CPO early.

Score 15

When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...

inference serving distributed benchmark hardware model-release rag agents

Open

High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic

LMCache · open-source · 2026-04-04

LMCache’s New Architecture Boosts MoE Inference Performance by 10×

Score 34

Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...

inference serving kv-cache moe benchmark rag agents

Open

High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents

BAIR · research · 2026-03-13

Identifying Interactions at Scale for LLMs

Score 18

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...

inference serving benchmark model-release research training evals long-context rag

Open

High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag

vLLM Project · open-source · 2026-03-10

vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain

Score 18

Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...

moe model-release rag

Open

High signal Matched: router, release, model, retrieval

AIBrix · open-source · 2026-03-03

AIBrix v0.6.0 Release: Envoy Sidecar, Mixed LLM Workloads Routing, Routing Profiles, LoRA Delivery & New APIs

Score 28

🚀 AIBrix v0.6.0 Release: Today we’re excited to announce AIBrix v0.6.0, a release that expands how you deploy and route inference traffic. Key highlights include: Envoy Sidecar Support – Run Envoy alongside the gateway-plugin without...

inference model-release fine-tuning rag api

Open

High signal Matched: inference, prefill, release, model, lora, rerank, api, openai-compatible

Rebellions · hardware · 2025-12-29

LLM/RAG-Based AI Chatbot for Recommending Goods Classification Codes for the Mongolian Customs Administration

Score 10

Summary. Challenge: Every year the customs administration processes a vast volume of import and export declarations and must accurately classify each item under the appropriate HS code (Harmonized System Code), a task that...

korea rag

Open

High signal Matched: rebellions, rag

Nota AI · korea · 2025-12-19

NVIDIA Blackwell; The Impact of NVFP4 For LLM Inference

Score 74

Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...

inference serving kernel cuda distributed benchmark hardware model-release research training quantization evals rag

Open

High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval

SkyPilot · open-source · 2025-12-02

Batch Inference for Documents with DeepSeek-OCR using a Pool of Workers on any Clouds

Score 10

Scale document OCR batch inference for RAG on multiple clouds and Kubernetes clusters using SkyPilot Pool.

inference rag

Open

High signal Matched: inference, rag

Hugging Face · open-source · 2025-10-01

Introducing RTEB: A New Standard for Retrieval Evaluation

Score 14

No feed summary available yet.

model-release research evals rag

Open

High signal Matched: introducing, evaluation, retrieval

BAIR · research · 2025-04-11

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Score 10

Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications. However, as LLMs have improved, so have the attacks against them. Prompt injection attack is listed as the #1 threat by OWASP to LLM-integrated ap...

benchmark model-release research training fine-tuning evals rag api frontier-model

Open

High signal Matched: cost, model, evaluation, training, dpo, fine-tuning, retrieval, api, sota

BAIR · research · 2025-04-08

Repurposing Protein Folding Models for Generation with Latent Diffusion

Score 20

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment...

inference benchmark model-release research training rag

Open

High signal Matched: inference, generation, cost, model, weights, research, training, retrieval

SkyPilot · open-source · 2025-02-26

Using DeepSeek R1 for RAG: Do's and Don'ts

Score 10

DeepSeek R1 showed great reasoning capability when it was first released. In this blog post, we detail our learnings in using DeepSeek R1 to build a Retrieval-Augmented Generation (RAG) system, tailored for legal documents. We choose l...

inference research rag

Open

High signal Matched: generation, research, rag, retrieval-augmented generation, retrieval

Hugging Face · open-source · 2024-05-09

Building Cost-Efficient Enterprise RAG applications with Intel Gaudi 2 and Intel Xeon

Score 10

No feed summary available yet.

benchmark rag

Open

High signal Matched: cost, rag

Replicate · inference-infra · 2023-10-17

How to use retrieval augmented generation with ChromaDB and Mistral

Score 10

In this post we'll explore the basics of retrieval augmented generation by creating an example app that uses bge-large-en for embeddings, ChromaDB for vector store, and mistral-7b-instruct for language model generation.

inference model-release rag

Open

High signal Matched: generation, model, retrieval augmented generation, retrieval

Hugging Face · open-source · 2021-02-10

Retrieval Augmented Generation with Huggingface Transformers and Ray

Score 10

No feed summary available yet.

inference rag

Open

High signal Matched: generation, retrieval augmented generation, retrieval

Cohere · model-lab · 2026-06-03

Embed: A leading multimodal search and retrieval tool

Score 3

No feed summary available yet.

rag

Open

Watchlist Matched: retrieval

Hugging Face · open-source · 2026-05-15

Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality

Score 1

No feed summary available yet.

rag

Open

Watchlist Matched: retrieval

NVIDIA Technical Blog · hardware · 2026-03-24

Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

Score 3

Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale,...

rag agents

Open

Watchlist Matched: rag, retrieval, agents, agentic

Google Research · big-tech · 2025-10-08