Moreh · korea · 2026-06-03
Optimizing Long-Context Prefill on Multiple (Older-Generation) GPU Nodes
No feed summary available yet.
High signal Matched: prefill, generation, gpu, long-context
NVIDIA Dynamo · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: inference, agentic
VESSL AI · korea · 2026-06-03
No feed summary available yet.
High signal Matched: inference, gpu
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: inference, distributed
Moreh · korea · 2026-06-03
No feed summary available yet.
High signal Matched: inference, mi300x
NVIDIA Dynamo · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: serving
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: serving
Mooncake · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: serving
Gcore · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: inference, training
Perplexity Research · model-lab · 2026-06-03
No feed summary available yet.
High signal Matched: generation
Perplexity Research · model-lab · 2026-06-03
No feed summary available yet.
High signal Matched: inference
KubeAI · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: generation
Gcore · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: inference
BentoML · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, serve, performance, model
TensorRT-LLM · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: generation, distributed
TensorRT-LLM · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: decoding, speculative decoding
BentoML · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, serve
TensorRT-LLM · open-source · 2026-06-03
No feed summary available yet.
High signal Matched: decoding
BentoML · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference
Nebius · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: decoding, speculative decoding, training
Crusoe · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: inference, model
Crusoe · cloud · 2026-06-03
No feed summary available yet.
High signal Matched: inference
FriendliAI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, kv cache, gpu
LightSeek Foundation · research · 2026-06-03
No feed summary available yet.
High signal Matched: inference, decoding, speculative decoding, model, training
FriendliAI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, performance
FuriosaAI · hardware · 2026-06-03
No feed summary available yet.
High signal Matched: inference, generation, agentic
Fireworks AI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, moe
LightSeek Foundation · research · 2026-06-03
No feed summary available yet.
High signal Matched: decoding, speculative decoding, eagle, training
LightSeek Foundation · research · 2026-06-03
No feed summary available yet.
High signal Matched: inference, kernel, performance, agentic
FriendliAI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference
Fireworks AI · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: inference, model, open model
Baseten · inference-infra · 2026-06-03
No feed summary available yet.
High signal Matched: generation
Mistral AI · model-lab · 2026-06-03
No feed summary available yet.
High signal Matched: inference, training
Cerebras · hardware · 2026-06-03
No feed summary available yet.
High signal Matched: inference
Together AI · inference-infra · 2026-06-02
How Together served MiniMax-M3 efficiently with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway.
High signal Matched: inference, serving
AWS Machine Learning Blog · cloud · 2026-06-02
GPT-5.5, GPT-5.4, and Codex are now generally available on Amazon Bedrock. Deploy them in production applications and agents today, on Bedrock’s high-performance inference engine.
High signal Matched: inference, performance, bedrock, agents
AWS Machine Learning Blog · cloud · 2026-06-02
If you’re iterating on deploying large language models (LLMs) on AWS GPU instances, you’ve probably noticed the larger the model to be loaded into GPU High Bandwidth Memory (HBM), the longer the painful wait until the GPUs are ready for in...
High signal Matched: inference, gpu, model
vLLM Project · open-source · 2026-06-02
We are excited to announce that AutoRound — Intel's state-of-the-art post-training quantization (PTQ) algorithm — is now fully integrated into vLLM-Omni, enabling a streamlined quantize-once,...
High signal Matched: inference, training, post-training, quantization
Lambda · cloud · 2026-06-01
When we design large GPU clusters, the network is no longer a background system. It's part of the compute envelope. At the 800G and NVIDIA GB300 NVL72 scale, the back-end fabric accounts for 86% of networking power in a three-layer cluster...
High signal Matched: generation, token generation, throughput, infiniband, gpu, model, retrieval, agentic
vLLM Project · open-source · 2026-06-01
A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evalua...
High signal Matched: serving, model, evaluation
AWS Machine Learning Blog · cloud · 2026-05-30
This post demonstrates a comprehensive observability solution using Amazon Managed Grafana dashboards that provides a holistic view of both quality and quantity for LLMs served on Amazon SageMaker AI endpoints with inference components.
High signal Matched: inference, gpu, sagemaker
NVIDIA Technical Blog · hardware · 2026-05-29
Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...
High signal Matched: serving, prefill, model
Nota AI · korea · 2026-05-29
Jaehoon Lee, Technical Content Manager, Nota AI. When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device. T...
High signal Matched: inference, throughput, benchmark, performance, latency, cost, gpu, model, evaluation, quantization, int8, benchmarks, leaderboard
Together AI · inference-infra · 2026-05-29
Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem.
High signal Matched: inference, gpu
AWS Machine Learning Blog · cloud · 2026-05-29
This post covers Opus 4.8's improvements and practical guidance for AI engineers integrating the model into agentic systems and production inference workloads on Amazon Bedrock.
High signal Matched: inference, model, bedrock, agentic
NVIDIA Technical Blog · hardware · 2026-05-29
AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and...
High signal Matched: generation
AMD ROCm Blogs · hardware · 2026-05-29
Speculative speculative decoding (SSD) [1] is a recently proposed speculative decoding (SD) algorithm that further accelerates large language model (LLM) inference beyond conventional SD. In standard SD, a small draft model proposes severa...
High signal Matched: inference, decoding, speculative decoding, draft model, verification, cost, mi300x, model
PyTorch Foundation · open-source · 2026-05-28
TL;DR: The TokenSpeed inference engine achieved a record-breaking 580 tps running the Qwen3.5-397B-A17B model on GPUs. This extreme performance for agentic workloads is driven by systematic elimination of memory copies,...
High signal Matched: inference, performance, gpu, model, agentic
vLLM Project · open-source · 2026-05-28
The v0.5.0 release brings significant architectural improvements to speculative decoding model training, introducing DFlash algorithm support, fully unified online training capabilities, and a...
High signal Matched: decoding, speculative decoding, release, introducing, model, training
vLLM Project · open-source · 2026-05-28
Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...
High signal Matched: serving, endpoint, router, model
vLLM Project · open-source · 2026-05-28
As organizations increasingly adopt AI-powered development tools, the need for high-performance agentic models that deliver both accuracy and operational efficiency has become critical. Laguna...
High signal Matched: inference, performance, agentic
vLLM Project · open-source · 2026-05-28
As post-training workloads continue to scale, we've seen widespread adoption of vLLM as the inference engine of choice. However, two issues repeatedly arise:
High signal Matched: inference, training, post-training
NVIDIA Technical Blog · hardware · 2026-05-27
The cold-start problem: In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However,...
High signal Matched: inference
NVIDIA Technical Blog · hardware · 2026-05-27
Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to...
High signal Matched: inference, blackwell
NVIDIA Technical Blog · hardware · 2026-05-27
NVIDIA RTX provides game developers with direct paths to AI-driven characters, frame generation, and ray-traced rendering. This post walks through a meaningful...
High signal Matched: generation
vLLM Project · open-source · 2026-05-26
The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and...
High signal Matched: decoding, speculative decoding, eagle, research
AMD ROCm Blogs · hardware · 2026-05-25
Local large language model (LLM) inference has rapidly evolved, but a persistent limitation remains: model size is constrained by available GPU memory. Discrete GPUs typically offer 8–24 GB of dedicated VRAM, which can limit the size of mo...
High signal Matched: inference, multi-gpu, gpu, model, checkpoint, cloud, quantization, evaluate
Lambda · cloud · 2026-05-22
After 15 months of incremental updates, leaks, and rumored leaks, DeepSeek released version 4. It arrived without the fanfare R1 and R1-preview commanded in early 2025. That quiet reception is the most interesting thing about the release....
High signal Matched: inference, serving, performance, cost, release, model, open-source
AMD ROCm Blogs · hardware · 2026-05-22
Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and other...
High signal Matched: inference, inferencing, serving, triton, benchmark, model, cloud, open-source
Lambda · cloud · 2026-05-21
The unit of AI compute has shifted from single hosts to rack-scale systems that integrate NVIDIA GPUs, CPUs, scale-up networking fabrics, and liquid cooling, such as the NVIDIA GB300 NVL72 and NVIDIA Vera Rubin NVL72. Teams at the frontier...
High signal Matched: serving, performance, cloud, training, api
Modular · inference-infra · 2026-05-21
Why LLM Inference Needs a New Kind of Router - Part 2
High signal Matched: inference, router
LMCache · open-source · 2026-05-21
A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the...
High signal Matched: serving, lmcache, api
Lambda · cloud · 2026-05-20
What the numbers mean for financial services. Executive summary: Lambda is the first to publish an audited STAC-AI™ LANG6 result on NVIDIA HGX B200, with independently verified performance data that Financial Services Industry (FSI) infrastr...
High signal Matched: inference, generation, performance, gpu, h200, b200, model, evaluating
AMD ROCm Blogs · hardware · 2026-05-20
Large Language Models (LLMs) typically contain billions — or even tens of billions — of parameters. During inference, tensor parallelism is commonly employed to distribute the workload across multiple GPUs. This approach demands frequent,...
High signal Matched: inference, latency, introducing, quantization
Together AI · inference-infra · 2026-05-19
Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
High signal Matched: inference, ttft, cost, benchmarks, agents
PyTorch Foundation · open-source · 2026-05-19
TL;DR: Introducing the ExecuTorch MLX Delegate. The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework. The delegate seamlessly integrates with...
High signal Matched: inference, gpu, introducing
Modular · inference-infra · 2026-05-18
Hippocratic AI partners with Modular to power flexible, high-quality inference for real-time patient conversations
High signal Matched: inference
vLLM Project · open-source · 2026-05-18
TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...
High signal Matched: inference, kv cache
Together AI · inference-infra · 2026-05-15
Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.
High signal Matched: inference, endpoint, cost, launch, research
NVIDIA Technical Blog · hardware · 2026-05-14
Agentic inference has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories—actions, observations,...
High signal Matched: inference, introducing, agentic
vLLM Project · open-source · 2026-05-14
Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...
High signal Matched: serving, throughput, kv cache, moe
NVIDIA Technical Blog · hardware · 2026-05-12
The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...
High signal Matched: serving, model, fine-tuning
Modular · inference-infra · 2026-05-12
Inkwell: Why Your Inference Platform Matters As Much As Your Model
High signal Matched: inference, model
Hugging Face · open-source · 2026-05-12
No feed summary available yet.
High signal Matched: inference, model, training
Nota AI · korea · 2026-05-11
Jaehoon Lee, Technical Content Manager, Nota AI. NetsPresso® now embraces AI agents. An easy-to-use interface sits on top of the validated pipeline that handles everything from model compression to device deployment. When a user...
High signal Matched: inference, endpoint, kernel, verification, moe, benchmark, latency, cost, gpu, release, model, evaluation, quantization, quantized, int4, evaluate, benchmarks, swe-bench, mmlu, agent, agents, api
Together AI · inference-infra · 2026-05-11
DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint profiles for long-conte...
High signal Matched: inference, serving, endpoint, kernel, b200, long-context
BAIR · research · 2026-05-08
High signal Matched: inference, decoding, prefill, generation, serve, throughput, kv cache, verification, performance, latency, cost, model, paper, research, evaluation, training, pretraining, sft, benchmarks, long context, context window, agentic, reasoning model
NVIDIA Technical Blog · hardware · 2026-05-08
Bash is one of the most flexible and powerful interfaces exposed to AI agents. In the right system, a model that emits grep, curl, tar, or a shell pipeline is...
High signal Matched: decoding, generation, model, agents
Together AI · inference-infra · 2026-05-08
Learn how to deploy any Hugging Face model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.
High signal Matched: inference, gpu, release, model
Modular · inference-infra · 2026-05-08
Why LLM Inference Needs a New Kind of Router - Part 1
High signal Matched: inference, router
NVIDIA Technical Blog · hardware · 2026-05-07
Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By...
High signal Matched: inference, performance, model, training, post-training, quantization
vLLM Project · open-source · 2026-05-06
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...
High signal Matched: serving, throughput, distributed, kv cache, agentic
Together AI · inference-infra · 2026-05-04
As AI moves from research to production, the challenge for AI-native teams shifts from building models to running them — efficiently, reliably, and at scale.
High signal Matched: inference, research
Modal · inference-infra · 2026-05-04
If we've said it once, we've said it once per millisecond: never block the GPU.
High signal Matched: inference, performance, gpu
NVIDIA Technical Blog · hardware · 2026-04-30
Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches...
High signal Matched: inference, performance
NVIDIA Technical Blog · hardware · 2026-04-30
Today, game developers can begin integrating NVIDIA DLSS 4.5 with Dynamic Multi Frame Generation, Multi Frame Generation 6X, and the second-generation...
High signal Matched: generation
Nota AI · korea · 2026-04-29
Hancheol Park, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, NetsPresso Tech, Nota AI; Geonho Lee, Edge AI Engineer Intern, NetsPresso Tech, Nota AI; Jaehoon Lee, Technical Content Manager,...
High signal Matched: generation, moe, performance, model, weights, paper, research, evaluation, korea, korean, seoul, naver, training, fine-tuning, quantization, agent, agents, agentic
Hugging Face · open-source · 2026-04-29
No feed summary available yet.
High signal Matched: inference
LMCache · open-source · 2026-04-29
For years, we have referred to one of the most critical components of modern LLM inference as a “KV cache.” That name made sense once. Today, it is increasingly misleading. What began as a small, ephemeral optimization inside a...
High signal Matched: inference, kv cache, lmcache
Modal · inference-infra · 2026-04-29
Learn how AE Studio used evolutionary algorithms on Modal to efficiently improve Lean proof generation.
High signal Matched: generation
NVIDIA Technical Blog · hardware · 2026-04-24
DeepSeek just launched its fourth generation of flagship models with DeepSeek-V4-Pro and DeepSeek-V4-Flash, both targeted at enabling highly efficient...
High signal Matched: generation, gpu, blackwell
Together AI · inference-infra · 2026-04-24
Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.
High signal Matched: decoding, speculative decoding, training, post-training
LMCache · open-source · 2026-04-23
Overview: Large language model (LLM) inference performance depends heavily on how efficiently the system manages key-value (KV) cache — the stored attention states that allow the model to avoid recomputing previous tokens. As context length...
High signal Matched: inference, kv cache, lmcache, performance, latency, gpu, model, sagemaker
Nota AI · korea · 2026-04-22
Jaehoon Lee, Technical Content Manager, Nota AI. Series Notice: NetsPresso® Technical Blog, Part 2. In Part 1, we walked through a scenario of deploying Llama 3.2 1B on an edge device to illustrate the NetsPresso® workflow. The f...
High signal Matched: inference, kernel, cuda, matmul, benchmark, performance, latency, cost, npu, model, weights, paper, research, evaluation, furiosa, training, quantization, int8, int4, awq, gptq, sdk, open-source
vLLM Project · open-source · 2026-04-22
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...
High signal Matched: serving, kv cache, gpu, fp8, quantization, long-context
llm-d · open-source · 2026-04-21
How migrating from a simple vLLM deployment to a robust MLOps platform utilizing KServe, llm-d's intelligent routing, and vLLM solved significant scaling and operational challenges in LLM deployment through deep customization and prefix-ca...
High signal Matched: inference, gpu
vLLM Project · open-source · 2026-04-21
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...
High signal Matched: serving
NVIDIA Technical Blog · hardware · 2026-04-20
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...
High signal Matched: generation, throughput, fp8, training
NVIDIA Technical Blog · hardware · 2026-04-17
Coding agents are starting to write production code at scale. Stripe’s agents generate 1,300+ PRs per week. Ramp attributes 30% of merged PRs to agents....
High signal Matched: inference, agents, agentic
LMCache · open-source · 2026-04-16
TL;DR: TurboQuant allows you to put 4x more context in your GPU without blowing up GPU memory or dropping AI’s intelligence. It does so by quantizing the memory of large language models, also known as KV cache, an important bottleneck ment...
High signal Matched: inference, kv cache, lmcache, gpu
SkyPilot · open-source · 2026-04-09
Coding agents working from code alone generate shallow hypotheses. Adding a research phase — arxiv papers, competing forks, other backends — produced 5 kernel fusions that made llama.cpp CPU inference 15% faster.
High signal Matched: inference, kernel, arxiv, research, agent, agents
Nota AI · korea · 2026-04-08
Jaehoon Lee, Technical Content Manager, Nota AI. AI Model Optimization: Why Models Won't Run on Hardware. The Chip Is Ready, but the Model Won't Deploy. If you have ever tried deploying an AI model onto your own chip, the following...
High signal Matched: inference, multi-gpu, kv cache, verification, performance, latency, gpu, model, research, evaluation, quantization, quantized, awq, gptq, evaluate
Modal · inference-infra · 2026-04-08
How Physical Intelligence runs remote, real-time, robotic inference on Modal.
High signal Matched: inference
vLLM Project · open-source · 2026-04-07
TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...
High signal Matched: inference, prefill, itl, gpu, mi300x
LMCache · open-source · 2026-04-04
Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior...
High signal Matched: inference, serving, decoding, generation, throughput, lmcache, moe, performance, latency, ttft, retrieval-augmented generation, retrieval, agents
Together AI · inference-infra · 2026-04-03
A four-model video suite for generation, continuation, reference-driven workflows, and editing, rolling out on Together AI starting with text-to-video.
High signal Matched: generation, model
LY Corporation Tech Blog · korea · 2026-04-02
Hello. I’m Inoue, and I work on private cloud infrastructure at LY Corporation. What powers LY Corpor...
High signal Matched: generation, introducing, cloud
NVIDIA Technical Blog · hardware · 2026-04-02
In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use...
High signal Matched: inference, latency
Together AI · inference-infra · 2026-04-02
Production STT and TTS from Deepgram, available on Together AI Dedicated Model Inference for real-time voice agents.
High signal Matched: inference, model, agents
Nota AI · korea · 2026-03-31
Jaehoon Lee, Technical Content Manager, Nota AI. In March, a single official announcement from Google Research rocked trillions of won in the market capitalization of U.S. infrastructure and semiconductor stocks. The catalyst:...
High signal Matched: inference, serving, generation, throughput, kv cache, benchmark, performance, cost, b200, blackwell, introducing, model, fp8, research, training, fine-tuning, quantization, quantized, agent, agentic, frontier model
Together AI · inference-infra · 2026-03-31
1.25x over a well-trained static speculator. Aurora is an open-source RL framework that turns speculative decoding from a one-time offline setup into a self-improving system that learns from every request it serves.
High signal Matched: decoding, speculative decoding, open-source
Modal · inference-infra · 2026-03-26
Modal is proud to power real-time inference for Runway Characters.
High signal Matched: inference
Modal · inference-infra · 2026-03-25
How Modal helped the ML team at Doppel parallelize experimentation and scale inference.
High signal Matched: inference
Nota AI · korea · 2026-03-23
Jaehoon Lee, Technical Content Manager, Nota AI. GTC has evolved far beyond a technology conference, drawing attention from global economies and financial markets alike. This year, CEO Jensen Huang took the stage in his tradema...
High signal Matched: inference, prefill, generation, throughput, cuda, kv cache, performance, latency, cost, gpu, npu, launch, model, research, cloud, training, long-context, context window, agent, agents, agentic, open-source
NVIDIA Technical Blog · hardware · 2026-03-23
As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages...
High signal Matched: inference, serving, prefill, model
Nota AI · korea · 2026-03-20
NP Product Team, Nota AI. The role of Edge AI is rapidly expanding. Offline voice assistants now carry on conversations in our daily lives, vehicles infer routes in real time, and smartphones generate images without a network c...
High signal Matched: inference, kv cache, moe, benchmark, performance, latency, cost, model, research, seoul, quantization
Modular · inference-infra · 2026-03-19
Modular 26.2: State-of-the-Art Image Generation and Upgraded AI Coding with Mojo
High signal Matched: generation
Together AI · inference-infra · 2026-03-17
Meet Mamba-3: the SSM built for inference. Faster than Transformers at decode, stronger than Mamba-2, and open-source from day one.
High signal Matched: inference, open-source
NVIDIA Technical Blog · hardware · 2026-03-16
Reasoning models are growing rapidly in size and are increasingly being integrated into agentic AI workflows that interact with other models and external tools....
High signal Matched: inference, multi-node, agentic
NVIDIA Technical Blog · hardware · 2026-03-16
NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of...
High signal Matched: inference, latency, accelerator
Together AI · inference-infra · 2026-03-16
Together AI arrives at NVIDIA GTC 2026 with new launches in inference, agents, voice AI, and open models — plus technical sessions from its research and engineering leaders.
High signal Matched: inference, research, agents
Nota AI · korea · 2026-03-13
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Tairen Piao, AI Research Engineer, Nota AI; Tae-Ho Kim, CTO & Co-Founder, Nota AI. ✔️ Resource: The official quantized model of Solar-Open-100B, which passed the first round of Sout...
High signal Matched: inference, serving, prefill, generation, throughput, moe, router, benchmark, performance, latency, ttft, tpot, blackwell, release, model, weights, open model, research, evaluation, korea, korean, upstage, training, post-training, quantization, quantized, int4, evaluate, benchmarks, mmlu, long-context
BAIR · research · 2026-03-13
Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process mo...
High signal Matched: inference, serving, decoding, performance, cost, model, research, training, evaluate, mmlu, long-context, rag
NVIDIA Technical Blog · hardware · 2026-03-13
The next generation of AI-driven robots like humanoids and autonomous vehicles depends on high-fidelity, physics-aware training data. Without diverse and...
High signal Matched: generation, training
vLLM Project · open-source · 2026-03-13
EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...
High signal Matched: inference, decoding, speculative decoding, eagle, model
SqueezeBits · korea · 2026-03-11
Explore why Physical AI deployment needs synthetic data at scale with Squeezebits' research and discover how to overcome inference bottlenecks to accelerate Roboost Agent.
High signal Matched: inference, research, agent
Together AI · inference-infra · 2026-03-11
NVIDIA Nemotron 3 Super is now available on Together AI Dedicated Inference, delivering efficient multi-agent reasoning, a 1M-token context window, and production-grade deployment on managed infrastructure.
High signal Matched: inference, context window, agent
Together AI · inference-infra · 2026-03-05
At AI Native Conf, Together AI announced breakthroughs across kernels, RL, and inference optimization — including FlashAttention-4, ThunderAgent, and together.compile. Research that ships to production. That's the AI Native Cloud.
High signal Matched: inference, flashattention, research, cloud
Together AI · inference-infra · 2026-03-04
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM...
High signal Matched: inference, serving, prefill, throughput, long-context
AIBrix · open-source · 2026-03-03
🚀 AIBrix v0.6.0 Release. Today we’re excited to announce AIBrix v0.6.0, a release that expands how you deploy and route inference traffic. Key highlights include: Envoy Sidecar Support – Run Envoy alongside the gateway-plugin without...
High signal Matched: inference, prefill, release, model, lora, rerank, api, openai-compatible
vLLM Project · open-source · 2026-02-27
For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.
High signal Matched: inference, performance, rocm
Nota AI · korea · 2026-02-26
Jewon Lee | Wooksu Shin | Seungmin Yang | Ki-Ung Song | Donguk Lim | Jaeyeon Kim | Tae-Ho Kim | Bo-Kyeong Kim (EdgeFM Team, Nota AI). ✔️ Resources for more information: GitHub, ArXiv, Project Page, Demo. ✔️ Accepted at ICLR 2026. &...
High signal Matched: inference, generation, verification, benchmark, performance, latency, cost, model, arxiv, evaluation, training, post-training, benchmarks
Together AI · inference-infra · 2026-02-19
Standard diffusion language models can't use KV caching and need too many refinement steps to be practical. CDLM fixes both with a post-training recipe that enables exact block-wise KV caching and trajectory-consistent step reduction — del...
High signal Matched: inference, latency, training, post-training
Replicate · inference-infra · 2026-02-18
Recraft V4 generates art-directed images — and actual editable SVGs — with strong composition, accurate text rendering, and what the Recraft team calls "design taste." Four models are available on Replicate now.
High signal Matched: generation
Together AI · inference-infra · 2026-02-12
Together AI launches production-grade orchestration for custom AI models with 1.4x–2.6x faster inference.
High signal Matched: inference, introducing
llm-d · open-source · 2026-02-10
llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.
High signal Matched: inference, throughput, kv cache, ttft, gpu
llm-d · open-source · 2026-02-04
llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.
High signal Matched: inference, performance, lora
vLLM Project · open-source · 2026-02-03
Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...
High signal Matched: serving, throughput, performance, h200, gb200, blackwell
Together AI · inference-infra · 2026-02-02
Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower c...
High signal Matched: inference, cost, model, fine-tuning, evaluating, open-source, oss
vLLM Project · open-source · 2026-01-31
Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...
High signal Matched: inference, model, api
Together AI · inference-infra · 2026-01-26
Introducing DSGym, a holistic evaluation and training framework for LLM-based data science agents. Features 90+ bioinformatics tasks, 92 Kaggle competitions, and synthetic trajectory generation. Our 4B model achieves state-of-the-art perform...
High signal Matched: generation, performance, introducing, model, evaluation, training, evaluating, agents, open-source
Together AI · inference-infra · 2026-01-22
Learn how to reduce inference latency without massive cost using proven inference optimization tactics — improving throughput, GPU utilization, and cost efficiency while balancing throughput vs. latency tradeoffs.
High signal Matched: inference, throughput, latency, cost, gpu
Google Research · big-tech · 2026-01-14
Generative AI
High signal Matched: generation
Together AI · inference-infra · 2026-01-13
Together AI teamed with Cursor to build the real-time inference stack that keeps in-editor agents fast and reliable. They productionized NVIDIA Blackwell (B200/GB200), tuning ARM hosts, kernels, and FP4/TensorRT quantization for low latenc...
High signal Matched: inference, latency, b200, gb200, blackwell, model, quantization, agents
vLLM Project · open-source · 2026-01-08
In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...
High signal Matched: inference, throughput, kv cache
SqueezeBits · korea · 2026-01-07
A recap of the Intel® Gaudi® hands-on workshop co-hosted by SqueezeBits and Lablup. AI model compression, fine-tuning, and vLLM serving on Gaudi® hardware with Backend.AI.
High signal Matched: serving, model, fine-tuning
SqueezeBits · korea · 2025-12-24
Introducing ATOM™-Max, Rebellions’ next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrif...
High signal Matched: inference, generation, serve, performance, npu, introducing, rebellions
Nota AI · korea · 2025-12-19
Seungmin Yang, EdgeFM Lead, Nota AI. Summary: With the introduction of NVFP4, a new 4-bit floating point data type in NVIDIA’s Blackwell GPU architecture, LLM inference achieves markedly improved efficiency. Blackwell’s NVFP4...
High signal Matched: inference, serving, decoding, prefill, generation, token generation, throughput, kernel, gemm, cutlass, distributed, benchmark, performance, latency, ttft, tpot, tokens/sec, cost, gpu, blackwell, launch, model, weights, fp8, research, training, post-training, quantization, quantized, awq, benchmarks, mmlu, retrieval
vLLM Project · open-source · 2025-12-17
In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...
High signal Matched: serving, h200
vLLM Project · open-source · 2025-12-15
Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...
High signal Matched: serving, generation, model
vLLM Project · open-source · 2025-12-13
Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...
High signal Matched: serving, prefill, router, performance, model
vLLM Project · open-source · 2025-12-13
Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...
High signal Matched: inference, decoding, speculative decoding, draft model, performance, model, training
SkyPilot · open-source · 2025-12-11
Announcing SkyPilot 0.11 with Pools for batch inference, faster managed jobs, and enterprise-scale improvements.
High signal Matched: inference, cloud
vLLM Project · open-source · 2025-12-09
Achieve faster, more efficient LLM serving without sacrificing accuracy!
High signal Matched: serving, quantization
Together AI · inference-infra · 2025-12-03
AutoJudge accelerates LLM inference by identifying which token mismatches actually matter. Using self-supervised learning to train a lightweight classifier, it accepts up to 40 draft tokens per cycle—delivering 1.5–2× speedups over standar...
High signal Matched: inference, decoding, speculative decoding, introducing
SkyPilot · open-source · 2025-12-02
Scale document OCR batch inference for RAG on multiple clouds and Kubernetes clusters using SkyPilot Pool.
High signal Matched: inference, rag
llm-d · open-source · 2025-12-02
llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.
High signal Matched: decoding, prefix cache, speculative decoding, moe, performance, latency, ttft, tpu, sota
Together AI · inference-infra · 2025-12-01
Together AI achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through GPU optimization, advanced speculative decoding, and FP4 quantization—ranking #1 in speed benchmarks on NVIDIA Blackwell archit...
High signal Matched: inference, decoding, speculative decoding, gpu, blackwell, quantization, benchmarks, open-source
vLLM Project · open-source · 2025-11-30
We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.
High signal Matched: serving, generation, release, model
AIBrix · open-source · 2025-11-26
In recent years, large language models (LLMs) such as GPT, DeepSeek, Doubao and Qwen have advanced rapidly and are reshaping a wide range of industries. As the Scaling Law continues to be validated and pushed to its limits, LLM capabilitie...
High signal Matched: inference, serving, generation, throughput, performance, latency, cost
Together AI · inference-infra · 2025-11-25
Production-grade image generation with multi-reference consistency, exact brand colors, and reliable text rendering. FLUX.2 from Black Forest Labs, now on Together AI's platform.
High signal Matched: generation
Hugging Face · open-source · 2025-11-25
No feed summary available yet.
High signal Matched: inference
vLLM Project · open-source · 2025-11-22
Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...
High signal Matched: serving, multi-node, launch
Modular · inference-infra · 2025-11-20
Modular 25.7: Faster Inference, Safer GPU Programming, and a More Unified Developer Experience
High signal Matched: inference, gpu
Modal · inference-infra · 2025-11-18
Never block the GPU.
High signal Matched: inference, gpu
Modal · inference-infra · 2025-11-13
How Decagon and Modal made real-time voice AI possible, combining fine-tuned small models with a re-engineered inference runtime for sub-second latency.
High signal Matched: inference, latency
AIBrix · open-source · 2025-11-10
🚀 AIBrix v0.5.0 Release. Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for hi...
High signal Matched: prefill, latency, release, evaluation, api, openai-compatible
Together AI · inference-infra · 2025-11-04
Together AI launches the fastest voice AI stack: streaming Whisper STT, serverless open-source TTS (Orpheus & Kokoro), and Voxtral transcription. Sub-second latency for production voice agents.
High signal Matched: inference, latency, agents, open-source
SqueezeBits · korea · 2025-10-31
Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable,...
High signal Matched: inference, generation, latency, model
SqueezeBits · korea · 2025-10-28
Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.
High signal Matched: inference, serving, gemm, performance, h100, training
SkyPilot · open-source · 2025-10-21
AWS Batch works well for traditional enterprise batch processing (see their case studies 1 and 2). But AI workloads have different requirements - they’re more interactive, need flexible GPU access, and benefit from simpler iteration...
High signal Matched: inference, gpu
Together AI · inference-infra · 2025-10-21
Together AI adds 40+ image & video models, including Sora 2 and Veo 3, to build end-to-end multimodal apps with unified OpenAI-compatible APIs and transparent pricing.
High signal Matched: generation, model, openai-compatible
Google Research · big-tech · 2025-10-21
Generative AI
High signal Matched: generation
Together AI · inference-infra · 2025-10-10
LLM inference that gets faster as you use it. Our runtime-learning accelerator adapts continuously to your workload, delivering 500 TPS on DeepSeek-V3.1, a 4x speedup over baseline performance without manual tuning.
High signal Matched: inference, deepseek-v3, performance, accelerator
llm-d · open-source · 2025-10-10
llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.
High signal Matched: inference, latency, tokens/sec, gpu, tpu
Google Research · big-tech · 2025-10-03
Generative AI
High signal Matched: generation
SqueezeBits · korea · 2025-10-02
Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimized inference engine, it delivers reliable image, video, and future LLM services at a fraction of the cost.
High signal Matched: inference, cost, api
llm-d · open-source · 2025-09-24
See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.
High signal Matched: inference, throughput, distributed, benchmarks
Hugging Face · open-source · 2025-09-19
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2025-09-17
No feed summary available yet.
High signal Matched: inference
SqueezeBits · korea · 2025-09-16
The guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output based on your use case.
High signal Matched: decoding, benchmark, performance
Together AI · inference-infra · 2025-09-15
Our new Batch Inference API makes large-scale AI workloads simpler, faster, and cheaper. With a streamlined UI, universal model support, and 3000× higher rate limits—now up to 30B tokens—you can process massive datasets at half the cost of...
High signal Matched: inference, cost, model, api
Google Research · big-tech · 2025-09-12
Generative AI
High signal Matched: inference
Together AI · inference-infra · 2025-09-09
Together AI launches Instant Clusters: self-service GPU clusters with NVIDIA H100/B200, ready in minutes for training or inference at any scale.
High signal Matched: inference, gpu, h100, b200, training
Replicate · inference-infra · 2025-09-08
Cache your compiled models for faster boot and inference times
High signal Matched: inference
llm-d · open-source · 2025-09-03
Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.
High signal Matched: inference, throughput, latency
SqueezeBits · korea · 2025-08-26
This article shows how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.
High signal Matched: inference, prefill, gpu, npu
Together AI · inference-infra · 2025-08-21
Build AI agents for complex, long-running engineering tasks. Learn key patterns from a case study: accelerating LLM inference with speculative decoding.
High signal Matched: inference, decoding, speculative decoding, agents
AIBrix · open-source · 2025-08-05
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration an...
High signal Matched: inference, prefill, generation, token generation, throughput, performance, cost, gpu, release, cloud
Modular · inference-infra · 2025-08-05
Modular Platform 25.5: Introducing Large Scale Batch Inference
High signal Matched: inference, introducing
SqueezeBits · korea · 2025-08-04
Trimming large multilingual vocabularies in Small Language Models (SLMs) is a simple, low-risk way to boost efficiency. It significantly accelerates model inference while keeping accuracy almost unchanged.
High signal Matched: inference, model
Modular · inference-infra · 2025-07-31
SF Compute and Modular Partner to Revolutionize AI Inference Economics
High signal Matched: inference
Hugging Face · open-source · 2025-07-23
No feed summary available yet.
High signal Matched: inference, lora
Together AI · inference-infra · 2025-07-17
Together AI inference is now among the world’s fastest, most capable platforms for running open-source reasoning models like DeepSeek-R1 at scale, thanks to our new inference engine designed for NVIDIA HGX B200.
High signal Matched: inference, b200, blackwell, open-source
Nota AI · korea · 2025-07-10
Marcel Simon, Ph.D., ML Researcher, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI; Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH. Summary: Proposes a simple next-frame prediction task using unlabeled video to enhance sing...
High signal Matched: inference, performance, model, paper, research, training, fine-tuning, benchmarks
Hugging Face · open-source · 2025-07-10
No feed summary available yet.
High signal Matched: inference
Modal · inference-infra · 2025-07-02
There's only one playbook for improving generative applications. Read about it here.
High signal Matched: inference, evals
BAIR · research · 2025-07-01
High signal Matched: inference, generation, performance, model, paper, arxiv, evaluation, training, evaluate, agent, agents
llm-d · open-source · 2025-06-25
Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.
High signal Matched: serving
Hugging Face · open-source · 2025-06-16
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2025-06-12
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2025-06-04
No feed summary available yet.
High signal Matched: generation
llm-d · open-source · 2025-06-03
llm-d hits 1000 GitHub stars! Week 1-2 round-up covers KVTransfer Protocol, InferenceModel API updates, and community resources for LLM inference developers.
High signal Matched: inference, api
Replicate · inference-infra · 2025-05-22
Google's flagship image generation model, Imagen 4, is now available for you to try on Replicate. Create images with fine detail, versatile styles, and improved typography.
High signal Matched: generation, model
AIBrix · open-source · 2025-05-22
AIBrix is a composable, cloud-native AI infrastructure toolkit designed to power scalable and cost-effective large language model (LLM) inference. As production demands for memory-efficient and latency-aware LLM services continue to grow,...
High signal Matched: inference, prefix cache, latency, cost, release, model, cloud
llm-d · open-source · 2025-05-20
Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.
High signal Matched: inference, serving, distributed, performance, introducing, sota
llm-d · open-source · 2025-05-20
Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.
High signal Matched: inference, distributed, release, cloud, open source
Hugging Face · open-source · 2025-05-13
No feed summary available yet.
High signal Matched: inference
Together AI · inference-infra · 2025-05-12
No feed summary available yet.
High signal Matched: decoding, speculative decoding
Nota AI · korea · 2025-05-07
Jewon Lee | Ki-Ung Song | Seungmin Yang | Donguk Lim | Jaeyeon Kim | Wooksu Shin | Bo-Kyeong Kim | Tae-Ho Kim (EdgeFM Team, Nota AI); Yong Jae Lee, Ph.D., Associate Professor, UW-Madison. Summary: Our method, Trimmed-Llama, reduces t...
High signal Matched: inference, generation, kv cache, benchmark, performance, latency, model, weights, research, training, benchmarks, open-source
Together AI · inference-infra · 2025-05-05
No feed summary available yet.
High signal Matched: inference
Modal · inference-infra · 2025-04-24
Modal + Daily + Pipecat is the best-in-class infra stack for real-time inference pipelines.
High signal Matched: inference
Hugging Face · open-source · 2025-04-16
No feed summary available yet.
High signal Matched: prefill, performance
Hugging Face · open-source · 2025-04-16
No feed summary available yet.
High signal Matched: inference
BAIR · research · 2025-04-08
PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment...
High signal Matched: inference, generation, cost, model, weights, research, training, retrieval
Nota AI · korea · 2025-04-08
Seul-Ki Yeom, Ph.D., Research Lead, Nota AI GmbH; Tae-Ho Kim, CTO & Co-Founder, Nota AI. Summary: Delivers real-time AI performance on edge devices such as smartphones, IoT devices, and embedded systems. Introduces a novel "Reus...
High signal Matched: inference, kernel, benchmark, performance, cost, introducing, model, paper, research, benchmarks
SqueezeBits · korea · 2025-04-02
This article discusses inference efficiency when running the FLUX.1 models on Intel Gaudi-2 hardware.
High signal Matched: inference
Hugging Face · open-source · 2025-03-28
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2025-03-21
No feed summary available yet.
High signal Matched: inference
SkyPilot · open-source · 2025-03-20
How to accelerate distributed embedding generation? Use the "forgotten" regions.
High signal Matched: inference, generation, distributed
AIBrix · open-source · 2025-03-10
This blog post shows how to deploy DeepSeek-R1 using AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks through a step-by-step training process. It features 671B total parameters with 37B active parameters, and 128k...
High signal Matched: inference, distributed, benchmark, model, weights, training, context length
Hugging Face · open-source · 2025-03-07
No feed summary available yet.
High signal Matched: inference
Replicate · inference-infra · 2025-03-05
Wan2.1 is the most capable open-source video generation model, producing coherent and high-quality outputs. Learn how to run it in the cloud with a single line of code.
High signal Matched: generation, model, cloud, api, open-source
SkyPilot · open-source · 2025-02-26
DeepSeek R1 showed strong reasoning capability when it was first released. In this blog post, we detail what we learned from using DeepSeek R1 to build a Retrieval-Augmented Generation (RAG) system tailored for legal documents. We choose l...
High signal Matched: generation, research, rag, retrieval-augmented generation, retrieval
Nota AI · korea · 2025-02-25
Hancheol Park, Ph.D., AI Research Engineer, Nota AI; Geonmin Kim, Ph.D., AI Research Engineer, Nota AI; Jaeyeon Kim, AI Research Engineer, Nota AI. Summary: In this study, we propose a method for determining whether given multilingual...
High signal Matched: generation, performance, model, paper, research, training, fine-tuning
Hugging Face · open-source · 2025-02-24
No feed summary available yet.
High signal Matched: inference, decoding
AIBrix · open-source · 2025-02-21
Open-source large language models (LLMs) such as LLaMA, DeepSeek, Qwen, and Mistral have surged in popularity, offering enterprises greater flexibility, cost savings, and control over their AI deployments. These models have empowered organ...
High signal Matched: inference, generation, latency, cost, introducing, model, agents, open-source
AIBrix · open-source · 2025-02-19
We’re excited to announce the v0.2.0 release of AIBrix! Building on feedback from v0.1.0 production adoption and user interest, this release introduces several new features to enhance performance and usability. Extend the vLLM Prefix...
High signal Matched: inference, serving, prefill, throughput, distributed, multi-node, kv cache, prefix cache, performance, cost, gpu, accelerator, release, agent
Hugging Face · open-source · 2025-02-18
No feed summary available yet.
High signal Matched: inference, introducing
Hugging Face · open-source · 2025-02-12
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2025-01-28
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2025-01-27
No feed summary available yet.
High signal Matched: generation
SqueezeBits · korea · 2025-01-20
This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.
High signal Matched: serving
Hugging Face · open-source · 2025-01-16
No feed summary available yet.
High signal Matched: inference, generation, introducing
Hugging Face · open-source · 2024-12-23
No feed summary available yet.
High signal Matched: generation, model
Hugging Face · open-source · 2024-12-18
No feed summary available yet.
High signal Matched: inference, model
SqueezeBits · korea · 2024-12-09
This article provides a comparative analysis of speculative decoding.
High signal Matched: decoding, speculative decoding
Hugging Face · open-source · 2024-12-09
No feed summary available yet.
High signal Matched: generation
SqueezeBits · korea · 2024-12-05
This article provides a comparative analysis of multi-LoRA serving capabilities of vLLM and TensorRT-LLM frameworks.
High signal Matched: serving, lora
Hugging Face · open-source · 2024-11-20
No feed summary available yet.
High signal Matched: decoding, generation, speculative decoding
AIBrix · open-source · 2024-11-13
In recent years, large language models (LLMs) have revolutionized AI applications, powering solutions in areas like chatbots, automated content generation, and advanced recommendation engines. Services like OpenAI’s have gained significant...
High signal Matched: decoding, prefill, generation, kv cache, performance, cost, gpu, release, introducing, cloud, open-source
Hugging Face · open-source · 2024-10-29
No feed summary available yet.
High signal Matched: decoding, generation, model
Hugging Face · open-source · 2024-10-22
No feed summary available yet.
High signal Matched: generation
SqueezeBits · korea · 2024-10-11
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.
High signal Matched: serving
Hugging Face · open-source · 2024-10-08
No feed summary available yet.
High signal Matched: generation
Replicate · inference-infra · 2024-10-03
Black Forest Labs continues to push boundaries with its latest release of the FLUX.1 image generation model.
High signal Matched: generation, release, model
SqueezeBits · korea · 2024-10-01
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM depl...
High signal Matched: serving, throughput, performance, ttft, tpot, evaluation, evaluating
Modal · inference-infra · 2024-09-16
Learn how we used our new dynamic batching feature to improve throughput and reduce inference costs for the Whisper model with a single line of code!
High signal Matched: inference, throughput, model
Nota AI · korea · 2024-08-02
Jaeyeon Kim, Research Engineer, Nota AI; Geonmin Kim, Research Engineer, Nota AI; Hancheol Park, Team Lead of NetsPresso Application, Nota AI. Introduction: Recent large language models (LLMs) have demonstrated unprecedented performance...
High signal Matched: decoding, benchmark, performance, latency, tokens/sec, model, arxiv, research, technical report, evaluation, cloud, training, lora, benchmarks, leaderboard, open-source
Hugging Face · open-source · 2024-07-29
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2024-06-18
No feed summary available yet.
High signal Matched: generation
Replicate · inference-infra · 2024-06-14
Create your own custom version of Stability's latest image generation model and run it on Replicate via the web or API.
High signal Matched: generation, model, api
Hugging Face · open-source · 2024-06-04
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2024-05-29
No feed summary available yet.
High signal Matched: inference, generation
Hugging Face · open-source · 2024-05-16
No feed summary available yet.
High signal Matched: generation, quantization
Hugging Face · open-source · 2024-05-01
No feed summary available yet.
High signal Matched: inference, decoding, speculative decoding
Hugging Face · open-source · 2024-04-29
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2024-04-03
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2024-04-02
No feed summary available yet.
High signal Matched: inference, gpu
Hugging Face · open-source · 2024-02-29
No feed summary available yet.
High signal Matched: generation, accelerator
Modal · inference-infra · 2024-02-21
Find out how Suno uses Modal to scale inference and batch pre-processing to thousands of GPUs.
High signal Matched: inference, launch
SkyPilot · open-source · 2024-02-20
SkyServe: A simple, cost-efficient, multi-region/cloud library for serving GenAI models.
High signal Matched: serving, cost, introducing, cloud
Hugging Face · open-source · 2024-02-01
No feed summary available yet.
High signal Matched: inference, generation
Hugging Face · open-source · 2024-01-30
No feed summary available yet.
High signal Matched: decoding, speculative decoding
Replicate · inference-infra · 2024-01-30
Code Llama 70B is one of the most powerful open-source code generation models. Learn how to run it in the cloud with one line of code.
High signal Matched: generation, cloud, api, open-source
Hugging Face · open-source · 2024-01-15
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2024-01-04
No feed summary available yet.
High signal Matched: generation
SkyPilot · open-source · 2023-12-21
A tutorial for serving Mixtral 8x7B model with SkyPilot and SkyServe.
High signal Matched: serving, mixtral, cost, gpu, model
Hugging Face · open-source · 2023-12-20
No feed summary available yet.
High signal Matched: inference, decoding, speculative decoding
Hugging Face · open-source · 2023-12-05
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2023-12-05
No feed summary available yet.
High signal Matched: inference, lora
Hugging Face · open-source · 2023-11-07
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2023-10-24
No feed summary available yet.
High signal Matched: inference
Replicate · inference-infra · 2023-10-17
In this post we'll explore the basics of retrieval augmented generation by creating an example app that uses bge-large-en for embeddings, ChromaDB for vector store, and mistral-7b-instruct for language model generation.
High signal Matched: generation, model, retrieval augmented generation, retrieval
Hugging Face · open-source · 2023-10-03
No feed summary available yet.
High signal Matched: inference, tpu, cloud
Hugging Face · open-source · 2023-10-02
No feed summary available yet.
High signal Matched: inference, api
Hugging Face · open-source · 2023-09-22
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2023-09-13
No feed summary available yet.
High signal Matched: generation, introducing
Hugging Face · open-source · 2023-09-08
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2023-08-04
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2023-08-01
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2023-07-17
No feed summary available yet.
High signal Matched: generation, open-source
Hugging Face · open-source · 2023-07-04
No feed summary available yet.
High signal Matched: inference
SkyPilot · open-source · 2023-06-29
SkyPilot makes the deployment and development of vLLM easy and fast on clouds.
High signal Matched: serving, cloud
Hugging Face · open-source · 2023-05-31
No feed summary available yet.
High signal Matched: inference, introducing, sagemaker
Hugging Face · open-source · 2023-05-23
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2023-05-11
No feed summary available yet.
High signal Matched: generation, latency
Hugging Face · open-source · 2023-03-28
No feed summary available yet.
High signal Matched: inference, accelerator
Hugging Face · open-source · 2023-03-28
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2023-02-15
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2023-02-15
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2023-01-26
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2023-01-20
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2022-12-14
No feed summary available yet.
High signal Matched: inference, training
Hugging Face · open-source · 2022-11-21
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2022-10-14
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2022-10-12
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2022-09-16
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2022-08-11
No feed summary available yet.
High signal Matched: serving
Hugging Face · open-source · 2022-07-27
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2022-07-25
No feed summary available yet.
High signal Matched: serving
Hugging Face · open-source · 2022-05-10
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2022-03-16
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2022-03-11
No feed summary available yet.
High signal Matched: generation
Hugging Face · open-source · 2022-01-11
No feed summary available yet.
High signal Matched: inference, sagemaker
Hugging Face · open-source · 2021-11-04
No feed summary available yet.
High signal Matched: inference, model
Hugging Face · open-source · 2021-06-03
No feed summary available yet.
High signal Matched: inference, api
Hugging Face · open-source · 2021-04-20
No feed summary available yet.
High signal Matched: inference
Hugging Face · open-source · 2021-02-10
No feed summary available yet.
High signal Matched: generation, retrieval augmented generation, retrieval
Hugging Face · open-source · 2021-01-18
No feed summary available yet.
High signal Matched: inference, api
Hugging Face · open-source · 2020-03-01
No feed summary available yet.
High signal Matched: decoding, generation
Replicate · inference-infra · 2026-02-24
Seedream 5.0 brings multi-step reasoning, example-based editing, and deep domain knowledge to image generation. Here's what you should know.
Watchlist Matched: generation
Replicate · inference-infra · 2025-11-25
FLUX.2 brings professional-grade image generation and editing with unprecedented detail, multi-reference support, and enterprise efficiency.
Watchlist Matched: generation
Replicate · inference-infra · 2025-11-20
Nano Banana Pro brings powerful new capabilities in image generation and editing. Here are the main prompt tricks you should know.
Watchlist Matched: generation
Replicate · inference-infra · 2025-10-16
Google's Veo 3.1 brings powerful new video generation capabilities including reference images, first/last frame control, and enhanced image-to-video. Here's everything you need to know.
Watchlist Matched: generation
Replicate · inference-infra · 2025-07-17
We've partnered with Bria to bring a suite of commercial-grade image generation and editing models to Replicate. Built entirely on licensed data, Bria’s tools are designed for enterprises and developers building safely with visual AI.
Watchlist Matched: generation
Replicate · inference-infra · 2025-05-15
We've partnered with Hugging Face to bring Replicate inference to their platform.
Watchlist Matched: inference
Replicate · inference-infra · 2024-11-21
A new set of image generation capabilities for FLUX models, including inpainting, outpainting, canny edge detection, and depth maps.
Watchlist Matched: generation
Replicate · inference-infra · 2024-08-15
We've added fine-tuning (LoRA) support to FLUX.1 image generation models. You can train FLUX.1 on your own images with one line of code using Replicate's API.
Watchlist Matched: generation, fine-tuning, lora, api
Replicate · inference-infra · 2024-07-12
Data curation, data generation, data data data
Watchlist Matched: generation
Replicate · inference-infra · 2024-06-07
Garden State Llama, applied LLMs guide, real-time image generation
Watchlist Matched: generation
Replicate · inference-infra · 2024-05-31
Faster image generation, AI-powered world simulator, insights on AI dataset complexity
Watchlist Matched: generation