SkyPilot · open-source · 2026-05-22
RL Doesn't Work on Slurm
Online reinforcement learning for LLMs breaks Slurm's batch scheduling model. We'll discuss why, and what can be done about it.
High signal Matched: model
Open-source cloud orchestration project covering GPU cluster provisioning, multi-cloud training, batch jobs, and cost-aware ML infrastructure.
SkyPilot · open-source · 2026-05-01
We ran hundreds of benchmarks to tune storage systems for distributed training so you don’t have to.
High signal Matched: distributed, training, distributed training, benchmarks
SkyPilot · open-source · 2026-04-22
Introducing GPU Compass: One dashboard to browse, compare pricing, and launch across every GPU cloud.
High signal Matched: gpu, introducing, launch, cloud
SkyPilot · open-source · 2026-04-10
With the SkyPilot Agent Skill, your AI coding agent can launch clusters, run training jobs and manage cloud resources across any infrastructure using natural language.
High signal Matched: launch, cloud, training, agent, agents
SkyPilot · open-source · 2026-04-09
Coding agents working from code alone generate shallow hypotheses. Adding a research phase — arxiv papers, competing forks, other backends — produced 5 kernel fusions that made llama.cpp CPU inference 15% faster.
High signal Matched: inference, kernel, arxiv, research, agent, agents
SkyPilot · open-source · 2026-03-19
Karpathy's autoresearch runs one experiment at a time. We gave it access to our GPU infra and let it run experiments in parallel.
High signal Matched: gpu, agent
SkyPilot · open-source · 2026-03-11
SkyPilot Recipes let you store SkyPilot YAMLs in a shared, team-accessible registry. Launch workloads directly from the CLI without local files.
High signal Matched: launch
SkyPilot · open-source · 2026-02-27
OpenClaw gives an AI agent full access to your system. Here's why you should run it on an isolated cloud VM, and how to set that up.
High signal Matched: cloud, agent
SkyPilot · open-source · 2026-02-21
SkyPilot Admin Policies let you enforce cost controls, security rules, and compliance requirements automatically — without slowing down your engineering team.
High signal Matched: cost, gpu
SkyPilot · open-source · 2026-01-13
Run Meta's SAM3 on large video archives distributed across AWS and Kubernetes clusters with SkyPilot Pools.
High signal Matched: distributed
SkyPilot · open-source · 2025-12-19
SkyPilot now includes predefined templates to launch clusters with popular frameworks and patterns. Deploy fully configured environments without writing long YAMLs.
High signal Matched: launch
SkyPilot · open-source · 2025-12-11
Announcing SkyPilot 0.11 with Pools for batch inference, faster managed jobs, and enterprise-scale improvements.
High signal Matched: inference, cloud
SkyPilot · open-source · 2025-12-02
Scale document OCR batch inference for RAG on multiple clouds and Kubernetes clusters using SkyPilot Pool.
High signal Matched: inference, rag
SkyPilot · open-source · 2025-10-21
AWS Batch works well for traditional enterprise batch processing (see their case studies). But AI workloads have different requirements - they’re more interactive, need flexible GPU access, and benefit from simpler iteration...
High signal Matched: inference, gpu
SkyPilot · open-source · 2025-09-12
SkyPilot now supports detailed GPU metrics across multiple Kubernetes clusters in the dashboard for better observability.
High signal Matched: gpu
SkyPilot · open-source · 2025-09-04
How we transformed our fragmented multi-cloud AI infrastructure into a unified system with SkyPilot, achieving 10x faster development cycles.
High signal Matched: cloud
SkyPilot · open-source · 2025-08-21
Avataar's enterprise AI content platform cut costs 11x and unlocked GPU capacity by migrating from an inflexible Slurm deployment to SkyPilot's multi-cloud infrastructure.
High signal Matched: gpu, cloud
SkyPilot · open-source · 2025-08-12
Your AI writes code. Now what? If you’re building AI agents in 2025, you probably wondered that as well. Your LLM generates some Python code that analyzes data, manipulates files, or calls APIs. But where does it run? Most people eit...
High signal Matched: cloud, agent, agents, open-source
SkyPilot · open-source · 2025-07-30
There are a lot of discussions happening in AI infrastructure right now. On one side, we have researchers who trained on Slurm in grad school, comfortable with sbatch train_model.sh and the predictability of academic HPC clusters. On the o...
High signal Matched: model, cloud
SkyPilot · open-source · 2025-07-24
Announcing SkyPilot 0.10 - the largest release yet with enterprise-grade features.
High signal Matched: release
SkyPilot · open-source · 2025-07-16
This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...
High signal Matched: infiniband, performance, cost, gpu, cloud
SkyPilot · open-source · 2025-07-08
If you’re an infrastructure or MLOps engineer at a large company, you know the drill. The ML team comes to you with requirements that change weekly. They need GPUs yesterday, but the budget was set six months ago. They want to use th...
High signal Matched: cost, gpu
SkyPilot · open-source · 2025-07-02
Configure high-performance networking on different cloud providers and managed infrastructure with SkyPilot's unified network tier abstraction.
High signal Matched: performance, cloud
SkyPilot · open-source · 2025-04-08
Techniques to speed up checkpointing by 9.6x and how to easily achieve them in SkyPilot
High signal Matched: performance, model, cloud, checkpointing
SkyPilot · open-source · 2025-03-20
How to accelerate distributed embedding generation? Use the "forgotten" regions.
High signal Matched: inference, generation, distributed
SkyPilot · open-source · 2025-03-11
Transforming SkyPilot into a scalable, multi-user platform.
High signal Matched: introducing
SkyPilot · open-source · 2025-03-05
SkyPilot uses the venerable SQLite for state management. SQLite can handle millions of QPS, and terabytes of data. However, our efforts to scale our Managed Jobs feature ran up against the one downfall of SQLite: many concurrent writers. S...
High signal Matched: qps
SkyPilot · open-source · 2025-02-26
DeepSeek R1 showed great reasoning capability when it was first released. In this blog post, we detail what we learned from using DeepSeek R1 to build a Retrieval-Augmented Generation (RAG) system tailored for legal documents. We choose l...
High signal Matched: generation, research, rag, retrieval-augmented generation, retrieval
SkyPilot · open-source · 2024-11-01
For AI teams: How do you efficiently spend $1M+ cloud credits across 3+ clouds?
High signal Matched: cloud
SkyPilot · open-source · 2024-09-16
With last week’s Pixtral release, multimodal large language models (LLMs) like OpenAI’s GPT-4o, Google’s Gemini Pro, and Pixtral are making significant strides. These models are not only able to generate text from images...
High signal Matched: release
SkyPilot · open-source · 2024-07-11
Develop, Train and Serve AI on Kubernetes with SkyPilot.
High signal Matched: serve
SkyPilot · open-source · 2024-02-20
SkyServe: A simple, cost-efficient, multi-region/cloud library for serving GenAI models.
High signal Matched: serving, cost, introducing, cloud
SkyPilot · open-source · 2023-12-21
A tutorial for serving the Mixtral 8x7B model with SkyPilot and SkyServe.
High signal Matched: serving, mixtral, cost, gpu, model
SkyPilot · open-source · 2023-09-27
Covariant runs AI on the cloud using SkyPilot, delivering models 4x faster and cost-effectively.
High signal Matched: cost, cloud
SkyPilot · open-source · 2023-08-02
An operational guide on finetuning Llama 2, ready for commercial use.
High signal Matched: cloud, finetuning
SkyPilot · open-source · 2023-06-29
SkyPilot makes the deployment and development of vLLM easy and fast on clouds.
High signal Matched: serving, cloud
SkyPilot · open-source · 2023-05-30
Announcing SkyPilot 0.3: LLM support, new clouds, and enhanced production readiness.
High signal Matched: gpu
SkyPilot · open-source · 2023-05-02
Experience report from Salk Institute on how biologists use SkyPilot to conduct research on the cloud.
High signal Matched: research, cloud
SkyPilot · open-source · 2023-03-20
Want to host your own LLM Chatbot on any cloud of your choosing?
High signal Matched: cloud
SkyPilot · open-source · 2022-11-16
Introducing SkyPilot.
High signal Matched: cost, introducing, cloud
SkyPilot · open-source · 2026-03-03
SkyPilot Job Groups let you define heterogeneous RL workloads in a single YAML. Run your PPO trainer on beefy H100s, rollout servers on cheap T4s, and replay buffers on high-memory CPUs, all as one managed job.
Watchlist Matched: none
SkyPilot · open-source · 2026-02-10
Moving from Slurm to Kubernetes doesn't have to mean losing the workflow you know. Here's how SkyPilot brings Slurm-like simplicity to K8s.
Watchlist Matched: none
SkyPilot · open-source · 2026-01-22
Mount Kubernetes PVCs to your clusters for 10-100x faster data access with persistent storage that survives across job lifecycles.
Watchlist Matched: none
SkyPilot · open-source · 2025-12-17
Train a tool-calling agent with VeRL and use SkyPilot to scale it up with an independent RL trainer and environment rollouts.
Watchlist Matched: agent
SkyPilot · open-source · 2025-10-14
Want to train an AI agent with RL that can solve math problems or write code? This tutorial walks you through building your own math and coding agents, with step-by-step examples and plenty of screenshots to help you along the way. We use...
Watchlist Matched: training, post-training, agent, agents
SkyPilot · open-source · 2025-09-23
How to build production vector search with RedisVL and SkyPilot: 1M documents indexed for $0.85, sub-100ms queries, no Kubernetes required.
Watchlist Matched: none
SkyPilot · open-source · 2025-02-11
SkyPilot brings image-to-image and text-to-image search down from 120 hours to 1 hour, and from $$$ to $.
Watchlist Matched: none
SkyPilot · open-source · 2024-11-14
Announcing SkyPilot 0.7.
Watchlist Matched: none
SkyPilot · open-source · 2024-07-23
Operational guide to finetune Llama 3.1, with everything packaged in a simple SkyPilot YAML.
Watchlist Matched: none
SkyPilot · open-source · 2024-06-04
Announcing SkyPilot 0.6.
Watchlist Matched: api