MLSys Radar

SkyPilot

Open-source cloud orchestration project covering GPU cluster provisioning, multi-cloud training, batch jobs, and cost-aware ML infrastructure.

Country: Unknown
Category: open-source
Blog: https://blog.skypilot.co/
Feed: https://blog.skypilot.co/index.xml
Feed discovery status: known

SkyPilot · open-source · 2026-05-22

RL Doesn't Work on Slurm

Score 8

Online reinforcement learning for LLMs breaks Slurm's batch scheduling model. We'll discuss why, and what can be done about it.

model-release

High signal Matched: model

SkyPilot · open-source · 2026-02-27

Don't Run OpenClaw on Your Main Machine

Score 8

OpenClaw gives an AI agent full access to your system. Here's why you should run it on an isolated cloud VM, and how to set that up.

cloud agents

High signal Matched: cloud, agent

SkyPilot · open-source · 2025-08-12

Self-host open-source LLM agent sandbox on your own cloud

Score 10

Your AI writes code. Now what? If you’re building AI agents in 2025, you’ve probably wondered that as well. Your LLM generates some Python code that analyzes data, manipulates files, or calls APIs. But where does it run? Most people eit...

cloud agents open-source

High signal Matched: cloud, agent, agents, open-source

SkyPilot · open-source · 2025-07-16

The Evolution of AI Job Orchestration. Part 2: The AI-Native Control Plane & Orchestration that Finally Works for ML

Score 16

This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll discover how AI-native orchestration...

distributed benchmark hardware cloud

High signal Matched: infiniband, performance, cost, gpu, cloud

SkyPilot · open-source · 2025-03-05

Abusing SQLite to Handle Concurrency

Score 8

SkyPilot uses the venerable SQLite for state management. SQLite can handle millions of QPS and terabytes of data. However, our efforts to scale our Managed Jobs feature ran up against SQLite's one weakness: many concurrent writers. S...

benchmark

High signal Matched: qps

SkyPilot · open-source · 2025-02-26

Using DeepSeek R1 for RAG: Do's and Don'ts

Score 10

DeepSeek R1 showed great reasoning capability when it was first released. In this blog post, we detail what we learned from using DeepSeek R1 to build a Retrieval-Augmented Generation (RAG) system tailored for legal documents. We choose l...

inference research rag

High signal Matched: generation, research, rag, retrieval-augmented generation, retrieval

SkyPilot · open-source · 2026-03-03

SkyPilot Job Groups: Run RL on Heterogeneous Hardware

Score 1

SkyPilot Job Groups let you define heterogeneous RL workloads in a single YAML. Run your PPO trainer on beefy H100s, rollout servers on cheap T4s, and replay buffers on high-memory CPUs, all as one managed job.

Watchlist Matched: none

SkyPilot · open-source · 2026-02-10

Migrating from Slurm to Kubernetes

Score 1

Moving from Slurm to Kubernetes doesn't have to mean losing the workflow you know. Here's how SkyPilot brings Slurm-like simplicity to K8s.

Watchlist Matched: none

SkyPilot · open-source · 2025-09-23

Scaling Vector Search to 1M Documents for $0.85

Score 1

How to build production vector search with RedisVL and SkyPilot: 1M documents indexed for $0.85, sub-100ms queries, no Kubernetes required.

Watchlist Matched: none

SkyPilot · open-source · 2024-07-23

Finetune Llama 3.1 on Your Infra

Score 1

An operational guide to finetuning Llama 3.1, with everything packaged in a simple SkyPilot YAML.

Watchlist Matched: none