How we built checkpointing

How we built checkpointing

And why we picked GPT-5 to do the hardest bits

We recently added checkpointing to Conductor: in one click, you can reset your files, git, and chat to a prior state.

Other dev tools like Claude Code and Cursor have their own checkpointing features, but they're leaky.

Suppose you ask Claude Code or Cursor to write you a new feature. It installs a package, generates a database migration, writes some code, and stages its changes. Then you want to reset to before all that happened.

You’ll see something like this:

You may notice:

  • lockfile hasn’t been reset
  • migration hasn’t been reset
  • changes in main.ts have been reverted, but the old changes are still staged

This is because these checkpointing systems work by capturing the contents of a file right before the AI uses its file editing tool. That means it can’t capture changes that the agent makes through other means, like through the linter, package manager, code generator, etc.

Claude Code’s warning to users.

This might seem like a small problem, but we think it's rather big.

It's like leaving a small gap in the fence at Jurassic Park: Conductor's job is to give you a contained environment for AI agents to run wild. A button that resets 90% of the AI's work is very different from a button that resets 100% of the AI's work.

Exploring

One simple thing we could do is create a commit for each checkpoint. But that would mess with your git history. We wanted it to feel like time travel, rewinding your chat history, files and git state as if by magic.

Here are a few other ideas we had that didn't work out:

  1. Commit to a private ref.
    1. Unfortunately, this misses uncommitted and untracked changes. We want to restore your untracked files, staged and unstaged changes, and commit history exactly as they were.
  2. Store checkpoints using git stash
    1. git stash create lets us stash without modifying files on disk, but it doesn’t let us include untracked files in the stash
    2. git stash -u && git stash apply would let us include untracked files in the stash, but it would modify files on the user’s local state temporarily each turn
  3. Save the entire state ourselves in our sqlite db
    1. This is a big undertaking. We’d have to rebuild functionality like git diff on our own.

GPT-5 thinks

At this point, we understood the system requirements well enough to write up an exact spec:

Spec

Requirements:

  • Reversion: Capture an entire state (HEAD + untracked/unstaged/staged changes) and restore it later (basically like git reset --hard, but also restoring untracked/unstaged/staged files)
  • Turn-by-turn diff: Show a diff between two checkpoints (including all files - untracked, unstaged, staged, and committed), or between a checkpoint and the current state
  • No disruption to user: In the process of storing a checkpoint, no files on disk should be modified, nor should there be any obvious public-facing changes in git (e.g., HEAD change)

The API would look like:

  • checkpointId = capture()
  • revert(checkpointId)
  • diff(id1, id2) or diff(id1, 'current')

I threw three things into GPT-5:

  1. The above spec
  2. A description of what we tried that didn't work (as above)
  3. A short instruction: "Assess our options here. Tell me which direction you think we should go. Don't go deep into implementation details yet, but for any ideas you like, for each requirement, tell me in 1-2 sentences how that idea will meet that requirement. be careful not to miss any requirements"

GPT-5 felt it could satisfy all the requirements using hidden git refs. Hooray! But I didn't want to count any chickens until I had something I could test.

GPT-5 builds

For building isolated subsystems, we've found that GPT-5 on its own often does a better job than coding agents like Claude Code or Codex. There's something about the coding agent environment that's distracting to the models when they're trying to build out something new.

(I also get the sense GPT is a bit better at this kind of detail-oriented work than Claude is. For the same task today, I'd probably use GPT-5.1-Codex-Max.)

I made two more requests to GPT-5:

can you sketch, for each function in the API, what the implementation would look like? ideally it'd be so that I can play around with it in my CLI by running the git commands directly, just to verify that it will work as intended
can you build a simple CLI tool called checkpointer with `checkpointer save`, `checkpointer restore`, and `checkpointer diff` options?

And that was it! I tested checkpointer manually, then generated a test suite for it. Everything looked good. I don't think there were any major changes after that initial version, just a few tweaks to the API. GPT-5 virtually one-shotted the implementation.

Here’s the final checkpointer.sh script we're using today:

checkpointer.sh
GitHub Gist: instantly share code, notes, and snippets.

How it works

Our solution

  • Hooks into the agent’s lifecycle to run at the start and end of each turn
  • Captures three pieces of state: the current commit, the index (staged changes), and the worktree (all files including untracked).
  • Converts the index and worktree to tree objects using git write-tree—for the worktree, we write to a temporary index via GIT_INDEX_FILE first.
  • Bundles all three SHA-1s into a commit message and stores the commit as a private ref at .git/refs/conductor-checkpoints/<id>.

One downside of this approach is that if you have two agents running at the same time in the same workspace, you can’t unwind their changes separately; they’ll be all mixed together.

That hasn’t been much of a problem for us, because we built Conductor to make it easy to create isolated workspaces where agents can run independently—and because we tend to use subagents to coordinate multiple agents working on the same task.

Here's the checkpointer in action:

0:00
/0:06

Works like a charm!

🚂
Conductor lets you run a team of coding agents on your Mac. If you enjoy solving problems like this one, reach out—we’re hiring!