Linux on hippotion

🪟 I Built Yet Another Claude Code Session Switcher

Fri, 30 Jan 2026 00:00:00 +0000

The confession first

There are, at last count, a small army of tools that list your Claude Code sessions and let you jump back into one. tmux wrappers (claude-tmux, claunch), keyword resumers (tmux-claude-code), fleet managers (claude-manager), and a whole macOS menu-bar genre (claude-control, cmux, and friends). They’re good. Several are better-engineered than mine.

I built one more anyway.

Not because the others are wrong — because none of them were shaped like my day, and the cost of hand-rolling a 300-line script turned out to be smaller than the cost of bending my workflow around someone else’s defaults. That’s the whole pitch, and it’s a boring one. The interesting part is what I had to understand to build it, because it corrected a mental model I’d had backwards for months.

My day, concretely

I work off a single Linux box over SSH, from a few different machines. A session might be a homelab change, a side project, a blog post. I drop one mid-thought, my laptop sleeps, I pick it up that evening from a different terminal. The thing I kept doing was running claude --resume and squinting at a list of UUIDs trying to remember which 7f3a… was the one about the broken redirect.

I wanted one command — wt — that shows me every session with a human summary and tells me, truthfully, which ones are still alive. Then lets me pick one.

Simple ask. It sent me reading the on-disk format, and that’s where it got educational.

What I had backwards: tmux is not how you keep a session

Every tmux-first tool sells the same promise: run Claude inside tmux so your session survives a disconnect. I’d internalized that as “tmux is how Claude sessions persist.”

That’s wrong, and realizing it deleted half the code I thought I’d need.

A Claude Code session is one claude process, keyed by a sessionId UUID. Its entire transcript — every message, every tool call and result — is appended to a file:

~/.claude/projects//.jsonl

It’s append-only, and it has no “end” marker. When you --resume, Claude reopens that same file and replays it. One of my session files spans three calendar days across half a dozen resumes — same file, same UUID, the whole conversation reconstructed from disk each time.

Which means: the history is durable independent of any running process. You do not need tmux to land exactly where you left off. claude --resume does that from the transcript alone, on a box with no tmux installed at all.
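Because the transcript is plain newline-delimited JSON, reading one back needs nothing beyond the standard library. A minimal sketch, assuming one JSON object per line and that the final line may be half-written while a live process is mid-append; `read_transcript` is a name made up for illustration, not part of Claude Code:

```python
import json
from pathlib import Path


def read_transcript(path):
    """Yield each record of a JSONL transcript, oldest first.

    Lines that fail to parse are skipped rather than raised, since a
    live process may be halfway through appending the last one.
    """
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # assumed: a torn or partial line, not fatal
```

Nothing here needs a running process or tmux, which is the point of the paragraph above.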

So what is tmux for, then? Exactly one thing: keeping a process running while you’re disconnected — a long job, an agent grinding away, or re-attaching the same live process from your phone. That’s real, but it’s the exception, not the default. So in my tool, plain resume is the default and tmux is an opt-in flag. The inversion fell straight out of reading the format honestly.

The other thing the transcript doesn’t tell you: is it alive?

Here’s the subtle bit. The transcript tells you the history of a session. It does not tell you whether a claude process is running right now. There’s no “closed” record — the file for a long-dead session looks identical to one you left open thirty seconds ago.

Liveness lives somewhere else:

~/.claude/sessions/.json   →   { pid, sessionId, cwd, procStart, ... }

A session is alive if that pid is actually running. But you can’t just trust the file’s existence — it can linger after a crash — and you can’t just kill -0 the pid either, because the kernel recycles pids and you might be poking a process that reused the number. So the honest check is two-factor:

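import os
from pathlib import Path

def alive(pid, procstart):
    try:
        os.kill(pid, 0)          # signal 0: probe only, nothing is delivered
    except PermissionError:
        pass                     # exists, but belongs to another user
    except OSError:
        return False             # no such process
    # ...and is it the SAME process, not a pid-recycle?
    try:
        stat = Path(f"/proc/{pid}/stat").read_text()
    except OSError:
        return False             # exited between the two checks
    # comm can contain ")" and spaces, so split after the LAST ")"; [19] is field 22
    starttime = stat[stat.rindex(")") + 2:].split()[19]
    return starttime == str(procstart)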

That /proc/<pid>/stat start-time comparison is the difference between “I think it’s live” and “it’s live.” It’s the kind of detail you only get right by caring about the boring case.
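To see why the snippet above splits after the last `)`: field 2 of the stat line is the command name in parentheses, and that name can itself contain spaces or parentheses. Here is the parsing step on its own, run against a made-up stat line (the process name and numbers are invented for illustration):

```python
def parse_starttime(stat_line):
    """Return field 22 (starttime, in clock ticks since boot) as a string."""
    # Everything up to the LAST ")" is "pid (comm"; fields 3 onward follow it.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    return rest[19]  # field 3 is rest[0], so field 22 is rest[19]


# Hypothetical line: the comm "(my (odd) proc)" would break a naive split().
# Fields 5..21 are filled with their own field numbers to make counting easy.
sample = (
    "4242 (my (odd) proc) S 1 "
    + " ".join(str(n) for n in range(5, 22))
    + " 987654 0 0"
)
```

With this sample, `parse_starttime(sample)` picks out `987654`, while `sample.split()[21]` lands on the wrong field because the command name contributed extra tokens.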

With that, every session resolves to a real state:

● live — a process is running now
⧗ waiting — no process; you left mid-conversation (last line was Claude)
· idle — no process; stale
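Put together, the three states come from two inputs: the liveness check and the last record in the transcript. A hedged sketch; the record shape (a `type` field equal to `"assistant"` when Claude spoke last) is an assumption about the transcript format, and `classify` is an illustrative name:

```python
def classify(is_alive, last_record):
    """Map (liveness, last transcript record) to one of three states."""
    if is_alive:
        return "live"       # a verified process is running now
    # Assumption: assistant turns are records with type == "assistant".
    if last_record is not None and last_record.get("type") == "assistant":
        return "waiting"    # no process, and Claude had the last word
    return "idle"           # no process, nothing pending
```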

And the payoff for getting liveness right: if you try to resume a session that’s still live in another terminal, the tool refuses to double-open it — two processes appending to one transcript is how you corrupt your own history — and offers a clean --fork-session instead.
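That guard is a small branch. A sketch, assuming the session id is passed as the argument to `--resume` and that `--fork-session` is combined with it; `resume_argv` is a hypothetical helper, and the real tool may well prompt before forking:

```python
def resume_argv(session_id, is_alive):
    """Build the claude command line for picking up a session."""
    argv = ["claude", "--resume", session_id]
    if is_alive:
        # Another process is already appending to this transcript:
        # branch into a new session instead of opening it twice.
        argv.append("--fork-session")
    return argv
```

Handing the result to `os.execvp(argv[0], argv)` would replace the Python process with claude rather than leave a wrapper running.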

The summaries were free the whole time

The feature I assumed I’d have to build — a short, human description of each session — I didn’t build at all. Claude Code already writes one. Buried in the transcript is a record type:

{"type": "ai-title", "aiTitle": "Investigate nested o directories", "sessionId": "..."}

Claude titles your sessions for you. The “summary” column in my tool is just that field, with a fallback to your last prompt. The best line of code is the one you delete after noticing the platform already did the work.
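A sketch of that lookup, given the records of one transcript as dicts. The `ai-title` shape is the one shown above; the fallback assumes user prompts are records with `type` equal to `"user"` and a string at `message.content`, which is a guess at the format rather than something confirmed here:

```python
def session_summary(records, default="(untitled)"):
    """Return the newest AI title, else the last user prompt, else default."""
    title = None
    last_prompt = None
    for rec in records:
        if rec.get("type") == "ai-title" and rec.get("aiTitle"):
            title = rec["aiTitle"]          # later titles override earlier ones
        elif rec.get("type") == "user":
            # Assumed shape: {"type": "user", "message": {"content": "..."}}
            msg = rec.get("message")
            content = msg.get("content") if isinstance(msg, dict) else None
            if isinstance(content, str) and content.strip():
                last_prompt = content.strip()
    return title or last_prompt or default
```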

So what did I actually build

Not much, and that’s the point. wt is one Python file, standard library only, no daemon. It globs the transcripts, reads each one’s title and last-activity, joins that against the pid-verified live registry, sorts live-first, and prints a numbered list. Pick a number and it execs into claude --resume. There’s a -t for tmux when I genuinely need it, a d to archive old sessions (a file move, fully reversible), and a guarded hook that turns it into an SSH login greeting so the box tells me what’s on it the moment I land.

  watchtower · 5 session(s)
   1) ●  live       16s  homelab    595e931d  Investigate nested o directories
   2) ·  idle     1d07h  notes-app  6565b121  Migrate to server components
  [#]resume  [t#]tmux  [d#]archive  [n]ew  [Enter]shell  [q]uit ▸
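The ordering and the age column in that listing can be sketched in a few lines. Only the `16s` and `1d07h` forms appear in the output above; the minute and hour forms here are guesses, and the tuple layout is invented for the example:

```python
def fmt_age(seconds):
    """Compact age string, e.g. 16 seconds -> '16s', 31 hours -> '1d07h'."""
    seconds = int(seconds)
    if seconds < 60:
        return f"{seconds}s"
    minutes = seconds // 60
    if minutes < 60:
        return f"{minutes}m"
    hours, minutes = divmod(minutes, 60)
    if hours < 24:
        return f"{hours}h{minutes:02d}m"
    days, hours = divmod(hours, 24)
    return f"{days}d{hours:02d}h"


def sort_sessions(sessions):
    """Live sessions first, then most recently active first.

    Each session is an (is_alive, age_seconds, label) tuple.
    """
    return sorted(sessions, key=lambda s: (not s[0], s[1]))
```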

If you want it, it’s on GitHub, MIT. But honestly, I’d rather you take the three things I had to learn than the tool:

Your Claude history lives in a plain append-only JSONL on disk, not in tmux. --resume works without any wrapper. Back up ~/.claude/projects/ and you’ve backed up every conversation you’ve had.
Liveness is a separate fact from history, and checking it honestly means verifying the pid is the same process — not just that something answers to the number.
The platform probably already did the boring work (here: the titles). Read the format before you write the feature.

The flooded-market thing turns out not to matter. A tool that fits your own hands is worth building even when fifty others exist — especially when it’s small enough that “build” and “understand the system underneath” are the same afternoon.

🚨 Don't Restart the Node. Quarantine It First.

Fri, 01 Aug 2025 00:00:00 +0000

The reflex

Something’s wrong. A GitLab runner stops picking up jobs. An event processor starts dropping messages. A pod restarts in a loop. The node looks healthy — CPU fine, memory fine — but something is clearly off.

The reflex: restart the node, see if it clears.

Sometimes it does clear, and you move on. But you didn’t fix anything. You reset the state and crossed your fingers. If it happens again in two weeks, you’ll do the same thing. After enough iterations you have a “flaky node” that everyone reboots periodically and nobody understands.

There’s a better sequence. It takes twenty minutes instead of two, and you come out with either a real fix or actual knowledge of what happened.

Step one: quarantine, don’t kill

Before you touch anything, take the node out of rotation without destroying its current state.

kubectl cordon <node>

Cordon marks the node as unschedulable. No new pods land on it. Existing pods keep running. If you need the workloads somewhere else immediately:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

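Now you’ve removed the node from production traffic without rebooting. The node is still alive. Everything that happened on it at the node level is still there: system logs, the kernel ring buffer, disk state. With a cordon alone, the pods’ own open files, running processes, and memory state are intact too. A drain evicts those pods, and --delete-emptydir-data discards their emptyDir volumes, so stick to the cordon when the pod itself is the evidence.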

This is the difference. A reboot wipes that. A cordon preserves it.

Step two: look at what’s actually there

SSH in. Don’t grep for anything specific yet — do a pass for anything unusual.

Kernel messages first. The kernel will often tell you exactly what went wrong before any application did.

dmesg -T --level=err,warn | tail -50

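OOM kills show up here. Disk errors show up here. If you’ve got any of those, you have your answer before you’ve even looked at application logs. CPU soft lockups are the exception: the kernel logs them at emerg, above this filter, so widen it to --level=emerg,alert,crit,err,warn when you suspect a hang.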

Check for filesystem problems.

df -h          # is anything full?
dmesg | grep -i "ext4\|xfs\|btrfs\|i/o error\|ata"

A filesystem at 100% is silent until it isn’t. A flaky drive starts dropping I/O errors into dmesg long before SMART reports anything. Application developers rarely think about this case — their app just starts writing logs that say “failed to write” without specifying that the disk is full or dying.
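The same check can be scripted, and it is worth looking at inodes as well as bytes, since a filesystem can run out of either. A minimal sketch using `os.statvfs`; the function names and the 90% threshold are arbitrary choices for illustration:

```python
import os


def fs_usage(path):
    """Return (space_used_pct, inode_used_pct) for the filesystem at path."""
    st = os.statvfs(path)
    # Same ratio df uses: used / (used + available to unprivileged users).
    used = st.f_blocks - st.f_bfree
    denom = used + st.f_bavail
    space_pct = 100.0 * used / denom if denom else 0.0
    # Some filesystems report 0 total inodes; treat that as "not applicable".
    inodes_used = st.f_files - st.f_ffree
    inode_pct = 100.0 * inodes_used / st.f_files if st.f_files else 0.0
    return space_pct, inode_pct


def is_nearly_full(path, threshold=90.0):
    """True if either space or inodes are at or above the threshold."""
    return max(fs_usage(path)) >= threshold
```

Pointing `is_nearly_full` at a tmpfs mount path covers the memory-backed case too, since tmpfs reports its size limit through the same call.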

System resource pressure.

vmstat 1 5          # is there swap activity?
iostat -x 1 5       # is a disk saturated?
cat /proc/pressure/io   # kernel PSI — pressure stall info

PSI is underused. It tells you whether processes were actually stalled waiting for I/O, not just whether throughput was high. A disk at 80% utilisation might be fine; an I/O PSI “some” avg10 of 40 means at least one task was stalled on I/O for 40% of the last ten seconds, and that is actively hurting performance.
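The PSI file is two short lines, one starting with `some` (at least one task stalled) and one with `full` (all non-idle tasks stalled), each followed by `avg10`, `avg60`, `avg300` percentages and a cumulative `total` in microseconds. A small parser, as a sketch; the function names are made up, and the file only exists on kernels built with PSI enabled:

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]        # kind is "some" or "full"
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_io_pressure(path="/proc/pressure/io"):
    """Read and parse I/O pressure from the running kernel."""
    with open(path, encoding="ascii") as fh:
        return parse_psi(fh.read())
```

Alerting on `parse_psi(...)["some"]["avg10"]` crossing a threshold is one way to turn the number into a signal; the right threshold depends on the workload.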

What were the pods doing right before things went sideways?

kubectl describe node <node>     # events section at the bottom
kubectl get events -A --field-selector involvedObject.kind=Pod --sort-by=.metadata.creationTimestamp

Look for OOMKilled exits, failed liveness probes, and throttling events. Kubernetes events expire after an hour by default (the API server’s --event-ttl), which is another reason to look now rather than after a reboot: the events are still there, but not for long.

A real example: the GitLab runner

A GitLab runner pod stops picking up jobs. It looks alive — the process is running, no crashes in the pod logs. Jobs sit in the queue.

Restart reflex: delete the pod, let it reschedule, it picks up jobs again.

But why did it stop?

journalctl -u gitlab-runner --since "1 hour ago"
# or, if it's a container (add --previous if it has already restarted):
kubectl logs <pod>

In one instance: the runner’s working directory was on a tmpfs that hit its size limit. The runner silently failed to create job workspaces and stopped accepting new jobs. The error was one line in the pod logs: mkdir /builds: no space left on device. The pod was healthy by every other metric.

Fix: bump the tmpfs size limit in the runner config. The restart would have cleared tmpfs temporarily, and the runner would have failed again the next time a large job filled it up.

The debug took five minutes. The permanent fix took two minutes. Without quarantining the node first, the evidence would have been gone.

Another one: the event consumer

An event processor starts falling behind. Messages queue up. The pod shows no errors. Memory looks fine.

This one was subtler: the processor was connected to a downstream dependency over a persistent TCP connection. The connection had gone into a half-open state — the processor thought it was alive, the remote end had already dropped it. New messages were being sent into a dead socket and silently discarded.

ss -tnp | grep     # look at the socket state

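CLOSE_WAIT on a connection that should be ESTABLISHED. CLOSE_WAIT means the remote end has sent its FIN and the kernel is waiting for the local application to close the socket, which this one never did. The application wasn’t checking whether the connection was actually working before using it, just whether it existed.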
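The same socket states are readable without ss, from /proc/net/tcp, where the `st` column holds a hex state code (01 is ESTABLISHED, 08 is CLOSE_WAIT). A rough sketch that lists IPv4 sockets stuck in CLOSE_WAIT; it assumes a little-endian host for address decoding, ignores /proc/net/tcp6, and does not map sockets back to processes the way `ss -p` does:

```python
import socket

CLOSE_WAIT = 0x08  # TCP state code used in /proc/net/tcp


def _decode_ipv4(hex_addr):
    """'0100007F:1F90' -> ('127.0.0.1', 8080); assumes a little-endian host."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
    return ip, int(port_hex, 16)


def close_wait_sockets(proc_net_tcp_text):
    """Return (local, remote) address pairs for sockets in CLOSE_WAIT."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:   # skip the header row
        cols = line.split()
        if len(cols) < 4:
            continue
        if int(cols[3], 16) == CLOSE_WAIT:            # cols[3] is the "st" column
            found.append((_decode_ipv4(cols[1]), _decode_ipv4(cols[2])))
    return found
```

Run inside the pod's network namespace, `close_wait_sockets(open("/proc/net/tcp").read())` gives a list that should normally be empty or short-lived; entries that persist point at an application that is not closing its side.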

Restart would have cleared the socket state, fixed the symptom, and left the bug in the code.

What to look for — a short checklist

When a node is misbehaving, in order:

dmesg -T --level=err,warn — kernel errors, OOM kills, disk errors
df -h && df -i — full filesystems (space and inodes separately)
kubectl describe node — pressure conditions, recent events
kubectl logs --previous — what the pod logged before it died or got stuck
ss -tnp — socket states for network-adjacent issues
vmstat 1 5 + iostat -x 1 5 — resource pressure
journalctl -p err -b — system journal errors since last boot

Most problems show up in the first three.

After you’ve found something (or not found something)

If you found the cause: fix it, test it, uncordon the node.

kubectl uncordon <node>

Document what you found — a comment in the relevant config, a commit message, a note. “Fixed runner tmpfs limit” in the commit history is more useful than “flaky runner, restarted.”

If you genuinely found nothing: that’s information too. Cordon, reboot, uncordon, and note that the node rebooted clean with no identified cause. If it happens again, you have a pattern. Check whether anything changed in the workloads around that time. Check whether the reboot timing correlates with anything — cron jobs, backups, maintenance windows.

A reboot you can explain is a fix. A reboot you can’t explain is a time bomb.

Why this matters on a single-node cluster

In a multi-node setup you can afford to be lazier — cordon, drain, reboot, let the scheduler handle it, look at it later. On a single node there’s no “later.” The node coming back is all you’ve got.

But the habit is worth building regardless of node count. The engineers who understand their systems are the ones who looked before they rebooted.

The actual rule

Quarantine first. Debug second. Restart third (if you still need to).

A restart takes two minutes. The evidence it destroys might take two hours to reconstruct — or might be gone for good. The cordon costs you nothing.