<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Linux on hippotion</title><link>https://blog.hippotion.com/tags/linux/</link><description>Recent content in Linux on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 30 Jan 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/linux/index.xml" rel="self" type="application/rss+xml"/><item><title>🪟 I Built Yet Another Claude Code Session Switcher</title><link>https://blog.hippotion.com/posts/hand-rolled-claude-session-switcher/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/hand-rolled-claude-session-switcher/</guid><description>The web is flooded with Claude Code session managers. I built one more anyway — and the part worth sharing isn&amp;rsquo;t the tool, it&amp;rsquo;s what I had to learn about where Claude actually keeps your sessions.</description><content:encoded><![CDATA[<h2 id="the-confession-first">The confession first</h2>
<p>There is, at last count, a small army of tools that list your Claude Code
sessions and let you jump back into one. tmux wrappers
(<a href="https://github.com/nielsgroen/claude-tmux">claude-tmux</a>,
<a href="https://github.com/0xkaz/claunch">claunch</a>), keyword resumers
(<a href="https://github.com/MaxGhenis/tmux-claude-code">tmux-claude-code</a>), fleet
managers (<a href="https://github.com/raphaelbgr/claude-manager">claude-manager</a>), and a
whole macOS menu-bar genre (<a href="https://github.com/sverrirsig/claude-control">claude-control</a>,
cmux, and friends). They&rsquo;re good. Several are better-engineered than mine.</p>
<p>I built one more anyway.</p>
<p>Not because the others are wrong — because none of them were shaped like <em>my</em>
day, and the cost of hand-rolling a 300-line script turned out to be smaller than
the cost of bending my workflow around someone else&rsquo;s defaults. That&rsquo;s the whole
pitch, and it&rsquo;s a boring one. The interesting part is what I had to understand to
build it, because it corrected a mental model I&rsquo;d had backwards for months.</p>
<h2 id="my-day-concretely">My day, concretely</h2>
<p>I work off a single Linux box over SSH, from a few different machines. A session
might be a homelab change, a side project, a blog post. I drop one mid-thought,
my laptop sleeps, I pick it up that evening from a different terminal. The thing I
kept doing was running <code>claude --resume</code> and squinting at a list of UUIDs trying
to remember which <code>7f3a…</code> was the one about the broken redirect.</p>
<p>I wanted one command — <code>wt</code> — that shows me every session with a human summary
and tells me, truthfully, which ones are still alive. Then lets me pick one.</p>
<p>Simple ask. It sent me reading the on-disk format, and that&rsquo;s where it got
educational.</p>
<h2 id="what-i-had-backwards-tmux-is-not-how-you-keep-a-session">What I had backwards: tmux is not how you keep a session</h2>
<p>Every tmux-first tool sells the same promise: run Claude <em>inside</em> tmux so your
session survives a disconnect. I&rsquo;d internalized that as <strong>&ldquo;tmux is how Claude
sessions persist.&rdquo;</strong></p>
<p>That&rsquo;s wrong, and realizing it deleted half the code I thought I&rsquo;d need.</p>
<p>A Claude Code session is one <code>claude</code> process, keyed by a <code>sessionId</code> UUID. Its
entire transcript — every message, every tool call and result — is appended to a
file:</p>
<pre tabindex="0"><code>~/.claude/projects/&lt;cwd-slug&gt;/&lt;sessionId&gt;.jsonl
</code></pre><p>It&rsquo;s append-only, and it has <strong>no &ldquo;end&rdquo; marker</strong>. When you <code>--resume</code>, Claude
reopens that same file and replays it. One of my session files spans three
calendar days across half a dozen resumes — same file, same UUID, the whole
conversation reconstructed from disk each time.</p>
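<p>A minimal sketch of reading that layout, assuming only what is described above: one <code>.jsonl</code> file per session under <code>~/.claude/projects/</code>, one JSON object per line. The function names and the decision to skip malformed lines are illustrative choices, not something Claude Code documents.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json
from pathlib import Path

PROJECTS = Path.home() / ".claude" / "projects"   # layout as described above


def iter_transcripts(root=PROJECTS):
    """Yield (session_id, path) for every transcript file under root."""
    for path in sorted(Path(root).glob("*/*.jsonl")):
        yield path.stem, path          # the file name is the sessionId


def read_records(path):
    """Return the parsed records of one transcript, skipping bad lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue               # e.g. a half-written final line
    return records
</code></pre></div>
<p>Because the file is append-only, reading it this way never needs a running <code>claude</code> process.</p>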
<p>Which means: <strong>the history is durable independent of any running process.</strong> You
do not need tmux to land exactly where you left off. <code>claude --resume &lt;id&gt;</code> does
that from the transcript alone, on a box with no tmux installed at all.</p>
<p>So what <em>is</em> tmux for, then? Exactly one thing: keeping a process <em>running</em> while
you&rsquo;re disconnected — a long job, an agent grinding away, or re-attaching the
<em>same live process</em> from your phone. That&rsquo;s real, but it&rsquo;s the exception, not the
default. So in my tool, plain resume is the default and tmux is an opt-in flag.
The inversion fell straight out of reading the format honestly.</p>
<h2 id="the-other-thing-the-transcript-doesnt-tell-you-is-it-alive">The other thing the transcript doesn&rsquo;t tell you: is it alive?</h2>
<p>Here&rsquo;s the subtle bit. The transcript tells you the <em>history</em> of a session. It
does <strong>not</strong> tell you whether a <code>claude</code> process is running right now. There&rsquo;s no
&ldquo;closed&rdquo; record — the file for a long-dead session looks identical to one you
left open thirty seconds ago.</p>
<p>Liveness lives somewhere else:</p>
<pre tabindex="0"><code>~/.claude/sessions/&lt;pid&gt;.json   →   { pid, sessionId, cwd, procStart, ... }
</code></pre><p>A session is alive if that pid is actually running. But you can&rsquo;t just trust the
file&rsquo;s existence — it can linger after a crash — and you can&rsquo;t just <code>kill -0</code> the
pid either, because the kernel recycles pids and you might be poking a process
that <em>reused</em> the number. So the honest check is two-factor:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">os</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">alive</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">procstart</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">os</span><span class="o">.</span><span class="n">kill</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>          <span class="c1"># exists and signalable?</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="p">(</span><span class="ne">ProcessLookupError</span><span class="p">,</span> <span class="ne">OSError</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...and is it the SAME process, not a pid-recycle?</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">stat</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;/proc/</span><span class="si">{</span><span class="n">pid</span><span class="si">}</span><span class="s2">/stat&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">read_text</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">OSError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>             <span class="c1"># exited between the two checks</span>
</span></span><span class="line"><span class="cl">    <span class="n">starttime</span> <span class="o">=</span> <span class="n">stat</span><span class="p">[</span><span class="n">stat</span><span class="o">.</span><span class="n">rindex</span><span class="p">(</span><span class="s2">&#34;)&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="mi">2</span><span class="p">:]</span><span class="o">.</span><span class="n">split</span><span class="p">()[</span><span class="mi">19</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">starttime</span> <span class="o">==</span> <span class="nb">str</span><span class="p">(</span><span class="n">procstart</span><span class="p">)</span>
</span></span></code></pre></div><p>That <code>/proc/&lt;pid&gt;/stat</code> start-time comparison is the difference between &ldquo;I think
it&rsquo;s live&rdquo; and &ldquo;it&rsquo;s live.&rdquo; It&rsquo;s the kind of detail you only get right by caring
about the boring case.</p>
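<p>To feed that check, the registry files have to be read first. A minimal sketch, assuming each <code>~/.claude/sessions/&lt;pid&gt;.json</code> file is a single JSON object carrying at least the <code>pid</code>, <code>sessionId</code> and <code>procStart</code> fields shown above. The function name and the choice to skip unreadable files are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json
from pathlib import Path

SESSIONS = Path.home() / ".claude" / "sessions"   # registry path as described above


def load_registry(root=SESSIONS):
    """Map sessionId to (pid, procStart) for every readable registry file."""
    registry = {}
    for path in Path(root).glob("*.json"):
        try:
            entry = json.loads(path.read_text(encoding="utf-8"))
            registry[entry["sessionId"]] = (int(entry["pid"]), entry["procStart"])
        except (OSError, ValueError, KeyError, TypeError):
            continue                   # unreadable or partial file: ignore it
    return registry
</code></pre></div>
<p>Each <code>(pid, procStart)</code> pair can then be handed to <code>alive()</code>; a registry entry that fails the check is a leftover from a crash, not a live session.</p>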
<p>With that, every session resolves to a real state:</p>
<ul>
<li><code>● live</code> — a process is running now</li>
<li><code>⧗ waiting</code> — no process; you left mid-conversation (last line was Claude)</li>
<li><code>· idle</code> — no process; stale</li>
</ul>
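<p>The three states can be sketched as one small function. This assumes the liveness check has already run, and that a transcript record written by Claude carries <code>"type": "assistant"</code>; that record type is an assumption for illustration, not something confirmed above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def session_state(is_live, last_record_type):
    """Return one of 'live', 'waiting', 'idle' for a session."""
    if is_live:
        return "live"                  # a verified process is running now
    if last_record_type == "assistant":
        return "waiting"               # Claude spoke last (assumed record type)
    return "idle"                      # no process, nothing pending
</code></pre></div>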
<p>And the payoff for getting <em>liveness</em> right: if you try to resume a session that&rsquo;s
still live in another terminal, the tool refuses to double-open it — two processes
appending to one transcript is how you corrupt your own history — and offers a
clean <code>--fork-session</code> instead.</p>
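<p>A rough sketch of that guard, assuming <code>--fork-session</code> is passed alongside <code>--resume</code>. The helper name is hypothetical, and it simplifies the behaviour described above by treating the fork offer as already accepted.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def resume_argv(session_id, is_live):
    """Build the claude command line for resuming one session."""
    argv = ["claude", "--resume", session_id]
    if is_live:
        # Another process owns the transcript: branch instead of double-opening.
        argv.append("--fork-session")
    return argv
</code></pre></div>
<p>The resulting list can be passed to <code>os.execvp(argv[0], argv)</code> so the picker replaces itself with the <code>claude</code> process.</p>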
<h2 id="the-summaries-were-free-the-whole-time">The summaries were free the whole time</h2>
<p>The feature I assumed I&rsquo;d have to build — a short, human description of each
session — I didn&rsquo;t build at all. Claude Code already writes one. Buried in the
transcript is a record type:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;ai-title&#34;</span><span class="p">,</span> <span class="nt">&#34;aiTitle&#34;</span><span class="p">:</span> <span class="s2">&#34;Investigate nested o directories&#34;</span><span class="p">,</span> <span class="nt">&#34;sessionId&#34;</span><span class="p">:</span> <span class="s2">&#34;...&#34;</span><span class="p">}</span>
</span></span></code></pre></div><p>Claude titles your sessions for you. The &ldquo;summary&rdquo; column in my tool is just that
field, with a fallback to your last prompt. The best line of code is the one you
delete after noticing the platform already did the work.</p>
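<p>Picking the summary out of a parsed transcript might look like the sketch below. The <code>ai-title</code> record and its <code>aiTitle</code> field are as shown above; the shape of a user prompt record (<code>"type": "user"</code> with a string under <code>message.content</code>) is an assumption made for the fallback, as is the function name.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def session_title(records, fallback="(untitled)"):
    """Pick a display title: newest ai-title record, else the last prompt."""
    title = None
    last_prompt = None
    for rec in records:
        if rec.get("type") == "ai-title" and rec.get("aiTitle"):
            title = rec["aiTitle"]     # later titles overwrite earlier ones
        elif rec.get("type") == "user":
            # Assumed shape: {"type": "user", "message": {"content": "..."}}
            message = rec.get("message")
            content = message.get("content") if isinstance(message, dict) else None
            if isinstance(content, str) and content.strip():
                last_prompt = content.strip()
    return title or last_prompt or fallback
</code></pre></div>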
<h2 id="so-what-did-i-actually-build">So what did I actually build</h2>
<p>Not much, and that&rsquo;s the point. <code>wt</code> is one Python file, standard library only,
no daemon. It globs the transcripts, reads each one&rsquo;s title and last-activity,
joins that against the pid-verified live registry, sorts live-first, and prints a
numbered list. Pick a number and it <code>exec</code>s into <code>claude --resume</code>. There&rsquo;s a
<code>-t</code> for tmux when I genuinely need it, a <code>d</code> to archive old sessions (a file
move, fully reversible), and a guarded hook that turns it into an SSH login
greeting so the box tells me what&rsquo;s on it the moment I land.</p>
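<p>The sort and the age column are the only parts with any logic in them. A sketch under stated assumptions: each row is a dict with a <code>live</code> flag and an <code>age</code> in seconds, and ages are rendered in the compact style <code>16s</code> / <code>1d07h</code>. The intermediate hour and minute formats are a guess at a consistent scheme, not taken from the tool.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def format_age(seconds):
    """Render an age compactly, using the two largest non-zero units."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    if days:
        return f"{days}d{hours:02d}h"
    if hours:
        return f"{hours}h{minutes:02d}m"
    if minutes:
        return f"{minutes}m{secs:02d}s"
    return f"{secs}s"


def sort_rows(rows):
    """Order rows live-first, then most recently active first."""
    # False sorts before True, so negating 'live' puts live sessions on top.
    return sorted(rows, key=lambda row: (not row["live"], row["age"]))
</code></pre></div>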
<pre tabindex="0"><code>  watchtower · 5 session(s)
   1) ●  live       16s  homelab    595e931d  Investigate nested o directories
   2) ·  idle     1d07h  notes-app  6565b121  Migrate to server components
  [#]resume  [t#]tmux  [d#]archive  [n]ew  [Enter]shell  [q]uit ▸
</code></pre><p>If you want it, it&rsquo;s <a href="https://github.com/janos-gyorgy/claude-watchtower">on GitHub</a>,
MIT. But honestly, I&rsquo;d rather you take the three things I had to learn than the
tool:</p>
<ol>
<li><strong>Your Claude history lives in a plain append-only JSONL on disk, not in
tmux.</strong> <code>--resume</code> works without any wrapper. Back up <code>~/.claude/projects/</code> and
you&rsquo;ve backed up every conversation you&rsquo;ve had.</li>
<li><strong>Liveness is a separate fact from history</strong>, and checking it honestly means
verifying the pid <em>is the same process</em> — not just that something answers to
the number.</li>
<li><strong>The platform probably already did the boring work</strong> (here: the titles). Read
the format before you write the feature.</li>
</ol>
<p>The flooded-market thing turns out not to matter. A tool that fits your own hands
is worth building even when fifty others exist — <em>especially</em> when it&rsquo;s small
enough that &ldquo;build&rdquo; and &ldquo;understand the system underneath&rdquo; are the same afternoon.</p>
]]></content:encoded></item><item><title>🚨 Don't Restart the Node. Quarantine It First.</title><link>https://blog.hippotion.com/posts/dont-restart-quarantine-first/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/dont-restart-quarantine-first/</guid><description>Rebooting a misbehaving node feels productive. It isn&amp;rsquo;t. You&amp;rsquo;re erasing your evidence and skipping the lesson.</description><content:encoded><![CDATA[<h2 id="the-reflex">The reflex</h2>
<p>Something&rsquo;s wrong. A GitLab runner stops picking up jobs. An event processor starts dropping messages. A pod restarts in a loop. The node looks healthy — CPU fine, memory fine — but something is clearly off.</p>
<p>The reflex: restart the node, see if it clears.</p>
<p>Sometimes it does clear, and you move on. But you didn&rsquo;t fix anything. You reset the state and crossed your fingers. If it happens again in two weeks, you&rsquo;ll do the same thing. After enough iterations you have a &ldquo;flaky node&rdquo; that everyone reboots periodically and nobody understands.</p>
<p>There&rsquo;s a better sequence. It takes twenty minutes instead of two, and you come out with either a real fix or actual knowledge of what happened.</p>
<hr>
<h2 id="step-one-quarantine-dont-kill">Step one: quarantine, don&rsquo;t kill</h2>
<p>Before you touch anything, take the node out of rotation without destroying its current state.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl cordon &lt;node&gt;
</span></span></code></pre></div><p>Cordon marks the node as unschedulable. No new pods land on it. Existing pods keep running. If you need the workloads somewhere else immediately:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl drain &lt;node&gt; --ignore-daemonsets --delete-emptydir-data
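</span></span></code></pre></div><p>Now you&rsquo;ve removed the node from production traffic without rebooting. The node is still alive. With a cordon alone, everything that happened on it is still there: logs, open files, kernel ring buffer, running processes, memory state. A drain evicts the pods, so their processes and <code>emptyDir</code> contents go with them, but the node-level evidence (kernel ring buffer, system journal, disk state) survives.</p>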
<p>This is the difference. A reboot wipes that. A cordon preserves it.</p>
<hr>
<h2 id="step-two-look-at-whats-actually-there">Step two: look at what&rsquo;s actually there</h2>
<p>SSH in. Don&rsquo;t grep for anything specific yet — do a pass for anything unusual.</p>
<p><strong>Kernel messages first.</strong> The kernel will often tell you exactly what went wrong before any application did.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">dmesg -T --level<span class="o">=</span>err,warn <span class="p">|</span> tail -50
</span></span></code></pre></div><p>OOM kills show up here. Disk errors show up here. CPU soft lockups show up here. If you&rsquo;ve got any of those, you have your answer before you&rsquo;ve even looked at application logs.</p>
<p><strong>Check for filesystem problems.</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">df -h          <span class="c1"># is anything full?</span>
</span></span><span class="line"><span class="cl">dmesg <span class="p">|</span> grep -i <span class="s2">&#34;ext4\|xfs\|btrfs\|i/o error\|ata&#34;</span>
</span></span></code></pre></div><p>A filesystem at 100% is silent until it isn&rsquo;t. A flaky drive starts dropping I/O errors into dmesg long before SMART reports anything. Application developers rarely think about this case — their app just starts writing logs that say &ldquo;failed to write&rdquo; without specifying that the disk is full or dying.</p>
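<p>The same check can be scripted. A minimal sketch using <code>os.statvfs</code> (Unix only), which reports blocks and inodes separately, so a filesystem that has free space but no free inodes does not slip through. The helper names are hypothetical, and the percentages are computed slightly differently from <code>df</code>, so expect them to be close rather than identical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import os


def usage_pct(total, available):
    """Percentage used, or 0.0 when the filesystem reports no total."""
    if not total:
        return 0.0                     # some filesystems report 0 inodes
    return 100.0 * (total - available) / total


def fs_usage(path):
    """Return (space_used_pct, inode_used_pct) for the filesystem holding path."""
    st = os.statvfs(path)
    # f_bavail / f_favail count what an unprivileged process can still use.
    space = usage_pct(st.f_blocks, st.f_bavail)
    inodes = usage_pct(st.f_files, st.f_favail)
    return space, inodes
</code></pre></div>
<p>Calling <code>fs_usage("/")</code>, or the same on whatever mount point the failing workload writes to, gives both numbers in one go.</p>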
<p><strong>System resource pressure.</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">vmstat <span class="m">1</span> <span class="m">5</span>          <span class="c1"># is there swap activity?</span>
</span></span><span class="line"><span class="cl">iostat -x <span class="m">1</span> <span class="m">5</span>       <span class="c1"># is a disk saturated?</span>
</span></span><span class="line"><span class="cl">cat /proc/pressure/io   <span class="c1"># kernel PSI — pressure stall info</span>
</span></span></code></pre></div><p>PSI is underused. It tells you whether processes were actually stalled waiting for I/O, not just whether throughput was high. A disk at 80% utilisation might be fine; a disk with 40% I/O PSI pressure is actively hurting performance.</p>
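<p>The PSI files are easy to parse if you want to log or alert on them. Each line starts with <code>some</code> (at least one task stalled) or <code>full</code> (all non-idle tasks stalled), followed by <code>avg10</code>, <code>avg60</code> and <code>avg300</code> percentages and a cumulative <code>total</code> in microseconds. A small sketch, with a hypothetical function name:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]      # kind is 'some' or 'full'
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {key: float(val) for key, val in values.items()}
    return result
</code></pre></div>
<p>With that, <code>parse_psi(open("/proc/pressure/io").read())["some"]["avg10"]</code> is the share of the last ten seconds in which at least one task was stalled on I/O.</p>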
<p><strong>What were the pods doing right before things went sideways?</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl describe node &lt;node&gt;    <span class="c1"># events section at the bottom</span>
</span></span><span class="line"><span class="cl">kubectl get events -A --field-selector involvedObject.kind<span class="o">=</span>Pod --sort-by<span class="o">=</span>.lastTimestamp
</span></span></code></pre></div><p>Look for OOMKilled exits, failed liveness probes, and throttling events. Kubernetes events expire after an hour by default — another reason not to reboot immediately; those events are still there if you look now.</p>
<hr>
<h2 id="a-real-example-the-gitlab-runner">A real example: the GitLab runner</h2>
<p>A GitLab runner pod stops picking up jobs. It looks alive — the process is running, no crashes in the pod logs. Jobs sit in the queue.</p>
<p>Restart reflex: delete the pod, let it reschedule, it picks up jobs again.</p>
<p>But why did it stop?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">journalctl -u gitlab-runner --since <span class="s2">&#34;1 hour ago&#34;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># or, if it&#39;s a container:</span>
</span></span><span class="line"><span class="cl">kubectl logs &lt;runner-pod&gt;    <span class="c1"># add --previous only if the container has restarted</span>
</span></span></code></pre></div><p>In one instance: the runner&rsquo;s working directory was on a tmpfs that hit its size limit. The runner silently failed to create job workspaces and stopped accepting new jobs. The error was one line in the pod logs: <code>mkdir /builds: no space left on device</code>. The pod was healthy by every other metric.</p>
<p>Fix: bump the tmpfs size limit in the runner config. The restart would have cleared tmpfs temporarily, and the runner would have failed again the next time a large job filled it up.</p>
<p>The debug took five minutes. The permanent fix took two minutes. Without quarantining the node first, the evidence would have been gone.</p>
<hr>
<h2 id="another-one-the-event-consumer">Another one: the event consumer</h2>
<p>An event processor starts falling behind. Messages queue up. The pod shows no errors. Memory looks fine.</p>
<p>This one was subtler: the processor was connected to a downstream dependency over a persistent TCP connection. The connection had gone into a half-open state — the processor thought it was alive, the remote end had already dropped it. New messages were being sent into a dead socket and silently discarded.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ss -tnp <span class="p">|</span> grep &lt;pid&gt;    <span class="c1"># look at the socket state</span>
</span></span></code></pre></div><p><code>CLOSE_WAIT</code> (printed by <code>ss</code> as <code>CLOSE-WAIT</code>) on a connection that should be <code>ESTABLISHED</code>: the remote end had closed its side and the application never noticed. It wasn&rsquo;t checking whether the connection was actually working before using it, just whether it existed.</p>
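<p>Spotting this by eye works once. A small sketch for doing it repeatedly, assuming the usual <code>ss -tn</code> column order (State, Recv-Q, Send-Q, local address, peer address) and the hyphenated <code>CLOSE-WAIT</code> spelling that <code>ss</code> prints. The function name is hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def close_wait_peers(ss_output):
    """Return the peer address of every CLOSE-WAIT socket in `ss -tn` output."""
    peers = []
    for line in ss_output.splitlines():
        cols = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port [Process]
        if cols[:1] == ["CLOSE-WAIT"] and cols[4:5]:
            peers.append(cols[4])
    return peers
</code></pre></div>
<p>Feed it the text of <code>ss -tn</code> (for example via <code>subprocess.run</code>); a non-empty result for a dependency the application believes it is connected to is the symptom described here.</p>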
<p>Restart would have cleared the socket state, fixed the symptom, and left the bug in the code.</p>
<hr>
<h2 id="what-to-look-for--a-short-checklist">What to look for — a short checklist</h2>
<p>When a node is misbehaving, in order:</p>
<ol>
<li><code>dmesg -T --level=err,warn</code> — kernel errors, OOM kills, disk errors</li>
<li><code>df -h &amp;&amp; df -i</code> — full filesystems (space and inodes separately)</li>
<li><code>kubectl describe node &lt;node&gt;</code> — pressure conditions, recent events</li>
<li><code>kubectl logs &lt;pod&gt; --previous</code> — what the pod logged before it died or got stuck</li>
<li><code>ss -tnp</code> — socket states for network-adjacent issues</li>
<li><code>vmstat 1 5</code> + <code>iostat -x 1 5</code> — resource pressure</li>
<li><code>journalctl -p err -b</code> — system journal errors since last boot</li>
</ol>
<p>Most problems show up in the first three.</p>
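<p>If you would rather not type the list under pressure, the node-local part of it can be captured in one go. A sketch, not a finished tool: the function name, the file naming, and the 60-second timeout are arbitrary choices, <code>--no-pager</code> is added so <code>journalctl</code> does not wait for a terminal, and the <code>kubectl</code> steps are left out because they need a node or pod name.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import subprocess
from pathlib import Path

# The checklist above, one argv per command ("df -h" and "df -i" run separately).
CHECKLIST = [
    ["dmesg", "-T", "--level=err,warn"],
    ["df", "-h"],
    ["df", "-i"],
    ["ss", "-tnp"],
    ["vmstat", "1", "5"],
    ["iostat", "-x", "1", "5"],
    ["journalctl", "-p", "err", "-b", "--no-pager"],
]


def capture(commands, outdir):
    """Run each command and save its output to a numbered file in outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, argv in enumerate(commands, start=1):
        target = out / f"{index:02d}-{Path(argv[0]).name}.txt"
        try:
            done = subprocess.run(argv, capture_output=True, text=True,
                                  errors="replace", timeout=60)
            body = done.stdout + done.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            body = f"could not run: {exc}\n"     # tool missing or hung
        target.write_text(body, encoding="utf-8")
        saved.append(target)
    return saved
</code></pre></div>
<p>Running <code>capture(CHECKLIST, "/root/evidence")</code> on the cordoned node leaves a directory you can read after the reboot; the output path is only an example.</p>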
<hr>
<h2 id="after-youve-found-something-or-not-found-something">After you&rsquo;ve found something (or not found something)</h2>
<p><strong>If you found the cause:</strong> fix it, test it, uncordon the node.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl uncordon &lt;node&gt;
</span></span></code></pre></div><p>Document what you found — a comment in the relevant config, a commit message, a note. &ldquo;Fixed runner tmpfs limit&rdquo; in the commit history is more useful than &ldquo;flaky runner, restarted.&rdquo;</p>
<p><strong>If you genuinely found nothing:</strong> that&rsquo;s information too. Cordon, reboot, uncordon, and note that the node rebooted clean with no identified cause. If it happens again, you have a pattern. Check whether anything changed in the workloads around that time. Check whether the reboot timing correlates with anything — cron jobs, backups, maintenance windows.</p>
<p>A reboot you can explain is a fix. A reboot you can&rsquo;t explain is a time bomb.</p>
<hr>
<h2 id="why-this-matters-on-a-single-node-cluster">Why this matters on a single-node cluster</h2>
<p>In a multi-node setup you can afford to be lazier — cordon, drain, reboot, let the scheduler handle it, look at it later. On a single node there&rsquo;s no &ldquo;later.&rdquo; The node coming back is all you&rsquo;ve got.</p>
<p>But the habit is worth building regardless of node count. The engineers who understand their systems are the ones who looked before they rebooted.</p>
<hr>
<h2 id="the-actual-rule">The actual rule</h2>
<p><strong>Quarantine first. Debug second. Restart third (if you still need to).</strong></p>
<p>A restart takes two minutes. The evidence it destroys might take two hours to reconstruct — or might be gone for good. The cordon costs you nothing.</p>
]]></content:encoded></item></channel></rss>