<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Debugging on hippotion</title><link>https://blog.hippotion.com/tags/debugging/</link><description>Recent content in Debugging on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 01 Aug 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/debugging/index.xml" rel="self" type="application/rss+xml"/><item><title>🚨 Don't Restart the Node. Quarantine It First.</title><link>https://blog.hippotion.com/posts/dont-restart-quarantine-first/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/dont-restart-quarantine-first/</guid><description>Rebooting a misbehaving node feels productive. It isn&amp;rsquo;t. You&amp;rsquo;re erasing your evidence and skipping the lesson.</description><content:encoded><![CDATA[<h2 id="the-reflex">The reflex</h2>
<p>Something&rsquo;s wrong. A GitLab runner stops picking up jobs. An event processor starts dropping messages. A pod restarts in a loop. The node looks healthy — CPU fine, memory fine — but something is clearly off.</p>
<p>The reflex: restart the node, see if it clears.</p>
<p>Sometimes it does clear, and you move on. But you didn&rsquo;t fix anything. You reset the state and crossed your fingers. If it happens again in two weeks, you&rsquo;ll do the same thing. After enough iterations you have a &ldquo;flaky node&rdquo; that everyone reboots periodically and nobody understands.</p>
<p>There&rsquo;s a better sequence. It takes twenty minutes instead of two, and you come out with either a real fix or actual knowledge of what happened.</p>
<hr>
<h2 id="step-one-quarantine-dont-kill">Step one: quarantine, don&rsquo;t kill</h2>
<p>Before you touch anything, take the node out of rotation without destroying its current state.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl cordon &lt;node&gt;
</span></span></code></pre></div><p>Cordon marks the node as unschedulable. No new pods land on it. Existing pods keep running. If you need the workloads somewhere else immediately:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl drain &lt;node&gt; --ignore-daemonsets --delete-emptydir-data
</span></span></code></pre></div><p>Now you&rsquo;ve removed the node from production traffic without rebooting. The node is still alive. Everything that happened on it is still there: logs, open files, kernel ring buffer, running processes, memory state.</p>
<p>This is the difference. A reboot wipes that. A cordon preserves it.</p>
<hr>
<h2 id="step-two-look-at-whats-actually-there">Step two: look at what&rsquo;s actually there</h2>
<p>SSH in. Don&rsquo;t grep for anything specific yet — do a pass for anything unusual.</p>
<p><strong>Kernel messages first.</strong> The kernel will often tell you exactly what went wrong before any application did.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">dmesg -T --level<span class="o">=</span>err,warn <span class="p">|</span> tail -50
</span></span></code></pre></div><p>OOM kills show up here. Disk errors show up here. CPU soft lockups show up here. If you&rsquo;ve got any of those, you have your answer before you&rsquo;ve even looked at application logs.</p>
<p><strong>Check for filesystem problems.</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">df -h          <span class="c1"># is anything full?</span>
</span></span><span class="line"><span class="cl">dmesg <span class="p">|</span> grep -i <span class="s2">&#34;ext4\|xfs\|btrfs\|i/o error\|ata&#34;</span>
</span></span></code></pre></div><p>A filesystem at 100% is silent until it isn&rsquo;t. A flaky drive starts dropping I/O errors into dmesg long before SMART reports anything. Application developers rarely think about this case — their app just starts writing logs that say &ldquo;failed to write&rdquo; without specifying that the disk is full or dying.</p>
<p><strong>System resource pressure.</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">vmstat <span class="m">1</span> <span class="m">5</span>          <span class="c1"># is there swap activity?</span>
</span></span><span class="line"><span class="cl">iostat -x <span class="m">1</span> <span class="m">5</span>       <span class="c1"># is a disk saturated?</span>
</span></span><span class="line"><span class="cl">cat /proc/pressure/io   <span class="c1"># kernel PSI — pressure stall info</span>
</span></span></code></pre></div><p>PSI is underused. It tells you whether processes were actually stalled waiting for I/O, not just whether throughput was high. A disk at 80% utilisation might be fine; a disk with 40% I/O PSI pressure is actively hurting performance.</p>
<p><strong>What were the pods doing right before things went sideways?</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl describe node &lt;node&gt;    <span class="c1"># events section at the bottom</span>
</span></span><span class="line"><span class="cl">kubectl get events --field-selector involvedObject.kind<span class="o">=</span>Pod -A <span class="p">|</span> sort -k1
</span></span></code></pre></div><p>Look for OOMKilled exits, failed liveness probes, and throttling events. Kubernetes events expire after an hour by default — another reason not to reboot immediately; those events are still there if you look now.</p>
<hr>
<h2 id="a-real-example-the-gitlab-runner">A real example: the GitLab runner</h2>
<p>A GitLab runner pod stops picking up jobs. It looks alive — the process is running, no crashes in the pod logs. Jobs sit in the queue.</p>
<p>Restart reflex: delete the pod, let it reschedule, it picks up jobs again.</p>
<p>But why did it stop?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">journalctl -u gitlab-runner --since <span class="s2">&#34;1 hour ago&#34;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># or, if it&#39;s a container:</span>
</span></span><span class="line"><span class="cl">kubectl logs &lt;runner-pod&gt; --previous
</span></span></code></pre></div><p>In one instance: the runner&rsquo;s working directory was on a tmpfs that hit its size limit. The runner silently failed to create job workspaces and stopped accepting new jobs. The error was one line in the pod logs: <code>mkdir /builds: no space left on device</code>. The pod was healthy by every other metric.</p>
<p>Fix: bump the tmpfs size limit in the runner config. The restart would have cleared tmpfs temporarily, and the runner would have failed again the next time a large job filled it up.</p>
<p>The debug took five minutes. The permanent fix took two minutes. Without quarantining the node first, the evidence was gone.</p>
<hr>
<h2 id="another-one-the-event-consumer">Another one: the event consumer</h2>
<p>An event processor starts falling behind. Messages queue up. The pod shows no errors. Memory looks fine.</p>
<p>This one was subtler: the processor was connected to a downstream dependency over a persistent TCP connection. The connection had gone into a half-open state — the processor thought it was alive, the remote end had already dropped it. New messages were being sent into a dead socket and silently discarded.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ss -tnp <span class="p">|</span> grep &lt;pid&gt;    <span class="c1"># look at the socket state</span>
</span></span></code></pre></div><p><code>CLOSE_WAIT</code> on a connection that should be <code>ESTABLISHED</code>. The application wasn&rsquo;t checking whether the connection was actually working before using it, just whether it existed.</p>
<p>Restart would have cleared the socket state, fixed the symptom, and left the bug in the code.</p>
<hr>
<h2 id="what-to-look-for--a-short-checklist">What to look for — a short checklist</h2>
<p>When a node is misbehaving, in order:</p>
<ol>
<li><code>dmesg -T --level=err,warn</code> — kernel errors, OOM kills, disk errors</li>
<li><code>df -h &amp;&amp; df -i</code> — full filesystems (space and inodes separately)</li>
<li><code>kubectl describe node &lt;node&gt;</code> — pressure conditions, recent events</li>
<li><code>kubectl logs &lt;pod&gt; --previous</code> — what the pod logged before it died or got stuck</li>
<li><code>ss -tnp</code> — socket states for network-adjacent issues</li>
<li><code>vmstat 1 5</code> + <code>iostat -x 1 5</code> — resource pressure</li>
<li><code>journalctl -p err -b</code> — system journal errors since last boot</li>
</ol>
<p>Most problems show up in the first three.</p>
<hr>
<h2 id="after-youve-found-something-or-not-found-something">After you&rsquo;ve found something (or not found something)</h2>
<p><strong>If you found the cause:</strong> fix it, test it, uncordon the node.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">kubectl uncordon &lt;node&gt;
</span></span></code></pre></div><p>Document what you found — a comment in the relevant config, a commit message, a note. &ldquo;Fixed runner tmpfs limit&rdquo; in the commit history is more useful than &ldquo;flaky runner, restarted.&rdquo;</p>
<p><strong>If you genuinely found nothing:</strong> that&rsquo;s information too. Cordon, reboot, uncordon, and note that the node rebooted clean with no identified cause. If it happens again, you have a pattern. Check whether anything changed in the workloads around that time. Check whether the reboot timing correlates with anything — cron jobs, backups, maintenance windows.</p>
<p>A reboot you can explain is a fix. A reboot you can&rsquo;t explain is a time bomb.</p>
<hr>
<h2 id="why-this-matters-on-a-single-node-cluster">Why this matters on a single-node cluster</h2>
<p>In a multi-node setup you can afford to be lazier — cordon, drain, reboot, let the scheduler handle it, look at it later. On a single node there&rsquo;s no &ldquo;later.&rdquo; The node coming back is all you&rsquo;ve got.</p>
<p>But the habit is worth building regardless of node count. The engineers who understand their systems are the ones who looked before they rebooted.</p>
<hr>
<h2 id="the-actual-rule">The actual rule</h2>
<p><strong>Quarantine first. Debug second. Restart third (if you still need to).</strong></p>
<p>A restart takes two minutes. The evidence it destroys might take two hours to reconstruct — or might be gone for good. The cordon costs you nothing.</p>
]]></content:encoded></item></channel></rss>