<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Design on hippotion</title><link>https://blog.hippotion.com/tags/design/</link><description>Recent content in Design on hippotion</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 26 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.hippotion.com/tags/design/index.xml" rel="self" type="application/rss+xml"/><item><title>I Let Two AIs Argue Over My Homepage — To See If They Actually Think Differently</title><link>https://blog.hippotion.com/posts/two-ais-argue-homepage/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.hippotion.com/posts/two-ais-argue-homepage/</guid><description>A single model playing scrum master is just one mind doing all the voices. So I gave two models from two different labs the same deliberately vague design brief, let them collide, and watched what each one refused to give up. I wasn&amp;rsquo;t trying to prove the merge was better — I was trying to prove they were different.</description><content:encoded><![CDATA[<p>I saw someone build an AI agent that &ldquo;leads scrum&rdquo; — runs the standup, asks each
person for an update, summarises. My gut said: that&rsquo;s cosmetic. I spent years at Red
Hat doing the real version of this — standups, retros, design reviews, the whole
liturgy. The ceremony was never the point. The point was that the people in the room
genuinely disagreed with each other.</p>
<p>A single model running a standup can&rsquo;t do that. It&rsquo;s one mind doing all the voices. Ask
it to play five roles and it just agrees with itself politely. So I wanted to test the
opposite: if you put two <em>genuinely different</em> minds in the room — two models from two
different labs — do you get real disagreement, or do they also just nod along?</p>
<h2 id="the-first-idea-was-a-dead-end">The first idea was a dead end</h2>
<p>My first attempt was code review. Have Claude and Codex review the same diff
independently, then compare. I seeded a file with a few subtle bugs and ran it.</p>
<p>They agreed on almost everything. Both caught the shell-injection, both caught the
ignored exit code, both caught the off-by-one. Where they &ldquo;differed&rdquo; turned out to be
the same bug labelled two different ways, or just run-to-run noise.</p>
<p>That&rsquo;s not a failure of the models — it&rsquo;s a failure of the test. A clear bug has one
right answer, so any two competent reviewers converge on it. There&rsquo;s no room for taste
when there&rsquo;s a correct answer. If I wanted to see two minds actually differ, I had to
take away the right answer.</p>
<h2 id="so-i-gave-them-a-vague-brief-instead">So I gave them a vague brief instead</h2>
<p>I have a homelab dashboard — my &ldquo;HQ&rdquo; — and I&rsquo;d been meaning to redesign it into a space
where each of my little agents shows something about itself. I wrote that up
deliberately vaguely: <em>here&rsquo;s what exists, now you decide the structure, the metaphor,
the whole feel.</em> No right answer. Pure taste.</p>
<p>Then: each model designs it alone, no peeking. Then they swap plans and revise — but I
told them explicitly <strong>not</strong> to be agreeable, that holding their own taste was fine.
That instruction matters. Two LLMs shown each other&rsquo;s work will merge politely into mush
if you let them, and a polite merge proves nothing — it&rsquo;s the cosmetic scrum all over
again.</p>
<p>I wasn&rsquo;t trying to prove the combined plan was <em>better</em>. I was trying to prove they were
<em>different</em>.</p>
<h2 id="they-were-different--and-they-held-it">They were different — and they held it</h2>
<p>The two plans came back genuinely unalike.</p>
<p><strong>Claude</strong> designed from a thesis: &ldquo;a dashboard&rsquo;s green lights lie about agents whose
whole job is to stay silent.&rdquo; It built everything off one metaphor and gave the agents
little spoken creeds — <em>&ldquo;I draft. I never send.&rdquo;</em></p>
<p><strong>Codex</strong> designed from a contract: a strict per-agent manifest, &ldquo;show me the mechanics,&rdquo;
explicitly anti-cute — it literally warned that the talking-in-character bit &ldquo;risks
becoming precious.&rdquo;</p>
<p>One designs like a product person with a point of view. The other like a staff engineer
building for extensibility. Same brief, no right answer, two clearly different tastes.</p>
<p>Then the interesting part. When they saw each other&rsquo;s work, they didn&rsquo;t dissolve into
agreement — but they didn&rsquo;t dig in blindly either. <strong>Each conceded in the other&rsquo;s area
of strength and held firm in its own.</strong> Codex dropped its dashboard landing and took
Claude&rsquo;s silence-first idea. Claude threw out its hand-wavy &ldquo;just read the files&rdquo; bit and
adopted Codex&rsquo;s manifest. And on the one pure-taste question — voiced creeds vs. dry
operational tone — <em>neither budged</em>. That surviving disagreement is the cleanest proof
they&rsquo;re two different minds and not one model wearing two hats.</p>
<p>That&rsquo;s a real design review. People hold the things that matter to them and concede the
things that don&rsquo;t. The cosmetic version is everyone nodding.</p>
<h2 id="the-honest-part">The honest part</h2>
<p>I made the three calls the two models couldn&rsquo;t make for me — the ones that were pure
taste, not correctness: keep the voices, use small cards, show the plumbing. Then I took
their merged plan and actually built it. My homepage now opens with a &ldquo;Desk&rdquo; of what the
agents found and a &ldquo;Roost&rdquo; where each one introduces itself in its own voice. It works,
and it&rsquo;s better than what I&rsquo;d have specced alone — but the taste calls were mine, and the
machines were the two reviewers, not the decider.</p>
<p>The thing I&rsquo;ll keep from this: &ldquo;get the AIs to collaborate&rdquo; is easy to fake and the fake
looks fine until you check whether anyone actually disagreed. The tell isn&rsquo;t whether they
reached consensus. It&rsquo;s whether either of them held something back.</p>
]]></content:encoded></item></channel></rss>