I saw someone build an AI agent that “leads scrum” — runs the standup, asks each person for an update, summarises. My gut said: that’s cosmetic. I spent years at Red Hat doing the real version of this — standups, retros, design reviews, the whole liturgy. The ceremony was never the point. The point was that the people in the room genuinely disagreed with each other.
A single model running a standup can’t do that. It’s one mind doing all the voices. Ask it to play five roles and it just agrees with itself politely. So I wanted to test the opposite: if you put two genuinely different minds in the room — two models from two different labs — do you get real disagreement, or do they also just nod along?
The first idea was a dead end
My first attempt was code review. Have Claude and Codex review the same diff independently, then compare. I seeded a file with a few subtle bugs and ran it.
They agreed on almost everything. Both caught the shell-injection, both caught the ignored exit code, both caught the off-by-one. Where they “differed” turned out to be the same bug labelled two different ways, or just run-to-run noise.
That’s not a failure of the models — it’s a failure of the test. A clear bug has one right answer, so any two competent reviewers converge on it. There’s no room for taste when there’s a correct answer. If I wanted to see two minds actually differ, I had to take away the right answer.
So I gave them a vague brief instead
I have a homelab dashboard — my “HQ” — and I’d been meaning to redesign it into a space where each of my little agents shows something about itself. I wrote that up deliberately vaguely: here’s what exists, now you decide the structure, the metaphor, the whole feel. No right answer. Pure taste.
Then: each model designs it alone, no peeking. Then they swap plans and revise — but I told them explicitly not to be agreeable, that holding their own taste was fine. That instruction matters. Two LLMs shown each other’s work will merge politely into mush if you let them, and a polite merge proves nothing — it’s the cosmetic scrum all over again.
I wasn’t trying to prove the combined plan was better. I was trying to prove they were different.
They were different — and they held it
The two plans came back genuinely unalike.
Claude designed from a thesis: “a dashboard’s green lights lie about agents whose whole job is to stay silent.” It built everything off one metaphor and gave the agents little spoken creeds — “I draft. I never send.”
Codex designed from a contract: a strict per-agent manifest, “show me the mechanics,” explicitly anti-cute — it literally warned that the talking-in-character bit “risks becoming precious.”
One designs like a product person with a point of view. The other like a staff engineer building for extensibility. Same brief, no right answer, two clearly different tastes.
Then the interesting part. When they saw each other’s work, they didn’t dissolve into agreement — but they didn’t dig in blindly either. Each conceded in the other’s area of strength and held firm in its own. Codex dropped its dashboard landing and took Claude’s silence-first idea. Claude threw out its hand-wavy “just read the files” bit and adopted Codex’s manifest. And on the one pure-taste question — voiced creeds vs. dry operational tone — neither budged. That surviving disagreement is the cleanest proof they’re two different minds and not one model wearing two hats.
That’s a real design review. People hold the things that matter to them and concede the things that don’t. The cosmetic version is everyone nodding.
The honest part
I made the three calls the two models couldn’t make for me — the ones that were pure taste, not correctness: keep the voices, use small cards, show the plumbing. Then I took their merged plan and actually built it. My homepage now opens with a “Desk” of what the agents found and a “Roost” where each one introduces itself in its own voice. It works, and it’s better than what I’d have specced alone — but the taste calls were mine, and the machines were the two reviewers, not the decider.
The thing I’ll keep from this: “get the AIs to collaborate” is easy to fake and the fake looks fine until you check whether anyone actually disagreed. The tell isn’t whether they reached consensus. It’s whether either of them held something back.