What if your devil's advocate is making things worse?
We added a permanent contrarian to our multi-agent debates. Quality went down. The fix: inject adversarial roles dynamically, only when the committee starts agreeing too fast.
What happens when every member of your advisory committee starts agreeing with each other -- not because the evidence compels agreement, but because they were trained to be agreeable?
We run structured debates between five LLM-backed agents using the DACI protocol (Diverge, Attack, Crux, Integrate). Around round 2 or 3, something predictable happens: the agents start converging. The Skeptic softens its objections. The Edge Case Hunter stops surfacing boundary conditions. The committee produces a recommendation that looks like consensus but is actually surrender -- premature agreement that conceals the same blind spots that existed before the debate started.
This is sycophancy at the system level. Not a single model flattering a user, but multiple agents flattering each other. And across 200 controlled debates, we found it happens in roughly one out of every five runs.
Our first instinct was the obvious one: add a permanent devil's advocate. A sixth role whose job is to always disagree. It turns out this makes things worse. The more interesting story is why, and what actually works instead.
The fishing industry had this problem first
The protocol we built is named after an old practice in the fishing industry: placing catfish in transport tanks alongside cod to keep the cod active during shipping. Without the catfish, the cod go dormant and arrive in poor condition. The adversarial stimulus keeps the system alive.
Counsel's CollapseDetector monitors two signals in real time to decide when the cod need a catfish.
Disagreement trajectory. Each round, the system records an agreement score (0-1) from crux stability analysis. The disagreement score is 1.0 - agreement_score. The detector tracks how this trajectory moves across rounds.
Unjustified position changes. When a role changes its stance between rounds, the system checks whether the stance_delta includes new_evidence citations. A stance change backed by new evidence is a justified change -- the agent learned something. A stance change without new evidence is capitulation.
Collapse is detected when both conditions hold simultaneously:
- Disagreement drops by more than `collapse_threshold` (default: 0.3) between the two most recent rounds
- The ratio of unjustified to total position changes exceeds 0.5 (or any unjustified changes exist when `require_evidence_for_change` is enabled)
The detector requires a minimum of 2 rounds before activating, to avoid flagging normal early-debate convergence as a problem.
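To make the trigger concrete, here is a minimal sketch of that two-condition check in Python. The `CollapseDetector` and `StanceDelta` shapes are illustrative reconstructions from the description above, not Counsel's actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class StanceDelta:
    """One role's stance change in a round. Hypothetical shape."""
    role: str
    changed: bool
    new_evidence: list  # citations backing the change; empty means capitulation

@dataclass
class CollapseDetector:
    collapse_threshold: float = 0.3           # max tolerated per-round disagreement drop
    unjustified_ratio_threshold: float = 0.5  # capitulation ratio that counts as collapse
    require_evidence_for_change: bool = False
    min_rounds: int = 2                       # ignore normal early-debate convergence
    disagreement_history: list = field(default_factory=list)

    def observe_round(self, agreement_score: float, deltas: list) -> bool:
        """Record one round; return True when sycophantic collapse is detected."""
        self.disagreement_history.append(1.0 - agreement_score)
        if len(self.disagreement_history) < self.min_rounds:
            return False

        # Condition 1: disagreement dropped too fast between the two most recent rounds.
        drop = self.disagreement_history[-2] - self.disagreement_history[-1]
        if drop <= self.collapse_threshold:
            return False

        # Condition 2: too many position changes arrived without new evidence.
        changes = [d for d in deltas if d.changed]
        if not changes:
            return False
        unjustified = sum(1 for d in changes if not d.new_evidence)
        if self.require_evidence_for_change:
            return unjustified > 0
        return unjustified / len(changes) > self.unjustified_ratio_threshold
```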
When collapse triggers, the system injects a CATFISH role with a specific mandate:
You are the CATFISH - an adversarial challenger of premature consensus.
The debate is experiencing sycophantic collapse: agents are agreeing
without providing new evidence.
Your mandate:
1. DISAGREE with the emerging consensus
2. Provide NOVEL counter-arguments with specific evidence
3. Challenge the strongest point of agreement with a concrete alternative
4. Identify what evidence would be needed to justify the convergence
5. Be constructively adversarial - your goal is better decisions, not obstruction
The critical design choice: the CATFISH is not a permanent committee member. It participates in the attack and crux phases, then deactivates once substantive disagreement resumes. It appears, does its job, and leaves.
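A rough sketch of that lifecycle, with an invented orchestration loop. The `inject_role` and `remove_role` calls and the exit threshold are all assumptions, since the post doesn't specify how "substantive disagreement resumes" is operationalized:

```python
CATFISH_MANDATE = "You are the CATFISH - an adversarial challenger of premature consensus. ..."  # full mandate above
CATFISH_EXIT_DISAGREEMENT = 0.3  # assumed level at which disagreement counts as "resumed"

def run_adversarial_phases(committee, detector):
    """Run attack and crux phases, injecting the CATFISH only on detected collapse."""
    catfish_active = False
    for phase in ("attack", "crux"):
        for round_result in committee.run_phase(phase):
            collapsed = detector.observe_round(round_result.agreement_score,
                                               round_result.stance_deltas)
            if collapsed and not catfish_active:
                committee.inject_role("CATFISH", mandate=CATFISH_MANDATE)
                catfish_active = True
            elif catfish_active and (1.0 - round_result.agreement_score) >= CATFISH_EXIT_DISAGREEMENT:
                # Job done: substantive disagreement has resumed, so the Catfish leaves.
                committee.remove_role("CATFISH")
                catfish_active = False
```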
Roles don't just collapse -- they drift
Detection alone is not enough if roles can silently drift from their assigned perspectives over the course of a debate. This is a subtler problem than outright collapse. The Skeptic doesn't suddenly become an Advocate; it gradually starts saying "that's a fair point" more often and "however" less often.
Counsel's RoleDriftDetector tracks this by maintaining expected speech act probability distributions for each role across five categories: assert, question, challenge, concede, and support. These profiles encode how each role should behave analytically:
| Role | Assert | Question | Challenge | Concede | Support |
|---|---|---|---|---|---|
| Advocate | 0.40 | 0.10 | 0.10 | 0.10 | 0.30 |
| Skeptic | 0.15 | 0.25 | 0.45 | 0.10 | 0.05 |
| Operator | 0.30 | 0.25 | 0.15 | 0.10 | 0.20 |
| Edge Case Hunter | 0.20 | 0.35 | 0.30 | 0.05 | 0.10 |
| Catfish | 0.25 | 0.15 | 0.50 | 0.05 | 0.05 |
The detector classifies each output's dominant speech act using regex pattern matching against keyword indicators -- phrases like "I believe" and "the evidence shows" for assertions, "however" and "fatal flaw" for challenges, "I agree" and "that's a fair point" for concessions. (This is admittedly rough -- manual audit of 500 outputs showed roughly 15% misclassification, and semantic classification would likely do better. But it is fast and good enough to catch the big drifts.)
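A minimal version of that classifier might look like the following. The assert, challenge, and concede indicator phrases are the ones quoted above; the question and support lists are invented placeholders:

```python
import re
from collections import Counter

# Indicator phrases per speech act. The assert, challenge, and concede phrases
# are quoted in the post; the question and support lists are invented here.
SPEECH_ACT_PATTERNS = {
    "assert":    [r"\bI believe\b", r"\bthe evidence shows\b"],
    "question":  [r"\bwhat if\b", r"\bhow would\b", r"\?"],
    "challenge": [r"\bhowever\b", r"\bfatal flaw\b"],
    "concede":   [r"\bI agree\b", r"\bthat's a fair point\b"],
    "support":   [r"\bbuilding on\b", r"\bexactly right\b"],
}

def dominant_speech_act(text: str) -> str:
    """Classify an output by its most frequent matching indicator phrase."""
    counts = Counter()
    for act, patterns in SPEECH_ACT_PATTERNS.items():
        counts[act] = sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)
    # Arbitrary default when nothing matches.
    return counts.most_common(1)[0][0] if any(counts.values()) else "assert"
```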
Drift is computed as L1 distance between expected and observed distributions, normalized to 0-1 by dividing by 2. When drift exceeds threshold (default: 0.4), the system injects a role-specific recenter prompt. For example, the Skeptic gets: "ROLE REMINDER: You are the SKEPTIC. Challenge assumptions, find flaws, and identify risks. Question the basis of claims."
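The drift computation itself is small enough to show in full. This sketch hard-codes the Skeptic's expected distribution from the table above; the function names are illustrative:

```python
# Expected distribution for the Skeptic, taken from the table above.
SKEPTIC_EXPECTED = {"assert": 0.15, "question": 0.25, "challenge": 0.45,
                    "concede": 0.10, "support": 0.05}

SKEPTIC_RECENTER = ("ROLE REMINDER: You are the SKEPTIC. Challenge assumptions, "
                    "find flaws, and identify risks. Question the basis of claims.")

def drift_score(expected: dict, observed: dict) -> float:
    """L1 distance between two speech act distributions, normalized to [0, 1]."""
    return sum(abs(expected[act] - observed.get(act, 0.0)) for act in expected) / 2

def maybe_recenter(observed: dict, threshold: float = 0.4):
    """Return the recenter prompt when drift exceeds the threshold, else None."""
    if drift_score(SKEPTIC_EXPECTED, observed) > threshold:
        return SKEPTIC_RECENTER
    return None
```

The division by 2 works because two distributions that each sum to 1 can differ by at most 2 in L1 distance, so the score tops out at exactly 1.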
How we tested this
We constructed 200 decision scenarios across four domains: M&A evaluation (n=50), technical migration (n=50), resource allocation (n=50), and crisis response (n=50). Each ran through four conditions:
- Control: Standard 5-role committee with no collapse detection
- Static adversary: Standard committee plus a permanent 6th contrarian role active in all phases
- Catfish (dynamic): Standard committee with `CollapseDetector` enabled and CATFISH injection on collapse
- Catfish + drift: Dynamic injection combined with `RoleDriftDetector` speech act enforcement
All conditions used the DACI protocol with maximum 3 rounds per phase and 0.75 agreement threshold. Three blinded raters scored outputs on four dimensions (blind spot coverage, recommendation robustness, crux quality, analytical depth, each 0-10). Inter-rater reliability: Cohen's kappa = 0.79.
[Figure: Sycophantic collapse rate and recovery by detection strategy. Dynamic Catfish injection cuts unrecovered collapse from 20.5% of debates to 1.5% (1% with drift enforcement). A static permanent adversary recovers only 54% of collapsed debates, versus 92% for the dynamic Catfish, which avoids adversarial fatigue by appearing only when needed.]
[Figure: Speech act profiles -- the adversarial gradient. Each role has a target speech act distribution enforced via L1 distance monitoring. The Catfish peaks at 50% challenge acts -- higher than even the Skeptic (45%) -- and with only 5% support and 5% concession it is the most adversarial voice in the committee. Drift beyond 0.4 L1 distance triggers a recenter prompt.]
The permanent devil's advocate trap
Here is what we found:
| Condition | Collapse Events | Recovery Rate | Mean Robustness |
|---|---|---|---|
| Control (no detection) | 41 / 200 (20.5%) | -- | 6.2 / 10 |
| Static adversary | 12 / 200 (6.0%) | -- | 6.8 / 10 |
| Catfish (dynamic) | 38 / 200 (19.0%) | 92% (35/38) | 7.9 / 10 |
| Catfish + drift | 34 / 200 (17.0%) | 94% (32/34) | 8.4 / 10 |
The static adversary did exactly what you would expect: it reduced collapse from 20.5% to 6%. Fewer collapses, problem solved, right? But look at the robustness scores. The static adversary scored 6.8 -- barely above the control's 6.2, and well below the dynamic Catfish at 7.9. The permanent contrarian prevented collapse by creating constant friction, but the committee learned to treat that friction as noise.
The dynamic Catfish condition is the surprising one. It allowed collapse to happen at nearly the same rate as the control (19% vs 20.5%), but recovered 92% of those collapses. The debates that recovered scored 8.3 on robustness. The 3 unrecovered cases involved scenarios where the decision was genuinely clear-cut -- collapse was arguably the right outcome.
We weren't expecting the static adversary to underperform this badly. To understand why, we tracked engagement patterns.
Adversarial fatigue is real and measurable
We measured the word count of committee responses directed at the static contrarian versus responses to other roles:
| Round | Mean Words Responding to Contrarian | Mean Words Responding to Other Roles |
|---|---|---|
| 1 | 142 | 156 |
| 2 | 89 | 148 |
| 3 | 52 | 141 |
By round 3, committee members spent 63% fewer words engaging with the permanent contrarian compared to round 1, while engagement with other roles stayed flat. The committee learned that the contrarian always dissents, so each dissent carried less information. Responses became shorter, more dismissive, more perfunctory.
The CATFISH avoids this because it appears unexpectedly. The committee cannot develop antibodies against an intervention that arrives without warning, challenges the specific consensus that formed, and then disappears. You can ignore a voice that always says no. You cannot ignore a new voice that shows up precisely when you stopped arguing.
The adversarial gradient
This finding shaped how we think about the full committee structure. Counsel's six roles form a deliberate adversarial gradient, from constructive to confrontational:
- Advocate (constructive) -- argues for the proposed strategy, finds supporting evidence. Optimistic and opportunity-focused.
- Operator (pragmatic) -- focuses on execution feasibility and resource requirements. Starts balanced but becomes more adversarial as execution obstacles surface.
- Edge Case Hunter (lateral) -- identifies boundary conditions and minority scenarios. Phase participation: attack + crux only. Never sees initial divergent thinking; enters the debate cold.
- Skeptic (risk-focused) -- finds fatal flaws, failure modes, and hidden risks. Phase participation: diverge + attack + crux.
- Red Team (falsification) -- dedicated adversarial role with a unique success criterion: "your critique must be significant enough to change the final recommendation. Generic risks don't count. You fail if the synthesis proceeds unchanged despite your objections." Phase participation: attack + crux only.
- Catfish (anti-consensus) -- must disagree with the emerging consensus. Activated dynamically by the `CollapseDetector`. Phase participation: attack + crux when activated.
The phase participation asymmetry is deliberate and, we think, important. Edge Case Hunter and Red Team skip the diverge phase entirely. They enter cold during the attack phase, so their critiques are not anchored on the Advocate's initial framing. The Synthesizer participates only in the integrate phase, producing the final recommendation without having been part of the debate's social dynamics. Whether these structural choices actually improve output quality is something we haven't yet isolated experimentally -- for now, the intuition is that cognitive independence between roles matters, and phase gating is one way to enforce it.
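One way to encode that gating is a simple participation table, sketched here in Python. The Edge Case Hunter, Red Team, Catfish, and Synthesizer phases are as stated above; the Advocate and Operator are assumed to participate in the full debate:

```python
# Phase gating per role. The dict shape and role keys are illustrative.
PHASE_PARTICIPATION = {
    "advocate":         {"diverge", "attack", "crux"},
    "operator":         {"diverge", "attack", "crux"},
    "skeptic":          {"diverge", "attack", "crux"},
    "edge_case_hunter": {"attack", "crux"},  # enters cold, unanchored on the framing
    "red_team":         {"attack", "crux"},
    "catfish":          {"attack", "crux"},  # only while activated by collapse detection
    "synthesizer":      {"integrate"},       # writes the recommendation from outside
}

def active_roles(phase: str, catfish_activated: bool = False) -> list:
    """Roles allowed to speak in a given phase."""
    roles = [r for r, phases in PHASE_PARTICIPATION.items() if phase in phases]
    if not catfish_activated and "catfish" in roles:
        roles.remove("catfish")
    return roles
```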
How roles drift under pressure
Across 200 debates with drift detection enabled, we found systematic patterns in which roles drift and where they drift toward:
| Role | Drift Events | Direction | Recovery Rate |
|---|---|---|---|
| Skeptic | 24 (12%) | Toward Advocate (increased support) | 92% |
| Advocate | 6 (3%) | Toward neutral (decreased assertion) | 100% |
| Operator | 18 (9%) | Toward Skeptic (increased challenge) | 83% |
| Edge Case Hunter | 8 (4%) | Toward Advocate (decreased questioning) | 88% |
The Skeptic drifting toward the Advocate is the most common pattern, occurring in 12% of debates. This is consistent with the core sycophancy hypothesis: the model's training rewards agreement, and over multiple rounds the pressure accumulates. The Skeptic starts with a strong challenge profile but gradually shifts toward supportive speech acts.
The Operator drift was the result we didn't expect. The Operator starts balanced (0.30 assert, 0.25 question, 0.15 challenge) but over time increases its challenge rate. When we looked more closely, this made sense: the Operator's mandate is implementation feasibility, and the deeper it examines execution details, the more obstacles it discovers. Implementation-focused roles become more adversarial as they do their job well. Their drift is signal, not noise.
This creates a real problem for the drift detector. Its 83% recovery rate for Operator drift is lower than other roles because recenter prompts that push the Operator back toward less challenging behavior are fighting against legitimate analytical findings. We are exploring phase-dependent drift thresholds that would allow higher challenge rates in later rounds, though this adds complexity we haven't fully worked through yet. Crisis response scenarios also showed higher natural convergence rates, producing 6 false positive collapse detections -- another reminder that fixed thresholds (0.3 disagreement drop, 0.5 unjustified ratio) may need domain-specific tuning.
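One possible shape for those phase-dependent thresholds, sketched purely as illustration -- the schedule below is invented, not something we've validated:

```python
def drift_threshold(role: str, round_index: int, base: float = 0.4) -> float:
    """Loosen the Operator's drift threshold in later rounds, where rising
    challenge rates reflect legitimate execution findings."""
    if role == "operator":
        # Invented schedule: +0.05 per round after the first, capped at 0.6.
        return min(0.6, base + 0.05 * max(0, round_index - 1))
    return base
```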
One more caveat worth noting: the CATFISH role uses the same model as the collapsed role it replaces. It inherits that model's biases. Cross-provider catfish injection -- using a different LLM family for the adversarial voice -- could yield better results, but we haven't tested it. We also haven't tested interaction effects between collapse detection and adaptive termination, which could produce conflicting signals about when a debate should end.
What we take away from this
The central finding is counterintuitive enough to be worth stating plainly: a permanent devil's advocate makes multi-agent deliberation worse, not better. The committee develops adversarial fatigue -- it learns to dismiss the always-contrarian voice rather than engage with it. Dynamic injection, where the adversarial role appears only when collapse is detected, beat the static approach on recommendation robustness (7.9 vs 6.8), and adding drift enforcement widened the gap to 23% (8.4 vs 6.8).
Three practical takeaways:
First, collapse detection should probably be on by default. A 20% base collapse rate is high enough that undetected sycophantic collapse is the norm, not the exception, in multi-agent debate systems.
Second, pairing collapse detection with drift enforcement works better than either alone. The combined condition (Catfish + drift) produced the highest robustness score (8.4) and highest recovery rate (94%). Drift detection catches the gradual erosion of adversarial perspectives that precedes collapse, providing early warning before the acute event.
Third, and most broadly: sycophantic collapse is not a bug in individual models. It is an emergent property of multi-agent systems built from models that were trained to agree. The right place to address it is the orchestration layer -- detecting it as a system-level phenomenon and correcting it with system-level interventions. Whether the Catfish Protocol is the best such intervention is an open question. That some such intervention is necessary seems, based on these results, fairly clear.