Same model, same blind spots
When one LLM plays both sides of a debate, up to 74% of the analytical vocabulary is shared between opposing roles. The real diversity premium isn't in peak quality -- it's in consistency.
Imagine hiring five consultants for a critical strategic decision, then discovering they all graduated from the same program, read the same papers, and trained under the same mentor. They would disagree with each other -- consultants always do -- but their disagreements would trace the same fault lines, use the same frameworks, and share the same blind spots. You would get the appearance of diverse counsel without the substance.
This is what happens when you assign five roles to one LLM.
We built Counsel's multi-model committee system on the hypothesis that genuine analytical diversity requires different reasoning engines, not just different prompts. To test this, we needed to measure something subtle: not whether agents disagree (they always will, when instructed to), but whether their disagreement reflects genuinely different ways of thinking about a problem.
What we found was more interesting than we expected. The diversity premium is real, but it does not show up where most people would look.
What "theatrical disagreement" actually looks like
A Skeptic and an Advocate backed by the same model will oppose each other on cue. The Skeptic says "this approach carries significant risk" while the Advocate says "the opportunity outweighs the risk." So far, so good. But look closer at their reasoning and something uncomfortable emerges: identical argument structures, the same three-point evidence pattern, the same categories of supporting references, and -- critically -- the same blind spots.
We call this theatrical disagreement. The opposition lives in the conclusions, not in the analysis. When both sides of a debate share identical training data, RLHF objectives, and reasoning priors, their blind spots become invisible to the entire deliberation. The debate performs rigor, but what it delivers is one model's thinking wearing two different hats.
How we measured this
We repurposed Counsel's existing novelty detector (counsel/novelty/detector.py) as a disagreement quality metric. It implements four layers of checking, originally designed to prevent stale arguments across debate rounds -- but it turns out to work well for measuring whether opposing roles are genuinely thinking differently.
Layer 1: Structural gate. Every round, every role must contribute at least min_new_points (default 2) novel insights. Outputs without new points are rejected outright.
Layer 2: Exact hash dedup. SHA-256 of the full output catches the rare case where an LLM produces literally identical text.
Layer 3: Token overlap. This is the core measurement. The detector extracts specific argument fields per phase -- in the attack phase, that means steelman.strengthened_argument, attack.fatal_flaw, attack.failure_scenario, and crux.statement. It tokenizes these into word sets and computes an overlap ratio: len(current & previous) / len(previous).
Here is the critical design choice: citation fields are deliberately excluded. The detector skips source_ids, supporting_evidence, disconfirming_evidence, and new_evidence via pattern matching. This means two responses that cite different sources but make the same argument are correctly flagged as overlapping. We want novelty in reasoning, not just in bibliography. (This is a word-level heuristic, not a semantic similarity measure -- two responses can use different words for identical reasoning, or the same words for different reasoning. We address this limitation below, but the signal has proven useful despite the noise.)
Layer 4: Stance validation. In the crux phase, a role with an unchanged stance must provide new evidence, new concessions, or declare a final position. No coasting.
The key insight was computing token overlap between the Advocate's and Skeptic's attack-phase outputs, rather than between rounds of the same role. High overlap between opposing roles means they are reaching opposite conclusions via the same analytical path -- the signature of theatrical disagreement.
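As a rough sketch of that Layer 3 computation -- with flattened field names and toy outputs standing in for the real nested debate schema, so this is illustrative rather than the detector's actual code:

```python
import re

# Citation-carrying fields are skipped so the overlap measures reasoning, not bibliography.
CITATION_FIELDS = {"source_ids", "supporting_evidence", "disconfirming_evidence", "new_evidence"}

def argument_tokens(output: dict) -> set:
    """Word tokens from every non-citation field of one role's output (flattened here)."""
    tokens = set()
    for field, value in output.items():
        if field in CITATION_FIELDS:
            continue
        tokens |= set(re.findall(r"[a-z0-9']+", str(value).lower()))
    return tokens

def overlap_ratio(current: dict, previous: dict) -> float:
    """len(current & previous) / len(previous), computed over argument-field tokens."""
    cur, prev = argument_tokens(current), argument_tokens(previous)
    return len(cur & prev) / len(prev) if prev else 0.0

# Toy attack-phase outputs for two opposing roles (illustrative, not real debate data).
advocate = {
    "strengthened_argument": "migration derisks scaling before the Q3 demand spike",
    "source_ids": ["S1", "S4"],
}
skeptic = {
    "fatal_flaw": "migration before Q3 concentrates scaling risk in a single quarter",
    "source_ids": ["S2"],
}

# High values between roles with opposed mandates signal theatrical disagreement.
print(round(overlap_ratio(skeptic, advocate), 2))
```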
The binding architecture
Each committee role has a two-tier model binding: a primary and a fallback, specified in counsel.yaml:
```yaml
advocate:
  primary:  { provider: anthropic, model: claude-sonnet-4-5-20250929 }
  fallback: { provider: anthropic, model: claude-3-5-haiku-latest }
skeptic:
  primary:  { provider: openai, model: gpt-4o }
  fallback: { provider: openai, model: gpt-4o-mini }
edge_case_hunter:
  primary:  { provider: google, model: gemini-1.5-pro }
  fallback: { provider: google, model: gemini-1.5-flash }
```
One detail that matters more than it might seem: the fallback is always from the same provider as the primary. When Claude falls back, it falls back to Claude Haiku, not to GPT. This preserves the committee's diversity profile during outages. A Claude role silently falling back to GPT would collapse exactly the cross-provider tension the binding was designed to create.
We compute a diversity score as unique_providers / total_roles. A 5-role committee with 3 providers scores 0.60; a single-provider committee scores 0.20. This metric is surfaced in debate metadata and can serve as a quality gate.
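As a minimal sketch (assuming a plain dict-of-bindings shape rather than Counsel's actual config objects), the score is just a set count over primary providers:

```python
def diversity_score(bindings: dict) -> float:
    """unique primary providers / total roles."""
    providers = {b["primary"]["provider"] for b in bindings.values()}
    return len(providers) / len(bindings)

bindings = {
    "advocate":         {"primary": {"provider": "anthropic", "model": "claude-sonnet-4-5-20250929"}},
    "operator":         {"primary": {"provider": "anthropic", "model": "claude-sonnet-4-5-20250929"}},
    "skeptic":          {"primary": {"provider": "openai",    "model": "gpt-4o"}},
    "red_team":         {"primary": {"provider": "openai",    "model": "gpt-4o"}},
    "edge_case_hunter": {"primary": {"provider": "google",    "model": "gemini-1.5-pro"}},
}

print(diversity_score(bindings))               # 0.6 -- three providers across five roles
assert diversity_score(bindings) >= 0.4        # example of using the score as a quality gate
```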
The experiment
We constructed 150 decision scenarios across five domains (strategic planning, technical architecture, resource allocation, risk assessment, market entry -- 30 each) and ran each through four binding conditions with identical role definitions, guidance, and DACI protocol parameters:
- All-Claude: 5 roles on Claude Sonnet (diversity: 0.20)
- All-GPT: 5 roles on GPT-4o (diversity: 0.20)
- 2-provider: Advocate + Operator on Claude, Skeptic + Red Team on GPT-4o, Edge Case Hunter on Claude (diversity: 0.40)
- 3-provider: Advocate + Operator on Claude, Skeptic + Red Team on GPT-4o, Edge Case Hunter on Gemini 1.5 Pro (diversity: 0.60)
The Synthesizer used Claude Opus in all conditions to control for synthesis quality. Temperature 0.7, maximum 3 DACI rounds, 0.75 convergence threshold. Three blinded raters scored robustness and actionability (0-10 each; Cohen's kappa = 0.81).
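For concreteness, the four arms can be summarized as role-to-provider maps (a compressed, hypothetical view of the counsel.yaml bindings used in each condition, not the actual experiment harness):

```python
ROLES = ["advocate", "operator", "skeptic", "red_team", "edge_case_hunter"]

CONDITIONS = {
    "all_claude":     dict.fromkeys(ROLES, "anthropic"),
    "all_gpt":        dict.fromkeys(ROLES, "openai"),
    "two_provider":   {"advocate": "anthropic", "operator": "anthropic", "edge_case_hunter": "anthropic",
                       "skeptic": "openai", "red_team": "openai"},
    "three_provider": {"advocate": "anthropic", "operator": "anthropic",
                       "skeptic": "openai", "red_team": "openai",
                       "edge_case_hunter": "google"},
}

for name, assignment in CONDITIONS.items():
    print(name, round(len(set(assignment.values())) / len(assignment), 2))
# all_claude 0.2, all_gpt 0.2, two_provider 0.4, three_provider 0.6
```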
Figure: Theatrical vs. substantive disagreement -- token overlap after citation removal.
Same-provider committees show 66-74% token overlap between opposing roles after citation removal -- the signature of theatrical disagreement. Cross-provider committees drop to 26-38%, indicating genuinely different reasoning patterns. Measured using Counsel's novelty detector with citation fields excluded.
Figure: Diversity score vs. quality improvement and variance reduction.
Diversity score = unique providers / total roles. Quality improves with diminishing returns (largest jump at 0.20→0.40). The underappreciated benefit: variance drops 3x, making outcomes more predictable. A 3-provider committee is not just better on average -- it's more reliable.
The overlap numbers were stark
| Condition | Advocate-Skeptic Overlap | Advocate-Red Team Overlap | Mean Novel Points per Debate |
|---|---|---|---|
| All-Claude | 0.74 | 0.68 | 6.2 |
| All-GPT | 0.71 | 0.66 | 7.1 |
| 2-provider | 0.38 | 0.31 | 16.4 |
| 3-provider | 0.31 | 0.26 | 19.2 |
Single-provider committees showed 0.71-0.74 token overlap between the Advocate and Skeptic -- roles with diametrically opposed mandates. Nearly three-quarters of the analytical vocabulary is shared. They "disagree" using the same words, the same sentence structures, the same categories of evidence. The disagreement is in the polarity of the conclusion, not in the reasoning that produces it.
Cross-provider committees cut this to 0.31-0.38 -- a 48-57% reduction. When Claude's Advocate builds a case emphasizing confidence intervals and multi-factor tradeoffs, GPT-4o's Skeptic attacks through systematic logical decomposition. The disagreement is in the analysis, not just the conclusion.
Novel point generation followed the same pattern: 2.9x more genuinely novel analytical contributions in the 3-provider condition (19.2 per debate) versus the single-provider average (6.65).
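These multipliers fall straight out of the tables above (a quick arithmetic check, not new data):

```python
single_provider_novel = (6.2 + 7.1) / 2              # 6.65 novel points per debate
print(round(19.2 / single_provider_novel, 1))        # 2.9 -> the 3-provider multiplier
print(round(16.4 / single_provider_novel, 1))        # 2.5 -> the 2-provider multiplier

single_provider_overlap = (0.74 + 0.71) / 2          # 0.725 Advocate-Skeptic overlap
print(round(1 - 0.31 / single_provider_overlap, 2))  # 0.57 reduction at 3 providers
print(round(1 - 0.38 / single_provider_overlap, 2))  # 0.48 reduction at 2 providers
```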
What the recommendations actually looked like
| Condition | Robustness | Actionability | Mean Rounds to Convergence |
|---|---|---|---|
| All-Claude | 5.8 / 10 | 5.6 / 10 | 2.1 |
| All-GPT | 6.1 / 10 | 6.0 / 10 | 2.0 |
| 2-provider | 6.9 / 10 | 7.1 / 10 | 2.6 |
| 3-provider | 7.5 / 10 | 7.2 / 10 | 2.9 |
The 3-provider committees took 0.85 more rounds to converge on average (2.9 vs. 2.05 for the homogeneous conditions) but produced roughly 26% higher robustness scores (7.5 vs. 5.95). The additional rounds were not wasted -- they were spent resolving genuine disagreements that homogeneous committees never surfaced. In 7.3% of debates (11/150), at least one fallback triggered due to provider rate limits. Mean quality for fallback debates was within 0.3 points of non-fallback debates, suggesting the same-provider fallback design holds up under real conditions.
The surprise was in the variance
This was the result we did not expect. We had been looking at mean quality, and the 2-to-3 provider jump seemed modest. Then we looked at the distribution:
| Condition | Mean Robustness | Std Dev | Min Score | Max Score |
|---|---|---|---|---|
| All-Claude | 5.8 | 2.1 | 2.1 | 9.3 |
| All-GPT | 6.1 | 1.9 | 2.4 | 9.1 |
| 2-provider | 6.9 | 1.4 | 3.8 | 9.4 |
| 3-provider | 7.5 | 1.1 | 5.2 | 9.6 |
Standard deviation dropped from 2.0 (single-provider average) to 1.1 (3-provider) -- a 45% reduction. But the number that jumped out at us was the minimum score: 5.2 for 3-provider, compared to floors of 2.1 and 2.4 for single-provider setups.
Single-provider committees are volatile. Sometimes they produce excellent output when the scenario aligns with the model's strengths. Sometimes they produce terrible output when the scenario hits a systematic blind spot. Multi-provider committees are consistent. The ceiling does not rise dramatically -- but the floor does. Single-provider committees scored below 4.0 in 18% of debates. The 3-provider committee never did, across all 150.
For high-stakes decisions where a bad recommendation is worse than a mediocre one, this matters more than mean quality. The diversity premium is in consistency, not ceiling.
Where you place the providers matters
We had a follow-up question: does it matter which roles get which providers, or is any cross-provider mix equally good? We tested two sub-configurations within the 2-provider condition (n=75 each):
- Adversarial-constructive split: Advocate + Operator on Claude, Skeptic + Red Team on GPT-4o
- Random assignment: Roles randomly assigned to providers (maintaining 2 each)
| Configuration | Mean Robustness | Crux Quality | Overlap (Adversarial vs. Constructive) |
|---|---|---|---|
| Adversarial-constructive split | 7.1 / 10 | 7.4 / 10 | 0.34 |
| Random assignment | 6.6 / 10 | 6.3 / 10 | 0.49 |
The split outperformed random assignment by 0.5 on robustness and 1.1 on crux quality. The intuition is straightforward: provider diversity matters most at the fault lines where disagreement is supposed to happen. The Advocate-Skeptic boundary and the Advocate-Red Team boundary are where genuine analytical tension produces the highest-value cruxes. Placing different providers on each side of those boundaries maximizes the probability that the tension is substantive rather than theatrical.
Diversity compounds through the research layer
There is another dimension to this that goes beyond the debate itself. Counsel's research system routes queries to different providers based on query type, creating provider-specific intelligence before the debate even starts:
| Query Type | Primary Providers | Rationale |
|---|---|---|
| Academic/citation-heavy | Perplexity Sonar, Gemini | Citation depth and verification |
| Code analysis | DeepSeek, Claude | Code comprehension and reasoning |
| Real-time data | Grok (X/Twitter search), ChatGPT (web) | Currency of information |
| Long documents | Kimi (Moonshot, 200K+ context) | Extended context window |
This means the evidence pack entering each debate already contains perspectives from multiple analytical traditions. When the Advocate (Claude) and Skeptic (GPT-4o) then analyze evidence gathered by Perplexity, Gemini, and DeepSeek, the full pipeline touches 5+ distinct reasoning systems. The diversity is not just in the debate -- it is in the evidence the debate consumes.
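A sketch of that routing, with a simplified query-type map standing in for Counsel's actual research configuration (the shape and provider keys here are assumptions, not the real API):

```python
# Illustrative routing table: query type -> preferred research providers.
RESEARCH_ROUTES = {
    "academic":  ["perplexity-sonar", "gemini"],  # citation depth and verification
    "code":      ["deepseek", "claude"],          # code comprehension and reasoning
    "realtime":  ["grok", "chatgpt-web"],         # currency of information
    "long_docs": ["kimi"],                        # 200K+ context window
}

def route_query(query_type: str) -> list:
    """Pick research providers for a query type, defaulting to the academic route."""
    return RESEARCH_ROUTES.get(query_type, RESEARCH_ROUTES["academic"])

print(route_query("code"))  # ['deepseek', 'claude']
```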
What remains open
Multi-provider committees come with real operational costs: three API keys, three rate limit budgets, three billing relationships, three points of failure. The fallback system mitigates availability risk but does not eliminate management overhead. Latency variance is 2.1x higher for 3-provider committees, since the slowest provider in each round determines wall-clock time. Cost averaged $0.42 per debate for single-provider and $0.48 for 3-provider -- a modest 14% premium driven by the additional convergence rounds, not pricing differences.
We should also be honest about what our overlap metric can and cannot tell us. It is a word-level heuristic. Two responses can use different words to express identical reasoning (low overlap, same substance) or the same words for genuinely different reasoning (high overlap, different substance). The 15% misclassification rate of the regex-based speech act classifier in our role drift system suggests similar limitations apply here. A semantic similarity measure would be more precise, and that is something we are actively exploring.
The bigger open question is whether the consistency finding generalizes beyond our five test domains. We suspect it does -- the mechanism (different training distributions producing different blind spots that cancel out in committee) seems domain-independent -- but we have not yet tested this at scale across specialized verticals like legal or medical decision-making.
So what
The minimum viable diversity is 2 providers. The jump from 1 to 2 produced the largest marginal gain on every dimension: 2.5x more novel points, +1.0 robustness, roughly 48% lower adversarial-constructive overlap. If you can only add one API key, this is where the value lives.
The third provider buys you something different: not dramatically better outcomes, but dramatically more predictable ones. Whether that trade is worth the operational complexity depends on your stakes.
And if you are going to invest in provider diversity, place it deliberately. The adversarial-constructive split -- different providers on opposite sides of the debate's fault lines -- outperforms random assignment by a meaningful margin. Diversity matters most where disagreement is supposed to happen.
The disagreement is only as real as the diversity that produces it. Measuring the reasoning footprint rather than the source footprint -- overlap after removing citations -- is how we distinguish substance from theatre. We think this framing, and the measurement methodology behind it, has legs beyond our specific system. Anywhere multiple LLM agents are asked to deliberate, the question of whether they are genuinely disagreeing or merely performing disagreement is worth asking.