
Peak quality at three overrides

We expected more customization to always improve output. Across 120 debates, quality peaked at exactly 3 configuration overrides, then degraded. The cause: parameter conflicts users can't predict.

Counsel Research · February 4, 2026 · 9 min read

How many knobs should you turn? We assumed the answer was "all of them" -- that giving users fine-grained control over every aspect of a deliberation would yield proportionally better outcomes. Across 120 structured debates, we found the opposite: quality peaks at exactly 3 overrides, then falls off a cliff.

This is a story about a non-monotonic curve we did not expect, the parameter conflicts that cause it, and a Bayesian termination model that emerged from trying to fix one of the symptoms.

The more-is-better intuition is wrong

Counsel exposes five categories of per-debate overrides: committee roles, model bindings, phase guidance, token budgets, and termination conditions. Each is useful on its own. Each makes the system more configurable. The natural assumption is that combining them compounds the benefit.

It does -- up to a point. From a baseline of 6.2/10 with zero overrides, quality climbs steadily to 8.5 at three overrides. Then it reverses: 8.0 at four, 7.6 at five. The drop from three to five overrides erases nearly all the gain from zero to three.

Why? Because the parameters are not independent. They interact, and some combinations create contradictions the user cannot easily foresee.

Configuration in Counsel resolves through three layers of precedence:

  1. Global config (counsel.yaml) -- baseline defaults for all debates
  2. Template defaults -- domain-tuned parameters from one of 18 decision templates
  3. Per-debate overrides -- user-specified, highest precedence

A template like compliance_audit.yaml pairs a 0.85 agreement threshold with 3 attack rounds and 4 crux rounds -- enough runway to actually reach that high bar. The parameters were designed together. Raw overrides from scratch lack that coherence, which is why template + 1-2 overrides (8.7) outperforms 5 raw overrides (7.8) even when the total number of customized parameters is comparable.
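
To make the layering concrete, here is a minimal sketch of the resolution order, assuming a flat dict merge where later layers win. The field names and helper function are illustrative, not Counsel's actual schema or resolver.

```python
# Illustrative sketch of three-layer precedence resolution.
# Field names and the merge strategy are assumptions, not Counsel's real schema.

def resolve_config(global_cfg: dict, template_cfg: dict, overrides: dict) -> dict:
    """Merge the three layers; later layers take precedence."""
    resolved = dict(global_cfg)     # 1. counsel.yaml baseline
    resolved.update(template_cfg)   # 2. template defaults (domain-tuned, coherent)
    resolved.update(overrides)      # 3. per-debate overrides win
    return resolved

# A template's parameters are designed together; a single targeted override
# keeps that coherence instead of rebuilding everything from scratch.
baseline = {"agreement_threshold": 0.70, "attack_rounds": 2, "crux_rounds": 2}
template = {"agreement_threshold": 0.85, "attack_rounds": 3, "crux_rounds": 4}
overrides = {"agreement_threshold": 0.90}

config = resolve_config(baseline, template, overrides)
```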

The shortcut that actually works

Before diving into individual overrides, it helps to understand a mechanism that sits above them: cost tiers. These are four pre-defined configurations that set roughly 8 parameters in a single declaration via resolve_tier_config in counsel/core/cost_tiers.py.

| Tier | Round Budget (D-A-C-I) | Model Class | Budget Cap | Adaptive Termination |
| --- | --- | --- | --- | --- |
| QUICK | 1-1-1-1 | Economy (haiku / 4o-mini / flash) | $0.50 | No |
| STANDARD | 1-2-3-1 | Standard (sonnet / 4o / pro) | $2.00 | No |
| DEEP | 2-3-4-1 | Premium (sonnet / 4o / pro) | $5.00 | Yes |
| EXHAUSTIVE | 2-4-5-1 | Premium (sonnet / 4o / pro) | $15.00 | Yes |

The design insight here is internal consistency. DEEP's 2-3-4-1 round budget is calibrated for the adaptive termination model it enables. QUICK's 1-1-1-1 budget is calibrated for economy models that produce less nuanced reasoning per round. A CostGuard enforces budget limits at runtime, logging a warning at 80% utilization and terminating the debate if the cap is exceeded.

A user who overrides round counts to 2-4-5-1 but leaves the model tier at economy -- or sets premium models with a $0.50 budget cap -- creates exactly the kind of configuration that tiers were designed to prevent. Selecting DEEP is almost always safer than manually assembling the same parameters.
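
As a rough illustration of that internal consistency, the tier presets can be pictured as rows of coupled parameters plus a CostGuard-style budget check. The data structure and function below are assumptions for illustration, not the actual resolve_tier_config or CostGuard implementation.

```python
# Hypothetical sketch of tier presets; values mirror the table above, but the
# structure is an assumption, not counsel/core/cost_tiers.py itself.
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    rounds: tuple       # round budget per phase (D, A, C, I), as in the table
    model_class: str    # "economy", "standard", or "premium"
    budget_cap: float   # USD cap enforced at runtime
    adaptive: bool      # adaptive termination enabled?

TIERS = {
    "QUICK":      TierConfig((1, 1, 1, 1), "economy",  0.50,  False),
    "STANDARD":   TierConfig((1, 2, 3, 1), "standard", 2.00,  False),
    "DEEP":       TierConfig((2, 3, 4, 1), "premium",  5.00,  True),
    "EXHAUSTIVE": TierConfig((2, 4, 5, 1), "premium",  15.00, True),
}

def check_budget(spent: float, cap: float) -> str:
    """CostGuard-style check: warn at 80% utilization, stop at the cap."""
    if spent >= cap:
        return "terminate"
    if spent >= 0.8 * cap:
        return "warn"
    return "ok"
```

Selecting a tier picks an entire row at once, which is exactly what prevents the mismatched combinations described above.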

What moves the needle most

We ran a 6-condition experiment: 120 debates across four decision domains (product strategy, security review, investment analysis, resource allocation), with 20 debates per single-override condition and 60 additional debates testing combined overrides (1 through 5 overrides, n=12 each, selection randomized but balanced across categories). Three blinded evaluators scored each output on actionability, relevance, and analytical depth (1-10 scale, composite = unweighted mean, Krippendorff's alpha = 0.81).

Override Impact on Output Quality (n=120 debates)

Phase Guidance overrides yield the highest quality improvement at +41%, followed by Committee Roles at +31%. Token Budget overrides show the smallest isolated effect.

Override Count vs Diminishing Returns

Quality peaks at 3 overrides (8.5) before declining. Beyond 3 overrides, configuration complexity begins to hurt output quality, suggesting a sweet spot of 2-3 targeted overrides per debate.

The single-override results surprised us in their spread:

| Override Category | Mean Quality | Delta vs Baseline | Effect Size (Cohen's d) |
| --- | --- | --- | --- |
| None (baseline) | 6.2 | -- | -- |
| Token budgets | 6.6 | +0.4 | 0.28 |
| Termination conditions | 7.1 | +0.9 | 0.64 |
| Model bindings | 7.4 | +1.2 | 0.82 |
| Committee roles | 8.1 | +1.9 | 1.31 |
| Phase guidance | 8.7 | +2.5 | 1.74 |

Phase guidance alone -- a well-written 2-3 sentence block in phases.diverge or phases.attack -- lifted quality by 41%. That is a Cohen's d of 1.74, a massive effect. If you override only one thing, this is the one.

The ordering reveals something we think is genuinely useful: what the committee thinks about (guidance, roles) matters far more than how long or how much it thinks (termination, tokens). A well-aimed committee with default depth beats a poorly-aimed committee with maximum depth. (A caveat: phase guidance has a larger configuration surface area than, say, token budgets, which may partly inflate its measured effect.)

Committee roles came second at +31%. Renaming "Critical Analyst" to "Red Team Lead" with security-focused guidance fundamentally changes what the model prioritizes. Model bindings (+19%) affect depth and creativity per seat through role-model pairings. Termination conditions (+15%) control debate depth. Token budgets (+6%) barely matter -- the defaults handle roughly 90% of cases.
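
For concreteness, a 2-override configuration combining the two highest-leverage categories might look roughly like this. The key names are illustrative, not Counsel's exact override schema.

```python
# Hypothetical per-debate override: phase guidance plus a committee-role
# reframe, the two highest-leverage categories above. Key names are
# illustrative, not Counsel's actual override format.
overrides = {
    "phases": {
        "attack": (
            "Stress-test the leading option against insider-threat and "
            "supply-chain scenarios; name the single assumption that, "
            "if wrong, invalidates the recommendation."
        ),
    },
    "committee": {
        "critical_analyst": {
            "name": "Red Team Lead",
            "guidance": "Prioritize exploitable gaps over stylistic critique.",
        },
    },
}
```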

Where it falls apart

When overrides are combined, the curve tells a clean story -- until it doesn't:

| Override Count | Mean Quality | 95% CI |
| --- | --- | --- |
| 0 | 6.2 | [5.8, 6.6] |
| 1 | 7.0 | [6.5, 7.5] |
| 2 | 7.9 | [7.4, 8.4] |
| 3 | 8.5 | [8.0, 9.0] |
| 4 | 8.0 | [7.4, 8.6] |
| 5 | 7.6 | [6.9, 8.3] |

We expected more customization to always help. It doesn't.

We analyzed the 5-override condition and found that 9 of 12 debates contained at least one parameter conflict -- a pair of overrides whose combined effect was contradictory. Two patterns dominated.

The convergence trap. The most common conflict: an agreement threshold of 0.85 or higher paired with 3 or fewer max total rounds. The threshold is checked in TerminationConfig against the crux stability score each round. When the committee cannot reach 0.85 agreement within 3 rounds, premature termination fires (max_total_rounds exhausted before agreement_threshold is satisfied). Synthesis runs on unresolved cruxes from the CruxRegistry, the DecisionCard hedges, recommendations go generic, and evaluators score low on actionability.

The compression squeeze. Guidance instructs the committee to "provide specific implementation timelines with rollback checkpoints," but TokenBudgets.attack is set to 600 tokens (versus the default 3,500). The model simply cannot comply. Under hard_limit enforcement, the response is cut off mid-analysis. Under truncate_with_warning, it completes but loses the depth the guidance demanded.

Both conflicts share the same root cause: two parameters pulling in opposite directions with no mechanism to detect the contradiction at configuration time. This is a design gap we have not yet closed, though it points toward future work on conflict detection in the config resolver.
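
Both patterns are mechanical enough that a configuration-time lint could flag them. Below is a minimal sketch, assuming a flat resolved config with the field names used in this post; the check is hypothetical and not part of the current resolver, and the 1,500-token cutoff is an illustrative heuristic.

```python
# Sketch of a configuration-time conflict lint for the two patterns above.
# Field names mirror this post's examples; the function is hypothetical,
# not an existing part of Counsel's config resolver.

def lint_conflicts(cfg: dict) -> list[str]:
    warnings = []

    # Convergence trap: a high agreement bar with too few rounds to reach it.
    if cfg.get("agreement_threshold", 0.0) >= 0.85 and cfg.get("max_total_rounds", 99) <= 3:
        warnings.append(
            "agreement_threshold >= 0.85 with max_total_rounds <= 3: "
            "likely premature termination on unresolved cruxes."
        )

    # Compression squeeze: detailed attack-phase guidance with a tiny token budget.
    guidance = cfg.get("phase_guidance", {}).get("attack", "")
    if guidance and cfg.get("token_budget_attack", 3500) < 1500:  # cutoff is illustrative
        warnings.append(
            "attack-phase guidance set but token_budget_attack < 1500: "
            "the model may not have room to comply."
        )

    return warnings
```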

Teaching the system when to stop

One of the symptoms we kept seeing in longer debates was wasted rounds -- the committee circling arguments it had already resolved. For DEEP and EXHAUSTIVE tiers, we built an adaptive termination model that learns convergence from the debate itself.

The ConvergenceTracker in counsel/core/convergence.py uses a Beta-Binomial model with Bayesian updating:

Prior: Beta(1, 1) -- uniform, no initial bias toward convergence or divergence.

Each round: The tracker receives three signals: agreement_score (0-1, from crux stability), drift_distance (cosine distance from the drift monitor), and novelty_count (new points this round). A round is classified as convergent if all three conditions hold: agreement > 0.6, drift < 0.1, novelty < 3 new points.

Update: A convergent round increments alpha by 1; a productive (non-convergent) round increments beta by 1. The posterior mean P(converged) = alpha / (alpha + beta).

Termination fires when two conditions are met simultaneously: (1) the posterior exceeds 0.85 (configurable), and (2) the last N rounds (configurable convergence_window, default 2) were all convergent.

The dual criterion matters. The posterior alone is sticky upward -- it could stay above 0.85 after several convergent rounds followed by a genuinely productive one. Requiring the last N rounds to also be convergent ensures termination only fires on sustained convergence, not merely historical likelihood.
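
Condensed into code, the tracker looks roughly like the sketch below. The thresholds mirror the values above, but the class is a simplified approximation, not counsel/core/convergence.py verbatim.

```python
# Simplified sketch of the Beta-Binomial convergence tracker described above.
# Thresholds match the post; the implementation details are an approximation,
# not the actual ConvergenceTracker.
from collections import deque

class ConvergenceTracker:
    def __init__(self, posterior_threshold: float = 0.85, convergence_window: int = 2):
        self.alpha = 1.0                       # Beta(1, 1) prior: no initial bias
        self.beta = 1.0
        self.posterior_threshold = posterior_threshold
        self.recent = deque(maxlen=convergence_window)

    def observe_round(self, agreement: float, drift: float, novelty: int) -> None:
        # A round is convergent only if all three signals say so.
        convergent = agreement > 0.6 and drift < 0.1 and novelty < 3
        if convergent:
            self.alpha += 1.0                  # evidence for convergence
        else:
            self.beta += 1.0                   # evidence the debate is still productive
        self.recent.append(convergent)

    @property
    def p_converged(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)

    def should_terminate(self) -> bool:
        # Dual criterion: high posterior AND a full window of convergent rounds.
        window_full = len(self.recent) == self.recent.maxlen
        return (
            self.p_converged > self.posterior_threshold
            and window_full
            and all(self.recent)
        )
```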

In QUICK and STANDARD tiers, this model is disabled (adaptive=False). Those tiers run fixed rounds for predictable cost, sacrificing early-stop capability. We have not yet tested the model under extreme configurations -- very high convergence thresholds with very low round budgets -- and its behavior there remains an open question.

Recommendations

Start from the closest template. Templates contribute +1.4 quality points over raw defaults. Even an imperfect domain match outperforms building from scratch because of internal parameter consistency.

Override 1-3 parameters, no more. Peak quality is at 3. Beyond that, conflict risk rises sharply. If you find yourself reaching for a fourth override, reconsider whether a different template would be a better starting point.

Phase guidance first. At +41%, it is the single highest-leverage configuration change. A 2-3 sentence instruction in the attack or diverge phase outperforms every other individual knob.

Use cost tiers instead of manual round tuning. Selecting DEEP gives you pre-calibrated rounds, model tiers, budget caps, and adaptive termination in one move.

Leave token budgets alone. They produced the smallest quality improvement and are the most likely to conflict with phase guidance. Override them only when you observe truncated outputs.

Open questions

The optimal override count of 3 is an average across four domains. The right number for security reviews may differ from investment analyses -- our 120-debate sample is large enough to see the curve but not large enough to decompose it by domain with confidence. Quality evaluation, even with blinded multi-rater scoring and an alpha of 0.81, retains inherent subjectivity.

The non-monotonic curve also raises a design question we have not answered: should the system warn users when they are approaching or crossing the cliff? Detecting parameter conflicts at configuration time is tractable for the two patterns we identified, but the general case -- arbitrary interactions between five override categories -- is harder.

And then there is the termination model's convergence threshold. We set it at 0.85 across both DEEP and EXHAUSTIVE tiers, but there is no strong reason it should be the same for both. An EXHAUSTIVE debate with a $15 budget cap can afford more rounds of verification. A tier-dependent threshold might be the right next step.

The clearest takeaway is also the simplest: pick a good template, write sharp phase guidance, and resist the urge to configure everything. The system already has well-calibrated defaults. The best results come from steering it, not rebuilding it.