Wrong templates are worse than no template at all
Mismatched decision templates score 9.4% below unstructured deliberation. The committee reasons rigorously about the wrong things.
We assumed any structure would beat no structure. That seemed safe enough -- give a committee explicit roles, phase guidance, and termination parameters, and it should outperform one that starts from a blank slate. The data told a different story.
When we ran 180 debates across six decision domains, correctly matched templates improved quality by 36% over unstructured deliberation. But deliberately mismatched templates -- a crisis response template applied to a compliance audit, an investment thesis template used for a hiring decision -- scored 9.4% below the unstructured baseline. Not just unhelpful. Actively harmful.
We call this failure mode analytical path divergence: the template's role guidance sends the committee down irrelevant analytical paths that consume the finite round budget without addressing the actual question. The committee reasons rigorously about the wrong things.
What templates actually encode
Each of Counsel's 15 active templates is a YAML file in counsel/counsels/templates/ that configures four parameter groups through the template_to_config_json conversion. These are not generic "be more structured" instructions. They encode genuine domain judgment.
Roles and guidance. Each committee seat (advocate, skeptic, operator, edge_case_hunter) receives a domain-specific name, description, perspective lens, and per-role analytical instructions. The build_vs_buy.yaml template renames the four seats to Build Champion, Buy Advocate, Implementation Lead, and Future State Analyst, each with 40-80 words of role-specific guidance:
```yaml
roles:
  advocate:
    name: "Build Champion"
    description: "Argues for internal development and control"
    perspective: "Ownership-focused, differentiation perspective"
    guidance: |
      Build the case for internal development.
      Identify differentiation and competitive advantage from ownership.
      Consider long-term cost curves and customization needs.
```
Phase instructions. Each DACI phase gets tailored prompts. The investment thesis template instructs diverge to "build competing theses with specific return projections and risk-adjusted scenarios." The crisis response template instructs attack to "focus ONLY on showstopper risks; ignore minor concerns." These are not interchangeable.
Termination conditions. Round counts per phase, total round maximums, and agreement thresholds -- encoding the depth-speed tradeoff specific to each decision type.
Global guidance. Instructions injected into every prompt for every role. The compliance audit template's global guidance: "All claims must be backed by verifiable evidence. Flag any regulatory risks immediately."
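Assembled from values quoted in this post, the non-role groups of a single template might look like the sketch below. This uses the crisis response profile; the field names are illustrative assumptions, not Counsel's confirmed schema:

```yaml
# Sketch of crisis_response's non-role parameter groups, assembled from
# values quoted in this post. Field names are illustrative assumptions,
# not Counsel's confirmed schema.
phases:
  attack: "Focus ONLY on showstopper risks; ignore minor concerns."
termination:
  attack_rounds: 1           # one round of adversarial challenge
  crux_rounds: 1             # one round of disagreement resolution
  max_rounds: 4              # hard ceiling on total rounds
  agreement_threshold: 0.60  # low consensus bar: speed over unanimity
global_guidance: |
  This is a CRISIS decision. Prioritize speed and reversibility.
  Focus on the immediate 24-72 hour window. Bias toward action.
```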
Fifteen templates, fifteen decision topologies
Counsel ships with 18 template YAML files. Fifteen are in active use: general, financial, engineering_review, product_strategy, market_entry, investment_thesis, hiring_decision, partnership, pricing_strategy, crisis_response, compliance_audit, build_vs_buy, healthcare, founder_exec, and org_change. The remaining three (security_review, technical_migration, resource_allocation) extend the set into more specialized domains.
Each encodes a distinct decision topology. The investment thesis template creates competing Bull Case / Bear Case analysts who construct financial return projections. The hiring decision template creates a Talent Champion / Risk & Culture Analyst pairing with a Market & Alternatives Scout who forces consideration of contractors, restructuring, and automation as alternatives to hiring. The crisis response template compresses the entire DACI cycle into a maximum of 4 rounds with a 0.60 agreement threshold -- encoding the operational reality that speed dominates thoroughness when the situation is deteriorating.
How we tested this
We designed a 5-condition experiment across 180 debates distributed over 6 decision domains: product strategy, security architecture, investment analysis, hiring, crisis response, and compliance.
The five conditions (n=36 debates per condition, 6 per domain):
- No template -- The unstructured baseline; custom role names and guidance specified manually
- Wrong template -- Deliberately mismatched (e.g., crisis response template for compliance audits)
- Closest template (unmodified) -- Best-matching template applied without overrides
- Closest template + 1-2 overrides -- Best-matching template with targeted parameter overrides
- Closest template + 3+ overrides -- Best-matching template with 3 or more overrides
Template-domain matching for the "wrong template" condition was randomized with one constraint: the selected template must not share a primary domain with the debate question. Three blinded evaluators scored actionability, relevance, and analytical depth on a 1-10 scale. Composite quality is the unweighted mean. Inter-rater reliability was 0.79 (Krippendorff's alpha).
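For reference, the composite score is the plain average of the three criteria, and Krippendorff's alpha takes its standard form:

$$Q = \tfrac{1}{3}\bigl(\text{actionability} + \text{relevance} + \text{depth}\bigr), \qquad \alpha = 1 - \frac{D_o}{D_e}$$

where $D_o$ is the observed disagreement among the three raters and $D_e$ is the disagreement expected by chance.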
[Figure: Template configuration profiles. Templates trade thoroughness for speed: Crisis Response uses minimal rounds (1/1) with a low agreement bar (0.60), while Compliance Audit demands deep analysis (3/4 rounds) and near-consensus (0.85).]
[Figure: Decision quality, template vs. no template (n=180 debates). The sweet spot is the closest matching template with 1-2 targeted overrides (8.7); using no template or the wrong template consistently underperforms, and excessive overrides (3+) erode the template advantage.]
The sweet spot, and the failure cliff
The headline result surprised us less than the shape of the curve. Template + 1-2 overrides was the clear winner. But the gap between "wrong template" and "no template" -- that was the finding we did not expect.
| Condition | Mean Quality | Delta vs No Template | 95% CI |
|---|---|---|---|
| Wrong template | 5.8 | -0.6 | [5.3, 6.3] |
| No template | 6.4 | -- | [5.9, 6.9] |
| Closest template (unmodified) | 7.6 | +1.2 | [7.1, 8.1] |
| Closest + 3+ overrides | 8.0 | +1.6 | [7.4, 8.6] |
| Closest + 1-2 overrides | 8.7 | +2.3 | [8.2, 9.2] |
The degradation from 3+ overrides relative to 1-2 overrides is consistent with our per-debate overrides study, where the override cliff appears at 3. Applying 3+ additional overrides on top of a template creates 6-8 total parameter customizations, well past the non-monotonic peak.
What goes wrong: the committee reasons rigorously about the wrong things
We analyzed the 36 wrong-template debates to understand the failure mode. The dominant pattern (28 of 36 debates) was analytical path divergence -- and the examples are striking.
Investment thesis template applied to a hiring decision. The template renames roles to Bull Case Analyst and Bear Case Analyst and instructs diverge to "build competing theses with specific return projections and risk-adjusted scenarios." Applied to a question like "Should we hire a VP of Engineering externally or promote from within?", the committee spent the diverge phase constructing financial return projections for the hire -- calculating a hypothetical NPV for an external versus an internal candidate. The framing entirely missed the core questions: team fit, role necessity, onboarding risk, and whether the hire should happen at all.
The hiring_decision template, by contrast, includes a Market & Alternatives Scout whose explicit guidance is to "consider alternative approaches: contractors, restructuring, automation" -- forcing the committee to question the premise before evaluating candidates.
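In the same shape as the build_vs_buy excerpt earlier, that scout role might be sketched like this -- the seat key and any wording beyond the quoted guidance are assumptions:

```yaml
# Sketch of hiring_decision's scout role, mirroring the build_vs_buy
# excerpt above. The seat key (edge_case_hunter) and all wording beyond
# the quoted guidance line are assumptions.
roles:
  edge_case_hunter:
    name: "Market & Alternatives Scout"
    description: "Questions whether the hire should happen at all"
    perspective: "Premise-challenging, alternatives-first"
    guidance: |
      Consider alternative approaches: contractors, restructuring, automation.
      Question the premise before evaluating candidates.
```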
This is what made the finding feel structural rather than statistical. Evaluators scored these misdirected debates low on relevance (mean 4.2/10) despite acceptable analytical depth (6.8/10). The analysis was thorough. It was just pointed at the wrong target.
Crisis response template applied to a compliance audit. The crisis template's global guidance reads: "This is a CRISIS decision. Prioritize speed and reversibility. Focus on the immediate 24-72 hour window. Bias toward action." Its skeptic guidance says: "Focus ONLY on catastrophic risks. Do not nitpick." Applied to a compliance audit, this produced recommendations that ignored non-catastrophic regulatory gaps -- precisely the gaps that compliance audits exist to find. The committee rushed to a decision in 2 rounds with 0.60 agreement, when the compliance_audit template would have allocated 3 attack rounds, 4 crux rounds, and required 0.85 agreement before termination.
Termination mismatch: a subtler failure mode
A secondary pattern emerged in 19 of 36 wrong-template debates, often co-occurring with path divergence: termination mismatch. The wrong template's round counts and agreement thresholds were simply miscalibrated for the decision type.
Templates encode domain judgment in their termination block, and these are not arbitrary numbers (the compliance block is sketched after this list):
- Crisis response: 0.60 agreement threshold -- perfect consensus matters less than immediate action
- Compliance audit: 0.85 threshold -- regulatory decisions must not proceed with significant unresolved disagreement
- Investment thesis: 0.70 threshold with 4 crux rounds -- financial disagreements benefit from extended debate but rarely reach full consensus
- Security review: 0.80 threshold -- security gaps need thorough resolution but some risk acceptance is realistic
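For contrast with the crisis_response sketch earlier, here is how the compliance audit's termination block might look -- with the same caveat that the field names are assumptions; the values come from the profile table below and the round limits discussed in the next paragraph:

```yaml
# compliance_audit termination sketch -- the thoroughness extreme.
# Field names are assumptions; values are the figures quoted in this post.
termination:
  attack_rounds: 3           # three rounds of adversarial challenge
  crux_rounds: 4             # extended disagreement resolution
  max_rounds: 14             # total round ceiling
  agreement_threshold: 0.85  # near-consensus required to terminate
```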
When the crisis template's 0.60 / 4-round limit is applied to a compliance audit, the debate terminates after insufficient resolution. When the compliance template's 0.85 / 14-round limit is applied to a crisis response, the committee deliberates exhaustively while the situation deteriorates. Both are bad outcomes, but they are bad in opposite directions -- which is precisely the point. The termination parameters are not tuning knobs. They are domain assertions.
How the templates differ in practice
Profiling the five most-used templates illustrates how much variation exists:
| Template | Attack Rounds | Crux Rounds | Agreement Threshold | Key Specialization |
|---|---|---|---|---|
| Crisis Response | 1 | 1 | 0.60 | Speed-biased skeptic (catastrophic risks only) |
| Investment Thesis | 3 | 4 | 0.70 | Competing Bull/Bear analysts with return projections |
| Compliance Audit | 3 | 4 | 0.85 | Audit trail enforcement, regulation-by-name citations |
| Security Review | 3 | 3 | 0.80 | Red Team Lead + Threat Modeler adversarial pairing |
| Hiring Decision | 2 | 2 | 0.70 | Forced alternatives analysis (contractors, restructuring) |
The crisis response template occupies the speed extreme: 1 attack round, 1 crux round, global guidance that says "bias toward action." The compliance audit occupies the thoroughness extreme: 3+4 rounds, 0.85 threshold, global guidance that says "all claims must be backed by verifiable evidence." A user who applies the wrong profile inherits termination parameters designed for a different decision topology entirely.
Picking the right template: a three-factor framework
Given that template selection is the highest-leverage configuration decision -- higher-leverage than refining the templates themselves, as our data shows -- we recommend matching on three factors in order.
Decision type first. If the decision is a build-vs-buy evaluation, use build_vs_buy.yaml. Exact type matches produced the highest quality scores (mean 8.1/10 for exact match vs 7.2/10 for adjacent match).
Urgency second. Time-sensitive decisions should prefer templates with compressed termination conditions (crisis_response, resource_allocation). When thoroughness matters more, prefer extended round counts (compliance_audit, investment_thesis).
Domain third. Regulated industries should prefer templates with higher agreement thresholds and evidence enforcement guidance (compliance_audit, healthcare). Technical decisions should prefer templates with implementation-focused roles (engineering_review, technical_migration, security_review).
When no template fits: start from general.yaml -- which provides balanced roles (Strategic Advocate, Critical Analyst, Implementation Expert, Edge Case Hunter) and neutral phase instructions -- then add 2-3 role overrides to introduce domain-specific perspectives. Our data shows this is strongly preferable to using a mismatched template. A general template with a couple of overrides lands around 7.9, well above a wrong template's 5.8 and comparable to an unmodified correct template's 7.6. When in doubt, General is the safe choice.
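Concretely, that fallback pattern might look like the sketch below, here for a compliance-sensitive technical migration; the override syntax and role contents are hypothetical illustrations, not Counsel's documented interface:

```yaml
# Sketch: start from general.yaml and layer two role overrides for a
# compliance-sensitive technical migration. The override syntax and
# role contents are hypothetical, for illustration only.
template: general   # Strategic Advocate, Critical Analyst,
                    # Implementation Expert, Edge Case Hunter
overrides:
  roles:
    skeptic:
      name: "Regulatory Risk Analyst"   # hypothetical domain lens
      guidance: "Flag compliance exposure and evidence gaps in every proposal."
    operator:
      name: "Migration Lead"            # hypothetical
      guidance: "Weigh cutover sequencing, rollback paths, and downtime risk."
```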
Connection to the override study
The interaction between templates and per-debate overrides follows the non-monotonic pattern documented in our override study. Template + 1-2 overrides (8.7) is the global optimum. Template + 3+ overrides (8.0) degrades due to parameter conflicts. No template + 5 raw overrides (7.8) performs worse than template + 1-2 overrides despite more total customization.
The explanation comes down to internal consistency. The compliance_audit.yaml template pairs its 0.85 agreement threshold with 3 attack rounds and 4 crux rounds that give the committee time to reach that high bar. Its global guidance demanding verifiable evidence supports the high threshold by producing evidence-grounded arguments that are easier to reach consensus on. Overriding 1-2 parameters on this base preserves the consistency. Overriding 5 on raw defaults lacks it.
Caveats
A few caveats worth flagging. Template selection in this study was performed by experimenters with domain knowledge. In production, users self-select templates, introducing selection error our design does not capture. The "wrong template" condition was deliberately adversarial -- real-world mismatches are likely less severe (a user probably will not apply crisis_response to a compliance audit), but probably more frequent (users may routinely pick an adjacent-but-not-best template without realizing it).
Template quality is also not uniform. Heavily used templates like crisis_response and investment_thesis have been refined through more iterations than less common ones like org_change and founder_exec. And the 18-template inventory does not cover all decision types. When users face decisions that span multiple domains -- say, a compliance-sensitive technical migration -- the optimal template may not exist at all. How well General performs as a fallback in these hybrid scenarios warrants dedicated study. So does automated template recommendation based on question analysis, which could reduce mismatches before they happen.
Implications
The central finding inverts a common assumption: structure can hurt. A wrong template is not merely unhelpful -- it actively degrades output quality below the unstructured baseline. Three things follow from this.
First, template selection accuracy matters more than template internal quality. Investing in better recommendation (or clearer template descriptions) produces more quality improvement than refining the templates themselves.
Second, the "start from General and override" strategy is underappreciated. It is the safe path when the right template is uncertain -- and uncertainty is the default in production.
Third, analytical path divergence is structural, not statistical. It does not average out over multiple debates or with better models. The template's role guidance deterministically steers the committee's analytical focus. If that focus is wrong, more rounds and better models simply produce more rigorous analysis of the wrong thing. This is perhaps the most important takeaway: the failure mode we identified is not one that scales away. It has to be solved at the selection layer.
We are exploring automated template matching as a next step -- using question analysis to suggest (or warn against) specific templates before a debate begins. Early prototypes are promising, but the work has really just begun.