Your knowledge base is probably too big

10 curated documents outperformed 200 unfocused ones by 2.5x. The retriever can't distinguish relevant from irrelevant when the corpus is noisy, and no retrieval trick fixes that.

Counsel Research · January 30, 2026 · 15 min read

You'd think giving your advisory committee more evidence would help. More documents, more data, more context -- surely that leads to better decisions. We assumed the same thing. Then we ran 100 structured debates with knowledge bases of varying sizes, and the data told us the opposite: a focused 10-document corpus achieved 0.86 retrieval precision, while a 200-document unfocused collection scored 0.35. The bigger library produced worse reasoning.

We started calling this the curation paradox. It violates a reasonable expectation -- that a superset of good evidence should be at least as good as the subset alone. The rest of this post is about why it happens, what our retrieval architecture can and can't do about it, and what we learned about how committee roles actually use evidence when they have it.

Why more documents make things worse

The explanation is architectural, and once you see it, it's hard to unsee.

The retrieval pipeline doesn't hand all 200 documents to the committee. It retrieves the top-k chunks ranked by cosine similarity to the query. When the corpus is focused, most chunks are relevant, and the top-k set is high quality. When the corpus is unfocused, irrelevant chunks compete for those same positions. A chunk about "Q3 marketing budget reallocation" may embed closer to a query about "pricing strategy" than a genuinely relevant chunk about "competitor pricing analysis" -- both share vocabulary about budgets and strategy.
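
To make that concrete, here's a toy sketch -- not Counsel code -- of the ranking mechanic, using the embedding model we run in development. The example strings are invented; the point is that both chunks land in the same semantic neighborhood and compete for the same top-k slots.

# A toy illustration of similarity-based ranking (not from the Counsel codebase)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What should our pricing strategy be for the enterprise tier?"
chunks = [
    "Q3 marketing budget reallocation: shifting spend toward enterprise campaigns",
    "Competitor pricing analysis: tiered per-seat pricing across the market",
]

# Cosine similarity over normalized embeddings -- the score the retriever
# ranks by. It rewards shared vocabulary and topic, not analytical relevance.
query_emb = model.encode(query, normalize_embeddings=True)
chunk_embs = model.encode(chunks, normalize_embeddings=True)
scores = util.cos_sim(query_emb, chunk_embs)[0]

for chunk, score in zip(chunks, scores):
    print(f"{float(score):.3f}  {chunk}")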

The retriever can't distinguish semantic proximity from analytical relevance.

This isn't a solvable embedding problem. It's a corpus curation problem.

How we try to retrieve the right things

The retrieval pipeline in counsel/corpus/retriever.py implements a CorpusRetriever class with two stages, and it's worth being honest about what each one can and can't do.

Stage 1: Vector similarity search. The query gets embedded via sentence transformer, then searched against the vector store (SQLite with sqlite-vec in development, PostgreSQL with pgvector in production). We over-fetch by default -- top_k * fetch_multiplier results, typically 3x the target count -- because we need candidates for the diversity pass. A min_score threshold (default 0.3) filters out chunks below minimum similarity.
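
In sketch form, stage 1 looks roughly like this (the store.search call and its result shape are placeholders, not the actual CorpusRetriever internals):

# A sketch of stage 1: over-fetch, then apply the similarity floor
def stage_one(store, query_embedding, top_k, fetch_multiplier=3, min_score=0.3):
    # Over-fetch so the diversity pass in stage 2 has candidates to discard
    candidates = store.search(query_embedding, limit=top_k * fetch_multiplier)
    # Drop anything below the minimum cosine similarity
    return [c for c in candidates if c.score >= min_score]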

Stage 2: Diversity filtering. This is a greedy MMR-like (Maximal Marginal Relevance) selection that handles one specific noise type well: near-duplicate chunks. Starting with the top result, it iterates through candidates and includes each one only if its maximum cosine similarity to all already-selected results is below diversity_threshold (default 0.85).

# From counsel/corpus/retriever.py
# `normalized` holds the candidates' unit-length embeddings in similarity
# order; the top result (index 0) is already in selected_indices/embeddings.
for i in range(1, len(results)):
    if len(selected_indices) >= target_k:
        break
    candidate_emb = normalized[i]
    # Highest similarity between this candidate and anything already kept
    max_similarity = 0.0
    for sel_emb in selected_embeddings:
        sim = float(np.dot(candidate_emb, sel_emb))
        max_similarity = max(max_similarity, sim)
    # Keep the candidate only if it is sufficiently different from every
    # chunk selected so far
    if max_similarity < self._diversity_threshold:
        selected_indices.append(i)
        selected_embeddings.append(candidate_emb)

This eliminates passages from the same document section that repeat the same information with minor variations. What it cannot do is the harder thing: filter out irrelevant but unique chunks that score above min_score due to vocabulary overlap. That's why precision is bounded by corpus focus, not by retrieval sophistication.

The retriever also supports multi-corpus search via search_multi_corpus, which searches each corpus independently and then applies diversity filtering globally across merged results. This turns out to matter operationally -- separating financial data from technical specifications at the corpus level eliminates cross-domain noise structurally, rather than hoping the retriever figures it out.
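
In sketch form, the flow is: search each corpus independently, merge, rank, then run one global diversity pass. The helper names here (vector_search, diversity_filter) are stand-ins for the real internals:

# A sketch of the multi-corpus flow, with stand-in helpers
def search_multi_corpus(retriever, query_embedding, corpus_ids, top_k):
    merged = []
    for corpus_id in corpus_ids:
        # Stage 1 runs per corpus, against that corpus's vectors only
        merged.extend(retriever.vector_search(corpus_id, query_embedding, top_k))
    # Rank the combined pool, then apply the diversity pass once, globally,
    # so near-duplicates that survive from different corpora still get pruned
    merged.sort(key=lambda r: r["score"], reverse=True)
    return retriever.diversity_filter(merged, target_k=top_k)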

How every citation traces back to a source document

One thing we wanted to get right early was traceability. Every citation in a Counsel debate traces to a specific document chunk through a deterministic SHA-256 hash chain, implemented across counsel/evidence/pack.py and counsel/core/types.py.

At the document level, documents receive SHA-256 content hashes at upload time, stored in the content_hash column of corpus_documents. The Document model enforces uniqueness per corpus via a UniqueConstraint("corpus_id", "content_hash"), which prevents duplicate ingestion.
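
A minimal sketch of that model, assuming SQLAlchemy; every field except corpus_id and content_hash is invented for illustration:

# Sketch of the document model; field names beyond corpus_id and
# content_hash are assumptions, not the actual schema
import hashlib

from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Document(Base):
    __tablename__ = "corpus_documents"
    __table_args__ = (UniqueConstraint("corpus_id", "content_hash"),)

    id = Column(Integer, primary_key=True)
    corpus_id = Column(Integer, nullable=False)
    content_hash = Column(String(64), nullable=False)  # SHA-256 of raw content

def content_hash_for(raw: bytes) -> str:
    # Computed at upload time; a re-upload of identical bytes hits the
    # unique constraint instead of creating a duplicate row
    return hashlib.sha256(raw).hexdigest()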

At the chunk level, each DocumentChunk inherits its parent document's identity and retains position metadata: chunk_index, page_number, and section_title. The chunk's embedding_id references its vector in the store.

At the evidence pack level, the EvidencePack model bundles sources with a composite hash computed by compute_evidence_hash -- sort sources by ID, serialize with sorted keys, compute SHA-256. The resulting hash follows the format sha256:<64 hex chars>. The validate_evidence_pack function verifies integrity: any modification to sources invalidates the hash.
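
The recipe is simple enough to sketch. The serialization details below (field layout, separators) are assumptions; the parts that matter -- sort by source ID, serialize with sorted keys, hash -- come straight from the description above:

# Sketch of the evidence-pack hashing recipe
import hashlib
import json

def compute_evidence_hash(sources: list[dict]) -> str:
    ordered = sorted(sources, key=lambda s: s["id"])   # sort sources by ID
    payload = json.dumps(ordered, sort_keys=True)      # serialize with sorted keys
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"                          # sha256:<64 hex chars>

def validate_evidence_pack(sources: list[dict], recorded_hash: str) -> bool:
    # Any modification to any source changes the recomputed digest
    return compute_evidence_hash(sources) == recorded_hash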

At the citation level, sources within the evidence pack use structured IDs (src_001, src_014) validated by regex pattern ^src_[0-9]{3}$. Each Source includes EvidenceLocator objects with fine-grained excerpt positioning: line_start, line_end, section identifier, excerpt text, and excerpt_hash.
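
A sketch of that locator shape, using dataclasses as a stand-in for the real models (whether excerpt_hash carries the same sha256: prefix is an assumption):

# Sketch of source-ID validation and excerpt locators
import hashlib
import re
from dataclasses import dataclass

SOURCE_ID_PATTERN = re.compile(r"^src_[0-9]{3}$")

@dataclass
class EvidenceLocator:
    line_start: int
    line_end: int
    section: str
    excerpt: str
    excerpt_hash: str

def make_locator(line_start: int, line_end: int, section: str, excerpt: str) -> EvidenceLocator:
    # The excerpt hash lets a reviewer verify the quoted text byte-for-byte
    digest = hashlib.sha256(excerpt.encode("utf-8")).hexdigest()
    return EvidenceLocator(line_start, line_end, section, excerpt, f"sha256:{digest}")

def is_valid_source_id(source_id: str) -> bool:
    return bool(SOURCE_ID_PATTERN.match(source_id))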

The complete chain: recommendation -> role argument -> EvidenceCitation with source_ids -> Source in the EvidencePack -> DocumentChunk with position metadata -> Document with content_hash. Each link is independently verifiable. The evidence pack is immutable once assembled -- its hash locks the evidence state for the entire debate. This sounds like over-engineering until the first time someone asks "where did that recommendation come from?" and you can answer with byte-level precision.

Experimental setup

We designed a 3-condition experiment using 100 debates distributed across 5 decision domains (product strategy, financial analysis, security architecture, hiring, technical migration):

  1. No knowledge base (n=34) -- the committee deliberates using model knowledge and the question prompt only
  2. Curated knowledge base (n=33) -- 5 to 25 documents selected for direct relevance
  3. Unfocused knowledge base (n=33) -- 50 to 200 documents including relevant material diluted with tangentially related content

Documents were drawn from a pool of 500 business documents (financial reports, technical specifications, market research, internal memos). Curated corpora contained only documents directly referenced in the debate question or its domain. Unfocused corpora included the curated set plus randomly sampled adjacent-domain documents.

We measured evidence citation rate (structured src_XXX references per role response), retrieval precision (fraction of retrieved chunks rated relevant by 2 human evaluators -- which introduces some subjectivity, especially for tangentially related chunks), factual grounding score (1-10 human evaluation), and role-specific utilization patterns.

Figure: Evidence Citation Rate With and Without Knowledge Base

Knowledge bases increase citation rates by 3-5x across all roles. Operator shows the highest absolute citation rate (6.2/response) while Edge Case Hunter shows the largest relative gain (5.6x increase).

Figure: Retrieval Precision by Corpus Size

Retrieval precision degrades sharply as corpus size grows beyond 25 documents. A curated corpus of 5-10 highly relevant documents (0.86-0.89 precision) dramatically outperforms a large unfocused collection (0.35 at 200 documents).

The numbers, and what surprised us

Knowledge-base-enabled debates produced dramatically more evidence citations, which we expected. What surprised us was how much the unfocused condition underperformed.

Condition                   | Citations/Response | Factual Grounding | Retrieval Precision
No knowledge base           | 1.1                | 6.2 / 10          | N/A
Curated KB (5-25 docs)      | 5.0                | 8.1 / 10          | 0.86
Unfocused KB (50-200 docs)  | 3.8                | 6.9 / 10          | 0.35

The curated condition averaged 5.0 citations per response (4.5x the no-KB baseline). The unfocused condition managed only 3.8 -- lower because retrieval surfaced marginally relevant chunks that roles chose not to cite.

But factual grounding tells the sharper story. Curated KBs scored 8.1/10, a 31% improvement over no-KB. Unfocused KBs scored only 6.9/10, an 11% improvement -- a fraction of the curated benefit despite containing a superset of the same documents. The unfocused corpus introduces retrieval noise that dilutes the committee's evidence base: more sources available, less precise reasoning from them.

There's a cliff at 25 documents

We initially assumed precision would decline linearly with corpus size. It doesn't. It exhibits a threshold pattern:

Corpus Size   | Mean Retrieval Precision | 95% CI
5-10 docs     | 0.89                     | [0.86, 0.92]
11-25 docs    | 0.82                     | [0.78, 0.86]
26-50 docs    | 0.62                     | [0.55, 0.69]
51-100 docs   | 0.44                     | [0.37, 0.51]
101-200 docs  | 0.35                     | [0.28, 0.42]

Precision is stable from 5 to 25 documents (0.82-0.89), then falls off a cliff. The knee is at approximately 25 documents. Above that, each additional document degrades precision faster than it adds evidence value.

An important caveat: corpus size and corpus focus are confounded in our unfocused condition. A supplementary test (n=8, so treat this directionally) with 50 uniformly relevant documents achieved 0.79 precision -- comparable to a 25-document curated corpus. The degradation is a function of signal-to-noise ratio, not corpus size per se. Large corpora degrade precision because they tend to contain off-topic material in practice, not because bigness is inherently bad.

The diversity filter removes near-duplicate noise but cannot distinguish relevant unique chunks from irrelevant unique chunks. Precision is fundamentally bounded by corpus focus. We also note that we're using a single embedding model (all-MiniLM-L6-v2 in development) -- different models may shift this curve. Query expansion and reranking stages, not yet implemented, could improve precision for unfocused corpora. But we expect the core finding to hold: retrieval sophistication can't fully compensate for poor curation, because the noise lives at the semantic level, not the lexical level.

Different roles use evidence differently

This one was fun to dig into. We disaggregated citation data across the four standard roles defined in counsel/core/types.py (RoleId enum: ADVOCATE, SKEPTIC, OPERATOR, EDGE_CASE_HUNTER), and each role has a surprisingly consistent fingerprint:

Role              | Citations/Response | What they cite
Operator          | 6.2                | Implementation details, timelines, resource constraints, technical specs
Skeptic           | 5.1                | Contradictory evidence, declining trends, assumption violations
Advocate          | 4.8                | Supporting metrics, growth data, favorable comparisons
Edge Case Hunter  | 3.9                | Outliers, cross-document connections, non-obvious correlations

Operators cite the most because their analytical function -- evaluating implementation feasibility -- is inherently data-dependent. Claims about timelines, resource requirements, and technical constraints demand grounding. Skeptics rank second because effective challenge requires specific counter-evidence, not abstract objections. The skeptic role guidance across templates consistently directs the role to find "contradictory evidence" and "assumption violations," both of which pull toward citation-heavy responses.

The Edge Case Hunter result is the most interesting. They cite the least per response but produce the most analytically novel citations. In 67% of knowledge-base debates, the Edge Case Hunter surfaced at least one document connection that no other role referenced -- typically a cross-domain link between documents that other roles treated independently. The edge_case_hunter role in general.yaml is instructed to "think about unusual scenarios and edge cases" and "explore creative alternatives." Those mandates reward lateral connections over exhaustive grounding, and the citation pattern reflects it.

Trust levels weight themselves

Counsel assigns trust levels to evidence sources via the TrustLevel enum in counsel/core/types.py: VERIFIED, HIGH, MEDIUM, LOW, UNVERIFIED. We wanted to understand how these levels affect the committee's weighting in final synthesis, so we measured relative emphasis:

Trust Level  | Relative Weight in Synthesis | Examples
VERIFIED     | 2.3x                         | SEC filings, audited reports, proprietary data
HIGH         | 1.7x                         | Analyst reports, named expert opinions
MEDIUM       | 1.2x                         | Industry publications, reputable blogs
LOW          | 0.7x                         | Anonymous posts, unattributed claims
UNVERIFIED   | 1.0x (baseline)              | Default for unclassified sources

Here's what's notable: this weighting is emergent, not hard-coded. There is no multiplier table in the synthesis prompt. The synthesis model naturally privileges higher-trust sources when they conflict with lower-trust sources. When the Advocate cites internal revenue data (VERIFIED, reliability_score near 1.0) and the Skeptic cites an industry blog post (MEDIUM), the synthesis gravitates toward the higher-trust source. The trust level metadata in the Source model provides the signal; the model's reasoning produces the weighting.
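
For reference, here's the shape of that metadata as a sketch -- the enum members come from counsel/core/types.py, but the string values and the extra Source fields are assumptions:

# Sketch of trust metadata on a source; values and extra fields are assumed
from dataclasses import dataclass
from enum import Enum

class TrustLevel(str, Enum):
    VERIFIED = "verified"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    UNVERIFIED = "unverified"

@dataclass
class Source:
    id: str                   # e.g. "src_001"
    trust_level: TrustLevel
    reliability_score: float  # near 1.0 for verified internal data
    excerpt: str

# The synthesis prompt sees this metadata alongside the excerpt; there is no
# multiplier table, so any weighting emerges from the model's own reasoning.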

We should be honest about what this observation is: it's based on synthesis output analysis, not on controlled experiments that manipulate trust levels while holding content constant. The 2.3x figure should be treated as observational, not causal.

Still, the practical implication is clear: corpus composition affects not just what the committee knows but how confidently it reasons. A corpus of VERIFIED internal documents produces recommendations with stronger epistemic grounding than a corpus of scraped web content, even when the factual content is similar.

Source types and evidence policy

Sources are typed via the SourceType enum: COMPETITOR_ANALYSIS, USER_FEEDBACK, MARKET_DATA, TECHNICAL_RESEARCH, INTERNAL_DATA, EXPERT_OPINION, WEB_SEARCH, SOURCE_CODE, ERROR_LOG, GIT_DIFF, DEPENDENCY_INFO, ARCHITECTURE. The evidence quality module in counsel/evidence/quality.py uses keyword-based category inference to classify sources and detect coverage gaps across primary categories (market data, user feedback, competitor analysis, internal data) and supporting categories (technical research, expert opinion).
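
A sketch of what keyword-based inference looks like -- the keyword lists here are illustrative, not the module's actual vocabulary:

# Sketch of keyword-based category inference and gap detection
PRIMARY_CATEGORIES = {
    "market_data": ["market size", "tam", "segment", "growth rate"],
    "user_feedback": ["survey", "nps", "interview", "churn"],
    "competitor_analysis": ["competitor", "rival", "benchmark"],
    "internal_data": ["internal", "revenue", "telemetry", "cohort"],
}

def infer_category(text: str) -> str | None:
    lowered = text.lower()
    for category, keywords in PRIMARY_CATEGORIES.items():
        if any(kw in lowered for kw in keywords):
            return category
    return None  # falls into the uncategorized bucket

def coverage_gaps(source_texts: list[str]) -> set[str]:
    covered = {infer_category(t) for t in source_texts}
    return set(PRIMARY_CATEGORIES) - covered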

The EvidencePolicy model in counsel/core/types.py enforces configurable constraints:

  • freshness_days_max (default 180): Maximum age of evidence in days
  • min_primary_sources (default 1): Minimum primary sources required
  • min_coverage_ratio (default 0.8): Required evidence coverage across categories

These policies operate at the system level. The score_evidence_pack function in counsel/evidence/quality.py scores the evidence pack on a 0-100 scale, penalizing for missing primary categories (-12 per gap), missing supporting categories (-6 per gap), and insufficient source diversity.
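
In sketch form, using only the penalty values above (the diversity penalty's threshold and size are assumptions):

# Sketch of the 0-100 evidence-pack scoring shape
PRIMARY = {"market_data", "user_feedback", "competitor_analysis", "internal_data"}
SUPPORTING = {"technical_research", "expert_opinion"}

def score_evidence_pack(categories_present: set[str], distinct_source_count: int) -> float:
    score = 100.0
    score -= 12 * len(PRIMARY - categories_present)      # -12 per missing primary category
    score -= 6 * len(SUPPORTING - categories_present)    # -6 per missing supporting category
    if distinct_source_count < 3:                        # diversity penalty (assumed threshold)
        score -= 10
    return max(score, 0.0)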

What actually makes a good corpus

We analyzed the 33 curated-KB debates to identify what separated the top quartile (grounding >= 9.0) from the rest, and three properties kept showing up.

Direct relevance, ruthlessly enforced. All documents were directly relevant to the decision question. No "nice to have" inclusions. The evidence quality scoring penalizes corpora where more than 50% of sources fall into the uncategorized bucket, but we found this too lenient -- even 30% off-topic documents degraded retrieval precision below the curated threshold.

At least one primary source. Corpora with raw data (financial figures, survey results, system metrics) rather than exclusively summaries scored 0.8 points higher on factual grounding. The EvidencePolicy.min_primary_sources default of 1 is the right instinct, but "at least one" undersells the benefit.

Conflicting perspectives. The corpus needs at least two documents with potentially opposing data or conclusions. This gives the Skeptic genuine counter-evidence to work with. A corpus of uniformly supportive documents produces a committee that agrees too quickly -- the Skeptic ends up citing the same data as the Advocate, and the crux phase has nothing substantive to resolve.

The sweet spot is 10-25 highly relevant documents. Below 10, the committee lacks evidence diversity. Above 25, retrieval noise increases faster than evidence value. Our sample of 100 debates limits how much we can say about individual domains (we'd want several hundred to make domain-specific claims with confidence), but this range held across all five domains we tested.

Takeaways

Four operational takeaways, and one open question.

Curate aggressively. 10 well-chosen documents outperform 200 poorly chosen ones by 2.5x on retrieval precision and 17% on factual grounding. The marginal cost of adding an irrelevant document is not zero -- it actively degrades retrieval quality for all queries by competing for top-k positions.

Include primary sources and conflicting data. Raw data produces higher grounding than summaries. Conflicting evidence produces more substantive crux phases. Both properties are enforced at minimum levels by EvidencePolicy but should be actively pursued beyond the minimums.

Separate corpora by decision domain. The search_multi_corpus method searches across multiple corpora with global diversity filtering. Domain-scoped corpora eliminate cross-domain noise at the structural level rather than relying on the retriever to sort it out.

Treat the evidence pack as immutable. The SHA-256 hash chain provides audit-grade traceability, but only if the pack isn't modified between assembly and synthesis. When underlying data changes, create a new evidence pack rather than patching the existing one -- the content hashes will detect stale documents and trigger re-indexing.

The open question: can smarter retrieval -- query expansion, learned rerankers, maybe even retrieval-aware chunking -- push that 25-document cliff further out? We think it can help at the margins. But the noise we're fighting is semantic, not lexical, and no amount of retrieval sophistication fully substitutes for someone deciding which documents actually matter before the debate begins.