Self-Consistency and Consensus
Abstract
Generate multiple candidate outputs and aggregate them via voting, consensus, or reranking to improve factuality and robustness over a single sample.
Motivation
- Reduce single-sample randomness and hallucinations
- Improve reliability for reasoning and structured tasks
Architectures
- Sample N generations with diverse decoding (seeds, temperatures) → take the majority answer
- Rerank with secondary model/scorer (retrieval consistency, citations)
- Committee-of-models or Mixture-of-Agents aggregation
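The first architecture above (sample N, then majority-vote) can be sketched as follows. This is a minimal illustration, not a definitive implementation: `generate` is a hypothetical caller-supplied hook standing in for any LLM client, and the normalization step (strip/lowercase) is a placeholder for task-specific answer extraction.

```python
from collections import Counter

def sample_candidates(prompt, generate, n=5, temperature=0.8):
    # `generate(prompt, temperature=..., seed=...)` is a hypothetical hook
    # for any LLM client; varying the seed helps decorrelate samples.
    return [generate(prompt, temperature=temperature, seed=i) for i in range(n)]

def majority_vote(answers):
    # Normalize trivially before counting; real tasks usually need
    # task-specific answer extraction (e.g., parsing the final number).
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)
```

The returned vote share doubles as a rough confidence signal: a low share suggests the samples disagree and the answer may warrant reranking or escalation.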
Design Choices
- Number of samples vs. latency/cost
- Voting rules (majority, confidence-weighted)
- Reranker selection and features (BM25, cross-encoder)
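Among the voting rules listed, confidence-weighted voting replaces the uniform count with a per-sample weight. One common choice, sketched here under the assumption that the model reports a sequence log-probability per sample, is to weight each answer by `exp(logprob)`:

```python
import math
from collections import defaultdict

def weighted_vote(answers, logprobs):
    # Weight each candidate answer by its sequence probability
    # (exp of the model-reported log-probability), then pick the
    # answer with the largest total weight.
    scores = defaultdict(float)
    for ans, lp in zip(answers, logprobs):
        scores[ans.strip().lower()] += math.exp(lp)
    return max(scores, key=scores.get)
```

Any scalar confidence (a reranker score, a verifier probability) can be substituted for the log-probability without changing the structure.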
Pros/Cons
- Pros: Better reliability; simple to implement
- Cons: Higher cost/latency; risk of correlated errors
Evaluation Metrics
- Accuracy vs. baseline; error rate reduction
- Agreement rate among samples; calibration
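The agreement-rate metric above can be computed as the fraction of sample pairs that agree, which serves as a cheap proxy for model confidence on an item (a minimal sketch; normalization is again task-dependent):

```python
from itertools import combinations

def agreement_rate(answers):
    # Fraction of unordered sample pairs whose normalized answers match.
    # 1.0 means unanimous agreement; values near chance suggest the
    # item is hard or the samples are poorly calibrated.
    normalized = [a.strip().lower() for a in answers]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```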
Vendor/Tooling
- Orchestration frameworks (LangChain, LlamaIndex)
- Retrieval/rerankers from Hugging Face ecosystem
Design Checklist
- Set budget for N and stopping criteria
- Ensure diversity (seeds/temps/prompts)
- Validate with task-specific rubrics
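The budget and stopping-criteria items in the checklist can be combined into adaptive sampling: keep drawing samples only until the leading answer's vote share clears a threshold. A minimal sketch, assuming the same hypothetical `generate` hook as before:

```python
from collections import Counter

def adaptive_consensus(prompt, generate, max_n=10, threshold=0.7, min_n=3):
    # Sample up to max_n candidates, stopping early once the leading
    # answer holds at least `threshold` of the votes (after min_n draws).
    # Returns (answer, samples_used) so cost can be tracked per item.
    answers = []
    for i in range(max_n):
        answers.append(generate(prompt, seed=i).strip().lower())
        if len(answers) >= min_n:
            answer, count = Counter(answers).most_common(1)[0]
            if count / len(answers) >= threshold:
                return answer, len(answers)
    answer, _ = Counter(answers).most_common(1)[0]
    return answer, len(answers)
```

Easy items stop at `min_n` samples while contested items spend the full budget, which typically lowers average cost at a fixed accuracy target.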
References
- Title: Self-Consistency Improves Chain of Thought Reasoning in Language Models. URL: https://arxiv.org/abs/2203.11171. Publisher/Vendor: arXiv. Version_or_release: 2022-03 (preprint). Accessed: 2025-08-14.
- Title: Mixture-of-Agents (MoA) style approaches (survey). URL: https://arxiv.org/abs/2402.05120. Publisher/Vendor: arXiv. Version_or_release: 2024-02 (preprint). Accessed: 2025-08-14.