# Research
This library implements techniques from recent research on multi-LLM collaboration.
## Key Papers

### Mixture-of-Agents (2024)

Wang et al., "Mixture-of-Agents Enhances Large Language Model Capabilities," arXiv:2406.04692.
The foundational MoA paper introduced the layered architecture in which each layer's agents receive the previous layer's outputs as auxiliary reference information (a minimal sketch of this flow follows the findings below). Key findings:
- 65.1% on AlpacaEval 2.0 using only open-source models (vs GPT-4o's 57.5%)
- LLMs exhibit "collaborativeness"—they generate better responses when given reference outputs
- Aggregator quality has 2x more impact than proposer quality (coefficients 0.588 vs 0.281)
- Optimal configuration: 3 layers, 6 proposers
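To make the layered flow concrete, here is a minimal sketch, not this library's API. It assumes an OpenAI-compatible chat backend purely as an example; the `call_llm` and `aggregate_prompt` helpers, prompt wording, and defaults are illustrative placeholders.

```python
# Minimal Mixture-of-Agents sketch (illustrative only, not this library's API).
# Assumes an OpenAI-compatible backend; swap `call_llm` for your own client.
from typing import List
from openai import OpenAI

client = OpenAI()  # assumption: OPENAI_API_KEY or a compatible endpoint is configured

def call_llm(model: str, prompt: str, temperature: float = 0.0) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def aggregate_prompt(query: str, references: List[str]) -> str:
    # Each layer sees the user query plus the previous layer's outputs
    # as auxiliary reference responses to synthesize from.
    refs = "\n\n".join(f"[Response {i + 1}]\n{r}" for i, r in enumerate(references))
    return (
        "Synthesize a single, higher-quality answer to the query below, "
        "drawing on the candidate responses.\n\n"
        f"Query: {query}\n\nCandidate responses:\n{refs}"
    )

def mixture_of_agents(query: str, proposers: List[str], aggregator: str,
                      layers: int = 3) -> str:
    # Layer 1: each proposer answers the raw query independently.
    responses = [call_llm(m, query, temperature=0.7) for m in proposers]
    # Intermediate layers: proposers refine, conditioned on the prior layer's outputs.
    for _ in range(layers - 2):
        responses = [call_llm(m, aggregate_prompt(query, responses), temperature=0.7)
                     for m in proposers]
    # Final layer: a single aggregator synthesizes the last set of responses.
    return call_llm(aggregator, aggregate_prompt(query, responses))
```

With `layers=3` this amounts to two proposer rounds followed by one aggregation call, matching the paper's default depth.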
### Self-MoA (2025)

Li et al., "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?" arXiv:2502.00674.
This paper challenged the "diversity is better" assumption:
- Self-MoA outperforms standard MoA by 6.6% on AlpacaEval 2.0
- Single top model sampled multiple times beats diverse model mixtures
- MoA performance is highly sensitive to proposer quality
- Mixing different LLMs often lowers average quality
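Self-MoA swaps the diverse proposer pool for repeated samples from one strong model. A minimal sketch, reusing the hypothetical `call_llm` and `aggregate_prompt` helpers from the sketch above (the sample count and temperature are illustrative defaults):

```python
def self_moa(query: str, model: str, n_samples: int = 6,
             temperature: float = 0.7) -> str:
    # In-model diversity: sample the same model several times at non-zero
    # temperature instead of mixing different models.
    proposals = [call_llm(model, query, temperature=temperature)
                 for _ in range(n_samples)]
    # The same model then aggregates its own samples into a single response.
    return call_llm(model, aggregate_prompt(query, proposals))
```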
## Key Findings

### Layer Depth
Performance improves with depth but with diminishing returns:
| Layers | AlpacaEval 2.0 | Gain (pts) |
|---|---|---|
| 1 | ~44% | — |
| 2 | ~61% | +17 |
| 3 | ~65% | +4 |
| 4 | ~66% | +1 |
Recommendation: 3 layers is the Pareto-optimal configuration.
### Number of Proposers
More proposers help, but gains diminish:
| Proposers | AlpacaEval 2.0 |
|---|---|
| 1 | 47.8% |
| 2 | 58.8% |
| 3 | 58.0% |
| 6 | 61.3% |
Recommendation: 4-6 proposers for quality-critical applications.
### Model Roles
Different models excel at different roles:
| Model | As Aggregator (AlpacaEval 2.0) | As Proposer (AlpacaEval 2.0) |
|---|---|---|
| GPT-4o | 65.7% | — |
| Qwen1.5-110B-Chat | 61.3% | 45.8% |
| WizardLM-8x22B | 52.9% | 56.9% |
| LLaMA-3-70B-Instruct | Good | Good |
Key insight: Some models are better proposers than aggregators (WizardLM), while others excel at both (Qwen, LLaMA).
### Aggregation vs. Selection
MoA performs sophisticated synthesis rather than simple selection:
- BLEU score analysis shows positive correlation (0.15-0.30) between aggregator output and best proposer
- Aggregators incorporate elements from multiple proposals
- This outperforms LLM-ranker baselines by ~20 percentage points
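If you want to run a similar check on your own outputs, n-gram overlap between the aggregated response and each proposal is a cheap starting point. A rough sketch using the `sacrebleu` package (the strings are placeholders for real model outputs):

```python
# Rough "synthesis vs. selection" check: near-100 BLEU against one proposal and
# near-0 against the rest would indicate copying; moderate overlap with several
# proposals is consistent with synthesis. Requires: pip install sacrebleu
from typing import List
import sacrebleu

def overlap_with_proposals(aggregated: str, proposals: List[str]) -> List[float]:
    # sentence_bleu(hypothesis, references) returns a BLEUScore; .score is 0-100.
    return [sacrebleu.sentence_bleu(aggregated, [p]).score for p in proposals]

# Placeholder data; substitute outputs from your own MoA runs.
proposals = ["first candidate answer", "second candidate answer", "third candidate answer"]
aggregated = "final synthesized answer"
print(overlap_with_proposals(aggregated, proposals))
```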
### Diversity vs. Quality Tradeoff
The Self-MoA paper revealed a nuanced picture:
- Cross-model diversity (mixing different LLMs) can hurt if it lowers average quality
- In-model diversity (sampling one model with temperature) provides sufficient variation
- Use heterogeneous MoA when models have complementary strengths
- Use Self-MoA when one model is clearly superior
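Neither paper prescribes a selection rule, but the tradeoff can be turned into a rough heuristic over whatever quality scores you trust for your candidate models. The 5-point margin below is an arbitrary illustrative threshold:

```python
from typing import Dict

def choose_strategy(scores: Dict[str, float], margin: float = 5.0) -> str:
    # If one model clearly dominates, in-model diversity (Self-MoA) tends to
    # win; otherwise a mixed pool may exploit complementary strengths.
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] - ranked[1] >= margin:
        return "self-moa"
    return "mixed-moa"

print(choose_strategy({"model-a": 64.0, "model-b": 57.0, "model-c": 55.5}))  # -> self-moa
```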
## Optimal Configurations
| Objective | Configuration |
|---|---|
| Maximum quality | 3 layers, 6 diverse proposers, best aggregator |
| Cost-effective | 2 layers (MoA-Lite) |
| Single top model | Self-MoA |
| Low latency | Single layer with strong aggregator |
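Tying the table back to the earlier sketches, usage might look like the following; the model identifiers are placeholders, not recommendations, and the functions are the illustrative ones defined above, not this library's API.

```python
# Placeholders: substitute models available on your backend.
PROPOSERS = ["model-a", "model-b", "model-c", "model-d", "model-e", "model-f"]
QUERY = "Summarize the tradeoffs between RAID 5 and RAID 10."

# Maximum quality: 3 layers, 6 diverse proposers, strongest model as aggregator.
best = mixture_of_agents(QUERY, PROPOSERS, aggregator="model-a", layers=3)

# Cost-effective (MoA-Lite-style): 2 layers with the same pool.
lite = mixture_of_agents(QUERY, PROPOSERS, aggregator="model-b", layers=2)

# Single top model: Self-MoA with repeated sampling.
solo = self_moa(QUERY, model="model-a", n_samples=6)
```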
## Limitations

- High Time-to-First-Token (TTFT): The final aggregator cannot begin producing output until all earlier layers have completed
- Cost: Multiple LLM calls per query (e.g., a 3-layer, 6-proposer configuration makes 6 calls in each of the two proposer layers plus one final aggregation call, roughly 13 calls in total)
- Complexity: More moving parts than single-model inference
## Future Directions
Active research areas mentioned in the papers:
- Dynamic routing: Query-based routing to cost-performance optimal LLMs
- Adaptive depth: Early stopping when consensus is reached (a rough consensus check is sketched after this list)
- Streaming: Chunk-wise aggregation to reduce TTFT (up to 93% reduction reported)
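As an illustration of the adaptive-depth idea, a layer loop could stop early once a layer's responses largely agree. A library-agnostic sketch using only the standard library; the 0.9 threshold is arbitrary, and embedding-based similarity would be a more robust signal:

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List

def consensus_reached(responses: List[str], threshold: float = 0.9) -> bool:
    # Mean pairwise string similarity as a cheap convergence signal.
    pairs = list(combinations(responses, 2))
    if not pairs:
        return True
    mean_sim = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    return mean_sim >= threshold

# Inside a layered loop, e.g.:
#     if consensus_reached(responses):
#         break  # skip remaining layers and aggregate now
```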
## Citation
If you use this library in research, please cite the original papers:
```bibtex
@article{wang2024mixture,
  title={Mixture-of-Agents Enhances Large Language Model Capabilities},
  author={Wang, Junlin and Wang, Jue and Athiwaratkun, Ben and Zhang, Ce and Zou, James},
  journal={arXiv preprint arXiv:2406.04692},
  year={2024}
}

@article{li2025rethinking,
  title={Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?},
  author={Li, Wenzhe and others},
  journal={arXiv preprint arXiv:2502.00674},
  year={2025}
}
```