
Research

This library implements techniques from recent research on multi-LLM collaboration.

Key Papers

Mixture-of-Agents (2024)

Wang et al. "Mixture-of-Agents Enhances Large Language Model Capabilities" arXiv:2406.04692

The foundational MoA paper introduced the layered architecture in which each layer's agents use the previous layer's outputs as auxiliary reference information (a code sketch of this flow follows the findings below). Key findings:

  • 65.1% on AlpacaEval 2.0 using only open-source models (vs GPT-4o's 57.5%)
  • LLMs exhibit "collaborativeness"—they generate better responses when given reference outputs
  • Aggregator quality has roughly twice the impact of proposer quality (coefficients 0.588 vs 0.281)
  • Optimal configuration: 3 layers, 6 proposers
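
For intuition, here is a minimal sketch of that layered flow in plain Python. It is not this library's API; `call_llm(model, prompt)` is a hypothetical helper standing in for whatever client you use.

```python
from typing import Callable, List

def moa_sketch(
    call_llm: Callable[[str, str], str],  # hypothetical: (model, prompt) -> completion text
    query: str,
    proposers: List[str],
    aggregator: str,
    layers: int = 3,
) -> str:
    """Layered MoA flow (Wang et al., 2024): proposer layers feed a final aggregator."""
    references: List[str] = []
    for _ in range(layers - 1):
        # Each proposer layer sees the previous layer's outputs as auxiliary references.
        prompt = query if not references else (
            query
            + "\n\nReference responses from other models:\n"
            + "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(references))
        )
        references = [call_llm(model, prompt) for model in proposers]
    # Final layer: a single aggregator synthesizes the last set of references.
    final_prompt = (
        query
        + "\n\nSynthesize a single, high-quality answer from these responses:\n"
        + "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(references))
    )
    return call_llm(aggregator, final_prompt)
```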

Self-MoA (2025)

Li et al. "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?" arXiv:2502.00674

This paper challenged the "diversity is better" assumption:

  • Self-MoA outperforms standard MoA by 6.6% on AlpacaEval 2.0
  • A single top model sampled multiple times beats diverse model mixtures (sketched after this list)
  • MoA performance is highly sensitive to proposer quality
  • Mixing different LLMs often lowers average quality
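
For contrast, a minimal Self-MoA sketch under the same assumptions (the hypothetical `call_llm` helper, here also assumed to accept a `temperature` argument): the single strongest model is sampled repeatedly and then aggregates its own outputs.

```python
def self_moa_sketch(call_llm, query: str, model: str, samples: int = 6) -> str:
    """Self-MoA (Li et al., 2025): in-model diversity via repeated sampling."""
    # Sample the same strong model several times at a non-zero temperature.
    proposals = [call_llm(model, query, temperature=0.7) for _ in range(samples)]
    # The same model then acts as its own aggregator.
    prompt = (
        query
        + "\n\nSynthesize the best possible answer from these candidate responses:\n"
        + "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(proposals))
    )
    return call_llm(model, prompt, temperature=0.0)
```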

Key Findings

Layer Depth

Performance improves with depth but with diminishing returns:

| Layers | AlpacaEval 2.0 | Gain |
|--------|----------------|------|
| 1      | ~44%           |      |
| 2      | ~61%           | +17% |
| 3      | ~65%           | +4%  |
| 4      | ~66%           | +1%  |

Recommendation: 3 layers is the Pareto-optimal configuration.

Number of Proposers

More proposers help, but gains diminish:

| Proposers | AlpacaEval 2.0 |
|-----------|----------------|
| 1         | 47.8%          |
| 2         | 58.8%          |
| 3         | 58.0%          |
| 6         | 61.3%          |

Recommendation: 4-6 proposers for quality-critical applications.

Model Roles

Different models excel at different roles:

| Model                 | As Aggregator | As Proposer |
|-----------------------|---------------|-------------|
| GPT-4o                | 65.7%         |             |
| Qwen1.5-110B-Chat     | 61.3%         | 45.8%       |
| WizardLM-8x22B        | 52.9%         | 56.9%       |
| LLaMA-3-70B-Instruct  | Good          | Good        |

Key insight: Some models are better proposers than aggregators (WizardLM), while others excel at both (Qwen, LLaMA).
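
As an illustration only (model names are taken from the table above, not a recommendation, and this is not this library's configuration schema), a role assignment reflecting these findings might look like:

```python
# Hypothetical role assignment based on the findings above:
# WizardLM serves only as a proposer; Qwen is strong in both roles.
moa_roles = {
    "proposers": [
        "Qwen1.5-110B-Chat",
        "WizardLM-8x22B",
        "LLaMA-3-70B-Instruct",
    ],
    "aggregator": "Qwen1.5-110B-Chat",
    "layers": 3,
}
```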

Aggregation vs Selection

MoA performs sophisticated synthesis rather than simple selection:

  • BLEU score analysis shows a positive correlation (0.15-0.30) between the aggregator's output and the best proposer (a rough version of this check is sketched after this list)
  • Aggregators incorporate elements from multiple proposals
  • This outperforms LLM-ranker baselines by ~20 percentage points
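
A rough way to run this kind of check yourself, assuming the sacrebleu package (the papers' exact evaluation pipeline may differ):

```python
# pip install sacrebleu
from sacrebleu import sentence_bleu

def proposal_overlap(aggregated: str, proposals: list[str]) -> list[float]:
    """BLEU of the aggregated answer against each individual proposal."""
    return [sentence_bleu(aggregated, [p]).score for p in proposals]

# Consistently non-zero overlap with several proposals suggests synthesis;
# overlap concentrated on exactly one proposal would look more like selection.
```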

Diversity vs Quality Tradeoff

The Self-MoA paper revealed a nuanced picture:

  • Cross-model diversity (mixing different LLMs) can hurt if it lowers average quality
  • In-model diversity (sampling one model with temperature) provides sufficient variation
  • Use heterogeneous MoA when models have complementary strengths
  • Use Self-MoA when one model is clearly superior

Optimal Configurations

| Objective        | Configuration                                   |
|------------------|--------------------------------------------------|
| Maximum quality  | 3 layers, 6 diverse proposers, best aggregator   |
| Cost-effective   | 2 layers (MoA-Lite)                              |
| Single top model | Self-MoA                                         |
| Low latency      | Single layer with strong aggregator              |
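
Connecting the table to the sketches above (same hypothetical `call_llm`, `moa_sketch`, and `self_moa_sketch` helpers; the model identifiers below are placeholders):

```python
# Placeholders: replace with your own models and client.
DIVERSE_MODELS = ["model-a", "model-b", "model-c", "model-d", "model-e", "model-f"]
BEST_MODEL = "model-a"
query = "Explain the difference between TCP and UDP."

# Maximum quality: 3 layers (two proposer rounds + aggregation), 6 diverse proposers.
best = moa_sketch(call_llm, query, proposers=DIVERSE_MODELS, aggregator=BEST_MODEL, layers=3)

# Cost-effective "MoA-Lite": 2 layers (one proposer round + aggregation).
lite = moa_sketch(call_llm, query, proposers=DIVERSE_MODELS[:4], aggregator=BEST_MODEL, layers=2)

# One clearly superior model: Self-MoA.
single = self_moa_sketch(call_llm, query, model=BEST_MODEL, samples=6)

# Low latency: skip proposer rounds entirely and call the strong model once.
fast = call_llm(BEST_MODEL, query)
```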

Limitations

  • High time-to-first-token (TTFT): no output can be streamed until every layer has completed
  • Cost: Multiple LLM calls per query
  • Complexity: More moving parts than single-model inference

Future Directions

Active research areas mentioned in the papers:

  • Dynamic routing: Query-based routing to cost-performance optimal LLMs
  • Adaptive depth: Early stopping when consensus is reached (a toy sketch follows this list)
  • Streaming: Chunk-wise aggregation to reduce TTFT (up to 93% reduction reported)
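
These are open problems rather than features of this library, but as a toy illustration of the adaptive-depth idea, one could stop adding proposer layers once successive layers stop changing much, a crude stand-in for "consensus":

```python
import difflib
from typing import Callable, List

def roughly_converged(prev: List[str], curr: List[str], threshold: float = 0.9) -> bool:
    """Toy consensus check: mean character-level similarity between successive layers."""
    ratios = [difflib.SequenceMatcher(None, a, b).ratio() for a, b in zip(prev, curr)]
    return sum(ratios) / len(ratios) >= threshold

def adaptive_depth_sketch(
    call_llm: Callable[[str, str], str],  # hypothetical helper, as above
    query: str,
    proposers: List[str],
    aggregator: str,
    max_layers: int = 4,
) -> str:
    references: List[str] = []
    for _ in range(max_layers - 1):
        prompt = query if not references else (
            query + "\n\nReference responses:\n" + "\n".join(references)
        )
        new_references = [call_llm(model, prompt) for model in proposers]
        stop = bool(references) and roughly_converged(references, new_references)
        references = new_references
        if stop:
            break  # early exit: proposer outputs have roughly converged
    return call_llm(
        aggregator,
        query + "\n\nSynthesize a single answer from:\n" + "\n".join(references),
    )
```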

Citation

If you use this library in research, please cite the original papers:

@article{wang2024mixture,
  title={Mixture-of-Agents Enhances Large Language Model Capabilities},
  author={Wang, Junlin and Wang, Jue and Athiwaratkun, Ben and Zhang, Ce and Zou, James},
  journal={arXiv preprint arXiv:2406.04692},
  year={2024}
}

@article{li2025rethinking,
  title={Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?},
  author={Li, Wenzhe and others},
  journal={arXiv preprint arXiv:2502.00674},
  year={2025}
}