Pipelines
This page documents paper-accurate pipeline configurations that reproduce benchmark results.
Together MoA (65.1% AlpacaEval)
The benchmark-winning configuration from Wang et al. (2024): 3 layers, 6 diverse proposers, and Qwen1.5-110B-Chat as the aggregator.
from mixture_llm import Propose, Synthesize, Aggregate
PROPOSERS = [
    "wizardlm-2-8x22b",
    "qwen1.5-110b-chat",
    "qwen1.5-72b-chat",
    "llama-3-70b-instruct",
    "mixtral-8x22b-instruct",
    "dbrx-instruct",
]

together_moa = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen1.5-110b-chat"),
]
How it works
- Layer 1 (Propose): All 6 models independently answer the query
- Layer 2 (Synthesize): Each of the 6 models sees all Layer-1 responses and writes an improved synthesis
- Layer 3 (Synthesize): Another synthesis round over the Layer-2 outputs
- Final (Aggregate): Qwen1.5-110B-Chat merges the Layer-3 outputs into a single answer
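Each synthesis step works by wrapping the previous layer's outputs in an aggregate-and-synthesize prompt. A minimal sketch of that prompt construction, with wording paraphrased from the prompt published in Wang et al. (2024); the build_synthesis_prompt helper is illustrative, not a mixture_llm API:

def build_synthesis_prompt(query: str, responses: list[str]) -> str:
    """Sketch of the aggregate-and-synthesize prompt (paraphrased from Wang et al., 2024)."""
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return (
        "You have been provided with a set of responses from various models "
        "to the latest user query. Your task is to synthesize these responses "
        "into a single, high-quality response. Critically evaluate the "
        "information, recognizing that some of it may be biased or incorrect.\n\n"
        f"Query: {query}\n\nResponses from models:\n{numbered}"
    )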
MoA-Lite (59.3% AlpacaEval)
Cost-optimized 2-layer variant that still beats GPT-4o while using ~3.5x fewer tokens.
moa_lite = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen1.5-72b-chat"),
]
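One way to see where the saving comes from is to count LLM calls per pipeline (one call per proposer per layer, plus the final aggregator call). The token saving is larger than the call saving because every synthesis layer also carries all previous responses in its input context:

# Rough call counts; token counts diverge further because synthesis
# layers include all prior responses in their input context.
together_moa_calls = 6 + 6 + 6 + 1  # 3 layers x 6 proposers + aggregator = 19
moa_lite_calls = 6 + 6 + 1          # 2 layers x 6 proposers + aggregator = 13
print(together_moa_calls, moa_lite_calls)  # 19 13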
MoA with GPT-4o Aggregator (65.7% AlpacaEval)
The highest-scoring configuration in Wang et al. (2024) swaps in GPT-4o as the final aggregator:
moa_gpt4o = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("gpt-4o"),
]
Aggregator selection
Ablations in the MoA literature suggest that aggregator choice moves final performance roughly twice as much as proposer choice. Invest in your aggregator model first.
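A practical corollary: hold the proposer stack fixed and A/B only the aggregator. The sketch below assumes a hypothetical run(pipeline, prompt) entry point, which this page does not define:

# Hypothetical comparison harness; `run` is an assumed entry point,
# not a documented mixture_llm API.
for agg in ["qwen1.5-110b-chat", "qwen1.5-72b-chat", "gpt-4o"]:
    pipeline = [
        Propose(PROPOSERS, temp=0.7, max_tokens=512),
        Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
        Aggregate(agg),
    ]
    # answer = run(pipeline, "Summarize the causes of the 2008 financial crisis.")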
Self-MoA (+6.6% over standard MoA)
Li et al. (2025) showed that sampling one top model multiple times can outperform diverse model mixtures by 6.6% on AlpacaEval.
# Same model, multiple samples via temperature
self_moa = [
    Propose(["gpt-5-nano-2025-08-07"] * 6, temp=0.7),
    Aggregate("gpt-5-nano-2025-08-07"),
]
When to use Self-MoA
Use Self-MoA when you have access to one very strong model. The key insight: mixing different LLMs often lowers average quality. A single excellent model sampled multiple times can beat a diverse group of weaker models.
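Because every stage uses the same model, the whole pipeline reduces to two knobs: which model, and how many samples. A small illustrative helper (not a library function) makes that explicit:

def build_self_moa(model: str, k: int = 6, temp: float = 0.7) -> list:
    """Illustrative: k temperature-diverse samples of one model, then aggregate."""
    return [
        Propose([model] * k, temp=temp),
        Aggregate(model),
    ]

pipeline = build_self_moa("gpt-5-nano-2025-08-07")  # same as self_moa above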
Self-MoA Sequential
For context-limited scenarios, Self-MoA-Seq aggregates a sliding window of a few samples at a time, carrying the running aggregate forward so the aggregator's context stays bounded:
# Iterative refinement for long contexts: each round folds a small
# window of fresh samples into the running aggregate.
self_moa_seq = [
    Propose(["gpt-5-nano-2025-08-07"] * 3, temp=0.7),  # window 1: 3 samples
    Aggregate("gpt-5-nano-2025-08-07"),
    Propose(["gpt-5-nano-2025-08-07"] * 3, temp=0.7),  # window 2: 3 fresh samples
    Aggregate("gpt-5-nano-2025-08-07"),                # final aggregate
]
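The two-round pipeline above generalizes to any number of rounds. A sketch of a builder (illustrative; it assumes each stage's output flows into the next, as in the pipelines on this page):

def build_self_moa_seq(model: str, rounds: int = 2, window: int = 3,
                       temp: float = 0.7) -> list:
    """Illustrative builder for a sliding-window Self-MoA-Seq pipeline."""
    stages = []
    for _ in range(rounds):
        stages.append(Propose([model] * window, temp=temp))  # fresh window of samples
        stages.append(Aggregate(model))                      # fold into running aggregate
    return stages

pipeline = build_self_moa_seq("gpt-5-nano-2025-08-07", rounds=2, window=3)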
Robust MoA
Add shuffle and dropout between propose and aggregate to reduce positional bias in the aggregator and to keep it from over-relying on any single proposer:
from mixture_llm import Shuffle, Dropout
robust_moa = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-3.3-70b", "gemini-2.5-flash"]),
    Shuffle(),     # randomize response order seen by the aggregator
    Dropout(0.2),  # randomly drop a fraction (0.2) of responses
    Aggregate("gpt-5-nano-2025-08-07"),
]
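Conceptually, Shuffle permutes the candidate responses so the aggregator cannot favor a fixed slot, and Dropout randomly discards a fraction of candidates so no single proposer dominates. A plain-Python sketch of the two transforms (illustrative; the library's actual implementation may differ):

import random

def shuffle_responses(responses: list[str]) -> list[str]:
    """Random permutation: removes positional bias in the aggregation prompt."""
    shuffled = responses[:]
    random.shuffle(shuffled)
    return shuffled

def dropout_responses(responses: list[str], p: float = 0.2) -> list[str]:
    """Drop each response with probability p, keeping at least one."""
    kept = [r for r in responses if random.random() >= p]
    return kept or [random.choice(responses)]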
Rank-then-Aggregate
Use an LLM to select the best responses before aggregating:
from mixture_llm import Rank
rank_aggregate = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-3.3-70b", "gemini-2.5-flash", "gpt-5-nano-2025-08-07"]),
    Rank("gpt-5-nano-2025-08-07", n=3),  # Keep top 3
    Aggregate("gpt-5-nano-2025-08-07"),
]
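The selection step itself is simple once the ranking model has scored the candidates; everything after the scoring is ordinary sorting. An illustrative sketch with hard-coded scores standing in for the judge's output:

# Keep the top-n candidates by judge score before building the
# aggregation prompt; scores here are hypothetical stand-ins.
responses = ["resp_a", "resp_b", "resp_c", "resp_d", "resp_e"]
scores = [7.5, 9.0, 6.0, 8.5, 7.0]

top3 = [r for _, r in sorted(zip(scores, responses), reverse=True)[:3]]
print(top3)  # ['resp_b', 'resp_d', 'resp_a']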
Vote (Consensus Selection)
For tasks with clear correct answers, use voting to find consensus:
from mixture_llm import Vote
vote_pipeline = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-3.3-70b"], temp=0.3),  # low temp for objective tasks
    Vote("gpt-5-nano-2025-08-07"),
]
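The Vote stage delegates consensus-finding to a model, but for answers that normalize to a short string (a number, a multiple-choice letter) the idea reduces to a majority count, sketched here:

from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Illustrative: pick the most common normalized answer."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

print(majority_vote(["42", " 42", "41"]))  # 42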
Refine Pipeline
Improve each response individually before aggregating:
from mixture_llm import Refine
refine_pipeline = [
    Propose(["gpt-5-nano-2025-08-07", "claude-haiku-4", "llama-3.1-8b"]),
    Refine(["gpt-5-nano-2025-08-07"]),  # GPT-5 Nano improves each response
    Aggregate("gpt-5-nano-2025-08-07"),
]
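Unlike synthesis, refinement prompts the refiner once per candidate and never merges candidates. A minimal prompt sketch (illustrative wording, not the library's actual prompt):

def build_refine_prompt(query: str, response: str) -> str:
    """Illustrative: one refinement prompt per candidate response."""
    return (
        "Improve the following response to the query: fix factual errors, "
        "tighten the writing, and preserve the original intent.\n\n"
        f"Query: {query}\n\nResponse:\n{response}"
    )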
Configuration Guidelines
| Use case | Recommended pipeline |
|---|---|
| Maximum quality | Together MoA (3 layers, 6 proposers) |
| Cost-effective | MoA-Lite (2 layers) |
| Single top model available | Self-MoA |
| Factual/objective tasks | Vote |
| Need diversity | Robust MoA with Shuffle + Dropout |
| Context-limited | Self-MoA Sequential |
Layer Depth Performance
Reported MoA results show diminishing returns beyond 3 layers:
| Layers | AlpacaEval 2.0 |
|---|---|
| 1 | ~44% |
| 2 | ~61% |
| 3 | ~65% |
| 4 | ~66% |
Recommendation: Use 3 layers for quality-critical applications, 2 layers for cost-sensitive ones.
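To experiment with depth directly, a layer-count builder keeps comparisons clean (an illustrative helper, not a library function; layers counts the propose/synthesize layers before the final aggregate):

def build_moa(proposers: list[str], aggregator: str, layers: int = 3) -> list:
    """Illustrative: 1 Propose layer, (layers - 1) Synthesize layers, final Aggregate."""
    stages = [Propose(proposers, temp=0.7, max_tokens=512)]
    stages += [Synthesize(proposers, temp=0.7, max_tokens=512) for _ in range(layers - 1)]
    stages.append(Aggregate(aggregator))
    return stages

three_layer = build_moa(PROPOSERS, "qwen1.5-110b-chat", layers=3)  # == together_moa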