Examples

The examples/ directory contains tested, runnable scripts demonstrating mixture-llm with different LLM providers and configurations.

Quick Reference

Example              Provider             Key Concepts                      Best For
openai_basic.py      OpenAI               Propose, Aggregate                Getting started
openai_self_moa.py   OpenAI               Self-MoA, single model sampling   Quality optimization
multi_provider.py    OpenAI + Anthropic   Provider routing, Shuffle         Production setups
openrouter_moa.py    OpenRouter           3-layer MoA, Synthesize           Paper reproduction
groq_free.py         Groq                 Free tier, Dropout, LiteLLM       Experimentation
with_history.py      Groq                 History inspection, Rank          Debugging & analysis

Setup

Install Dependencies

pip install mixture-llm[examples]

Or install specific providers:

pip install mixture-llm[openai]    # OpenAI + Anthropic
pip install mixture-llm[litellm]   # Groq via LiteLLM

Set API Keys

export OPENAI_API_KEY=sk-...           # OpenAI examples
export ANTHROPIC_API_KEY=sk-ant-...    # Multi-provider example
export OPENROUTER_API_KEY=sk-or-...    # OpenRouter example
export GROQ_API_KEY=gsk_...            # Groq examples (free)

OpenAI Basic

File: openai_basic.py

The simplest MoA implementation. Start here to understand the core pattern.

from mixture_llm import Propose, Aggregate, run

pipeline = [
    Propose(["gpt-5-nano-2025-08-07"] * 3, temp=0.7),  # 3 proposals via temperature
    Aggregate("gpt-5-nano-2025-08-07"),                 # Combine into final answer
]

result, history = await run(pipeline, query, openai_client)

What it demonstrates:

  • Basic 2-layer MoA (Propose → Aggregate)
  • Using temperature (0.7) to create diverse responses from the same model
  • Client function that returns (text, input_tokens, output_tokens), sketched below
  • Token usage and timing analysis
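
The openai_client passed to run() might look roughly like the following (a sketch using the official openai SDK; the actual example file may differ, and newer model families can require max_completion_tokens instead of max_tokens):

from openai import AsyncOpenAI

_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def openai_client(model, messages, temp, max_tokens):
    # Standard Chat Completions call; returns (text, input_tokens, output_tokens).
    resp = await _client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temp,
        max_tokens=max_tokens,
    )
    return (
        resp.choices[0].message.content,
        resp.usage.prompt_tokens,
        resp.usage.completion_tokens,
    )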

Run:

export OPENAI_API_KEY=sk-...
python examples/openai_basic.py

Self-MoA

File: openai_self_moa.py

Based on Li et al. (2025): sampling a single strong model multiple times outperforms mixing diverse, weaker models by +6.6% on AlpacaEval 2.0.

pipeline = [
    Propose(["gpt-5-nano-2025-08-07"] * 6, temp=0.7),  # 6 samples, one model
    Aggregate("gpt-5-nano-2025-08-07", max_tokens=1024),
]

What it demonstrates:

  • Self-MoA configuration from the research paper
  • Temperature creates diversity even with identical models
  • Inspecting individual proposals in the history (see the sketch below)
  • Quality over quantity approach
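
A minimal way to pull the individual proposals out of the returned history (a sketch; it assumes history entries follow the structure shown in the History Inspection section below):

result, history = await run(pipeline, query, openai_client)

propose_step = next(h for h in history if h["step"] == "Propose")
for i, proposal in enumerate(propose_step["outputs"], start=1):
    print(f"--- Proposal {i} ---\n{proposal[:200]}\n")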

Run:

export OPENAI_API_KEY=sk-...
python examples/openai_self_moa.py

Multi-Provider

File: multi_provider.py

Mix models from different providers in a single pipeline, combining the strengths of GPT-5 Nano and Claude.

# Route to appropriate provider based on model name
openai_client = AsyncOpenAI()
anthropic_client = AsyncOpenAI(
    base_url="https://api.anthropic.com/v1/",
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

async def multi_provider_client(model, messages, temp, max_tokens):
    client = anthropic_client if model.startswith("claude") else openai_client
    # ... make API call

pipeline = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "gpt-5-nano-2025-08-07"], temp=0.7),
    Shuffle(),  # Prevent position bias
    Aggregate("claude-sonnet-4-5", max_tokens=1024),
]
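
The elided call is the same Chat Completions request as in the basic example, issued against whichever client was selected (a sketch; it assumes both endpoints return OpenAI-style usage fields):

    resp = await client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temp,
        max_tokens=max_tokens,
    )
    return (
        resp.choices[0].message.content,
        resp.usage.prompt_tokens,
        resp.usage.completion_tokens,
    )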

What it demonstrates:

  • Provider routing based on model name prefix
  • Using Anthropic via OpenAI-compatible API
  • Shuffle() step to randomize response order (prevents position bias)
  • Cross-provider aggregation

Run:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
python examples/multi_provider.py

OpenRouter MoA

File: openrouter_moa.py

A 3-layer MoA approximating the benchmark-winning configuration from Wang et al. (2024), which achieved 65.1% on AlpacaEval 2.0.

PROPOSERS = [
    "qwen/qwen-2.5-72b-instruct",
    "meta-llama/llama-3.3-70b-instruct",
    "mistralai/mixtral-8x22b-instruct",
]

pipeline = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),      # Layer 1: Independent proposals
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),   # Layer 2: Each model sees all
    Aggregate("qwen/qwen-2.5-72b-instruct", max_tokens=1024),  # Layer 3: Final synthesis
]

What it demonstrates:

  • Full 3-layer MoA architecture (Propose → Synthesize → Aggregate)
  • Synthesize step where each model sees all previous responses
  • OpenRouter for unified access to multiple providers
  • Model namespacing (provider/model-name format)
  • Timing breakdown by layer (see the sketch below)
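
Per-layer timing falls out of the history returned by run() (a sketch; openrouter_client stands for whatever client function the example defines, and the history fields are those documented in the History Inspection section):

result, history = await run(pipeline, query, openrouter_client)

for h in history:
    print(f"{h['step']:<12} {h['step_time']:6.2f}s  ({len(h['llm_calls'])} LLM call(s))")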

Run:

export OPENROUTER_API_KEY=sk-or-...
python examples/openrouter_moa.py

Groq Free

File: groq_free.py

Zero-cost experimentation using Groq's free tier with fast inference. Demonstrates both standard MoA and Self-MoA.

from litellm import acompletion

async def groq_client(model, messages, temp, max_tokens):
    resp = await acompletion(
        model=f"groq/{model}",  # LiteLLM prefix
        messages=messages,
        temperature=temp,
        max_tokens=max_tokens,
    )
    return (...)

GROQ_FREE = [
    "llama-3.3-70b-versatile",
    "llama-3.1-8b-instant",
    "qwen/qwen3-32b",
    "meta-llama/llama-4-scout-17b-16e-instruct",
]

# Standard MoA with robustness
pipeline = [
    Propose(GROQ_FREE, temp=0.7, max_tokens=512),
    Shuffle(),
    Dropout(0.2),  # 20% random dropout
    Aggregate("llama-3.3-70b-versatile", max_tokens=1024),
]

# Self-MoA variant (also included)
self_moa_pipeline = [
    Propose(["llama-3.3-70b-versatile"] * 4, temp=0.7, max_tokens=512),
    Aggregate("llama-3.3-70b-versatile", max_tokens=1024),
]
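
The elided return in groq_client unpacks LiteLLM's OpenAI-compatible response (a sketch; field names assume the standard usage object):

    return (
        resp.choices[0].message.content,
        resp.usage.prompt_tokens,
        resp.usage.completion_tokens,
    )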

What it demonstrates:

  • LiteLLM integration with groq/ prefix
  • Dropout(0.2) step for robustness (randomly drops 20% of responses)
  • Free tier models from Groq
  • Running both standard and Self-MoA in one script
  • Execution details per step

Run:

export GROQ_API_KEY=gsk_...
python examples/groq_free.py

Rate Limits

Groq free tier has rate limits (typically 30 RPM). Check console.groq.com/docs/rate-limits for current limits.
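
One low-effort way to stay under a requests-per-minute cap is to limit concurrency in the client wrapper (a sketch using only asyncio; throttled_groq_client and the concurrency limit are illustrative, not part of the example):

import asyncio

_sem = asyncio.Semaphore(4)  # illustrative cap; tune to your rate limit

async def throttled_groq_client(model, messages, temp, max_tokens):
    # Only a few requests run at once; the rest wait on the semaphore.
    async with _sem:
        return await groq_client(model, messages, temp, max_tokens)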


History Inspection

File: with_history.py

Deep inspection of pipeline execution for debugging, analysis, and cost tracking.

pipeline = [
    Propose(["llama-3.3-70b-versatile"] * 4, temp=0.7, max_tokens=256),
    Shuffle(),
    Rank("llama-3.3-70b-versatile", n=2),  # Keep top 2 by quality
    Aggregate("llama-3.3-70b-versatile", max_tokens=512),
]

What it demonstrates:

  • Rank step that uses an LLM to select the top N responses by quality
  • Detailed history inspection with print_history() helper
  • Per-step timing and token counts
  • Output previews at each stage
  • JSON export for external analysis (history.json)
  • Error tracking (✓ for success, ✗ for failures)

History structure:

{
    "step": "Propose",           # Step type name
    "outputs": ["...", "..."],   # Responses after this step
    "llm_calls": [
        {
            "model": "llama-3.3-70b-versatile",
            "time": 2.34,
            "in_tokens": 150,
            "out_tokens": 280,
            "error": None,       # Present only if error occurred
        }
    ],
    "step_time": 2.34            # Total step execution time
}
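
Given that structure, exporting the full history for external analysis is a one-liner (a sketch; it assumes every value in the history is JSON-serializable):

import json

with open("history.json", "w") as f:
    json.dump(history, f, indent=2)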

Run:

export GROQ_API_KEY=gsk_...
python examples/with_history.py

Common Patterns

Client Function Template

All examples follow this signature:

async def client(model, messages, temp, max_tokens) -> tuple[str, int, int]:
    """
    Call your LLM provider.

    Args:
        model: Model identifier string
        messages: List of message dicts with 'role' and 'content'
        temp: Temperature for sampling (0.0-1.0)
        max_tokens: Maximum tokens in response

    Returns:
        tuple of (response_text, input_tokens, output_tokens)
    """
    ...

Token and Time Tracking

result, history = await run(pipeline, query, client)

total_in = sum(c["in_tokens"] for h in history for c in h["llm_calls"])
total_out = sum(c["out_tokens"] for h in history for c in h["llm_calls"])
total_time = sum(h["step_time"] for h in history)

print(f"Tokens: {total_in:,} in, {total_out:,} out")
print(f"Time: {total_time:.2f}s")

Error Handling

for step in history:
    for call in step["llm_calls"]:
        if "error" in call:
            print(f"Error in {call['model']}: {call['error']}")

Next Steps