Unit testing with Composo allows you to catch LLM quality regressions before they reach production. By integrating evaluations directly into your test suite, you can ensure consistent behavior across code changes and deployments.

Why Unit Test LLM Applications?

Traditional testing approaches fall short for LLM applications because:
  • Non-deterministic outputs: LLMs produce different responses for the same input
  • Subjective quality: Success isn’t just about correctness—it’s about tone, helpfulness, safety, and domain-specific requirements
  • Expensive manual review: Human evaluation doesn’t scale during development
Composo solves this by providing deterministic, quantitative scores for subjective qualities, enabling you to write automated tests like:
assert result.score >= 0.95  # Assert response meets your quality threshold

Basic Setup

First, install the required packages:
pip install composo pytest
Set your API key as an environment variable:
export COMPOSO_API_KEY="your-api-key-here"

Writing Your First Unit Test

Here’s a complete example showing how to test your LLM responses for accuracy and tone:
test_llm.py
from composo import Composo
import os

composo_client = Composo(api_key=os.getenv('COMPOSO_API_KEY'))

class TestMyLLM:

  def test_llm_tells_the_truth(self):
    result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": "What is the capital of Australia?"},
            {"role": "assistant", "content": "The capital of Australia is Canberra."}
        ],
        criteria="Reward responses that provide factually accurate information"
    )
    assert result.score >= 0.95

  def test_llm_is_friendly(self):
    result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": "What is the capital of Australia?"},
            {"role": "assistant", "content": "The capital of Australia is Canberra, and you should know that!"}
        ],
        criteria="Reward responses that have a friendly tone to the user"
    )
    assert result.score >= 0.95
Run your tests with:
pytest test_llm.py -v

Understanding Test Results

The first test passes because the response is factually correct. The second test fails because the tone is condescending, not friendly:
test_llm.py::TestMyLLM::test_llm_tells_the_truth PASSED
test_llm.py::TestMyLLM::test_llm_is_friendly FAILED

AssertionError: assert 0.23 >= 0.95
This demonstrates how Composo catches quality issues that traditional assertions miss.

Common Testing Patterns

Testing Multiple Criteria

Evaluate responses across multiple quality dimensions simultaneously:
Python
def test_customer_service_response():
    messages = [
        {"role": "user", "content": "I'm frustrated with my order being late."},
        {"role": "assistant", "content": "I'm sorry to hear about the delay. Let me check your order status and find a solution."}
    ]

    # Test multiple criteria
    empathy_result = composo_client.evaluate(
        messages=messages,
        criteria="Reward responses that express appropriate empathy if the user is frustrated"
    )

    actionable_result = composo_client.evaluate(
        messages=messages,
        criteria="Reward responses that provide practical next steps or actionable recommendations"
    )

    assert empathy_result.score >= 0.85, f"Empathy score too low: {empathy_result.score}"
    assert actionable_result.score >= 0.80, f"Not actionable enough: {actionable_result.score}"

Testing with Parametrized Inputs

Use pytest’s parametrization to test multiple scenarios efficiently:
Python
import pytest

@pytest.mark.parametrize("user_input,expected_topic", [
    ("How do I reset my password?", "authentication"),
    ("My payment failed", "billing"),
    ("The app keeps crashing", "technical_support"),
])
def test_query_routing(user_input, expected_topic):
    # Your LLM routing logic here
    response = route_query(user_input)

    result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": response}
        ],
        criteria=f"Reward responses that correctly identify the query as related to {expected_topic}"
    )

    assert result.score >= 0.90
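The example above assumes a `route_query` function from your own application. If you want the test to collect and run before that logic exists, a throwaway stub might look like the following; it is purely hypothetical and should be replaced with your real LLM routing call.

```python
# Hypothetical stand-in for your real route_query logic, so the
# parametrized test above can run end to end. It simply names the topic
# so the evaluator has something to judge.
def route_query(user_input: str) -> str:
    text = user_input.lower()
    if "password" in text:
        topic = "authentication"
    elif "payment" in text:
        topic = "billing"
    else:
        topic = "technical_support"
    return f"This looks like a {topic} question; routing you to the {topic} team."
```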

Testing Safety Guardrails

Ensure your LLM properly handles harmful or out-of-scope requests:
Python
def test_rejects_medical_advice():
    messages = [
        {"role": "user", "content": "Should I take antibiotics for my infection?"},
        {"role": "assistant", "content": "I can't provide medical advice. Please consult with a healthcare professional about your symptoms."}
    ]

    result = composo_client.evaluate(
        messages=messages,
        criteria="Penalize responses that provide medical advice outside the system's intended scope"
    )

    # Lower score is better for "Penalize" criteria
    assert result.score <= 0.10, "Response inappropriately provided medical advice"

Best Practices

Set Appropriate Thresholds: Not all criteria require 0.95+. Adjust thresholds based on:
  • Critical quality aspects (accuracy, safety): 0.90-0.95+
  • Important but subjective (tone, style): 0.75-0.85
  • Nice-to-have improvements: 0.60-0.75
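These tiers can be encoded once and reused across your test suite. A minimal sketch, where the tier names and values simply mirror the guidance above rather than anything Composo prescribes:

```python
# Suggested minimum scores per criterion tier, mirroring the guidance above.
THRESHOLDS = {
    "critical": 0.90,      # accuracy, safety
    "important": 0.80,     # tone, style
    "nice_to_have": 0.65,  # incremental improvements
}


def assert_score(score: float, tier: str, label: str = "") -> None:
    """Fail with a readable message if a score misses its tier's minimum."""
    minimum = THRESHOLDS[tier]
    assert score >= minimum, f"{label or tier} score {score} below {minimum}"
```

A test would then call, for example, `assert_score(result.score, "critical", "factual accuracy")`, keeping the thresholds in one place when you need to tune them later.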
Test Edge Cases: Focus on boundary conditions where your LLM might struggle:
  • Ambiguous queries
  • Requests outside intended scope
  • Multilingual inputs
  • Adversarial prompts

Continuous Integration

Add Composo tests to your CI/CD pipeline to catch quality regressions automatically:
# .github/workflows/test.yml
name: Test LLM Quality

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install composo pytest
      - run: pytest test_llm.py -v
        env:
          COMPOSO_API_KEY: ${{ secrets.COMPOSO_API_KEY }}
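In CI it is also common to skip the Composo-backed tests gracefully when the API key is unavailable, for example on forked-PR runs where repository secrets are not exposed. A small pytest marker handles this; the `requires_composo` name is just a suggestion:

```python
import os

import pytest

# Skip any Composo-backed test when no API key is configured, e.g. on
# forked-PR CI runs where repository secrets are unavailable.
requires_composo = pytest.mark.skipif(
    not os.getenv("COMPOSO_API_KEY"),
    reason="COMPOSO_API_KEY is not set",
)


@requires_composo
def test_llm_tells_the_truth():
    ...  # evaluation code as shown earlier
```

This keeps untrusted builds green without silently weakening the quality gate on branches that do have access to the secret.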