Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust, from just a single-sentence criterion.

Quickstart

Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations.

Step 1: Create Your Account

Sign up for a Composo account at platform.composo.ai.

Step 2: Generate Your API Key

  1. Navigate to Profile → API Keys in the dashboard
  2. Click Generate New API Key

Step 3: Run Your First Evaluation

[Optional] Install the SDK:
Shell
pip install composo
Now let’s evaluate a customer service response for empathy using the Composo SDK:
Python
from composo import Composo

# Initialize the client with your API key
composo_client = Composo(api_key="YOUR_API_KEY")

# Example: Evaluating a customer service response
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating"
)

# Display results
print(f"Score: {result.score}")
print(f"Analysis: {result.explanation}")
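A single `evaluate` call takes one criteria string, so scoring the same conversation against several criteria is just a loop. A minimal sketch (the `evaluate_fn` parameter is a stand-in for `composo_client.evaluate`, introduced here so the helper stays easy to test; it is not part of the Composo SDK):

```python
def evaluate_many(evaluate_fn, messages, criteria_list):
    """Run one evaluation per criterion and collect results keyed by criterion."""
    return {c: evaluate_fn(messages=messages, criteria=c) for c in criteria_list}
```

With the client above you would call it as `evaluate_many(composo_client.evaluate, messages, ["criterion one", "criterion two"])` and inspect each result's score and explanation.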

Understanding the Results

Composo returns:
  • Score: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria)
  • Explanation: Detailed analysis of why the response received this score
Example output:
Text
Score: 0.86/1.0
Analysis: - The assistant directly acknowledges the user's difficulty and expresses sympathy ("I'm sorry to hear that you're experiencing issues"), showing clear empathy.
- The response is timely and supportive, immediately addressing the expressed frustration and not ignoring the emotional content.
- It constructively adds a collaborative next step ("Let's see how I can assist you"), enhancing the empathetic tone, with only minor room for deeper emotional mirroring.
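Because scores are deterministic and on a fixed 0 to 1 scale, they are easy to act on in code, for example as a quality gate before shipping a prompt change. A minimal sketch (the threshold value is an illustrative assumption, not a Composo recommendation):

```python
def passes_quality_gate(score, threshold=0.8):
    """Return True when an evaluation score meets the minimum bar."""
    return score >= threshold
```

Under this example threshold, the 0.86 score above would pass the gate, while a 0.7 score would fail it.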

Step 4: Evaluate Agents with Tracing

For agent applications, Composo provides real-time tracing to capture and evaluate multi-agent interactions. Here’s a simple example with an orchestrator coordinating two sub-agents:
Python
from composo import Composo
from composo.models import criteria
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from openai import OpenAI

# Initialize tracing for OpenAI
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo_client = Composo(api_key="YOUR_API_KEY")
openai_client = OpenAI()

# Define a simple sub-agent
@agent_tracer(name="research_agent")
def research_agent(topic):
    return openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Research: {topic}"}],
        max_tokens=50
    )

# Orchestrator coordinates multiple agents
with AgentTracer("orchestrator") as tracer:
    # First sub-agent: planning
    with AgentTracer("planning_agent"):
        plan = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Plan a trip to Paris"}],
            max_tokens=50
        )

    # Second sub-agent: research
    research = research_agent("Paris attractions")

# Evaluate the full agent trace
results = composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent)

for result, criterion in zip(results, criteria.agent):
    print(f"Criterion: {criterion}")
    print(f"Evaluation Result: {result}\n")
This example shows how Composo traces each agent’s LLM calls independently and evaluates them against our comprehensive agent framework.
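Since the trace evaluation returns one result per criterion, you can roll them up into a single summary number for dashboards or regression tracking. A minimal sketch using plain dicts as stand-ins for the SDK's result objects (the `score` key is an assumption for illustration):

```python
def mean_score(results):
    """Average the per-criterion scores from a trace evaluation."""
    scores = [r["score"] for r in results]
    return sum(scores) / len(scores) if scores else 0.0
```

Tracking this average across runs gives a quick signal of whether changes to your agents improve or degrade overall behavior.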