Overview

The Composo class provides a synchronous client for evaluating chat messages against custom criteria. It is suited to single evaluations and small batches, and retries failed requests automatically.

Constructor

from composo import Composo

client = Composo(
    api_key="your_api_key",
    base_url="https://platform.composo.ai",
    num_retries=1,
    model_core=None,
    timeout=60.0
)

Parameters

api_key
string
Your Composo API key for authentication. If not provided, will be loaded from the COMPOSO_API_KEY environment variable.
base_url
string
default:"https://platform.composo.ai"
API base URL. Change only if using a custom Composo deployment.
num_retries
integer
default:"1"
Number of retries on request failure. Each retry uses exponential backoff with jitter. Minimum value is 1 (retries cannot be disabled).
model_core
string
Optional model core identifier for specifying the evaluation model. If not provided, uses the default evaluation model.
timeout
float
default:"60.0"
Request timeout in seconds: the total time to wait for a request to complete, including any retries.
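
The retry schedule is described only as "exponential backoff with jitter"; as an illustration of what that typically means (not the SDK's actual implementation), the delay between attempts grows as a capped power of two, with the actual sleep drawn uniformly below that nominal value:

```python
import random

def backoff_delays(num_retries, base=1.0, cap=30.0):
    """Illustrative exponential backoff with full jitter.

    For attempt n, the nominal delay is base * 2**n, capped at `cap`;
    the actual sleep is drawn uniformly from [0, nominal].
    """
    delays = []
    for attempt in range(num_retries):
        nominal = min(cap, base * 2 ** attempt)
        delays.append(random.uniform(0, nominal))
    return delays

# Three retries: nominal delays of 1s, 2s, 4s, each jittered down.
print(backoff_delays(3))
```

Full jitter keeps concurrent clients from retrying in lockstep against a recovering server.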

Example

from composo import Composo

# Using API key directly
client = Composo(api_key="your_api_key_here")

# Using environment variable
import os
os.environ["COMPOSO_API_KEY"] = "your_api_key_here"
client = Composo()

# With custom configuration
client = Composo(
    api_key="your_api_key",
    num_retries=3,
    timeout=120.0
)

evaluate()

Evaluate messages against one or more evaluation criteria.

result = client.evaluate(
    messages=[...],
    criteria="Your evaluation criterion",
    system=None,
    tools=None,
    result=None,
    block=True
)

Parameters

messages
list[dict]
required
List of chat messages to evaluate. Each message should be a dictionary with role and content keys.

Supported roles: system, user, assistant, tool

Example:
[
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"}
]
criteria
string | list[string]
Evaluation criterion or list of criteria. Can be a custom criterion string or use pre-built criteria from composo.criteria.

Example:
"Reward helpful and accurate responses"
# or
["Criterion 1", "Criterion 2", "Criterion 3"]
system
string
Optional system message to set AI behavior and context for the evaluation.
tools
list[dict]
Optional list of tool definitions for evaluating tool calls. Each tool should follow the OpenAI function calling format.
result
dict
Optional LLM result to append to the conversation for evaluation.
block
boolean
default:"True"
If False, returns a dictionary with task_id instead of blocking for results. Use for async job submission.

Returns

result
EvaluationResponse | list[EvaluationResponse]
  • Returns single EvaluationResponse if one criterion provided
  • Returns list[EvaluationResponse] if multiple criteria provided
  • Returns dict with task_id if block=False

Response Schema

EvaluationResponse
score
float | null
Evaluation score between 0.0 and 1.0. Returns null if the criterion was deemed not applicable.
explanation
string
Detailed explanation of the evaluation score and reasoning.
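
Because score is null (None in Python) when a criterion is judged not applicable, callers should guard against None before formatting or averaging. A minimal sketch, with a simple dataclass standing in for EvaluationResponse:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    """Stand-in for composo's EvaluationResponse, for this sketch only."""
    score: Optional[float]
    explanation: str

def summarize(result):
    """Format a result, treating a null score as 'not applicable'."""
    if result.score is None:
        return f"n/a - {result.explanation}"
    return f"{result.score:.2f} - {result.explanation}"

print(summarize(EvalResult(0.95, "Accurate answer")))
# 0.95 - Accurate answer
print(summarize(EvalResult(None, "Criterion not applicable")))
# n/a - Criterion not applicable
```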

Examples

Basic Evaluation

from composo import Composo

client = Composo()

messages = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]

result = client.evaluate(
    messages=messages,
    criteria="Reward accurate and informative responses"
)

print(f"Score: {result.score}")
# Output: Score: 0.95

print(f"Explanation: {result.explanation}")
# Output: Explanation: The response correctly identifies Paris as the capital of France...

Multiple Criteria Evaluation

results = client.evaluate(
    messages=[...],
    criteria=[
        "Reward accurate information",
        "Reward clear communication",
        "Penalize overly technical jargon"
    ]
)

for result in results:
    print(f"Score: {result.score} - {result.explanation}")
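
Assuming results come back in the same order as the criteria list (the return shape above suggests this, though it isn't stated explicitly), pairing them up makes reports easier to read. A sketch, with a namedtuple standing in for EvaluationResponse:

```python
from collections import namedtuple

# Stand-in for EvaluationResponse, used only for this sketch.
Result = namedtuple("Result", ["score", "explanation"])

def by_criterion(criteria, results):
    """Map each criterion string to its result, assuming matching order."""
    if len(criteria) != len(results):
        raise ValueError("criteria/results length mismatch")
    return dict(zip(criteria, results))

scores = by_criterion(
    ["Reward accurate information", "Reward clear communication"],
    [Result(0.9, "Accurate"), Result(0.7, "Mostly clear")],
)
print(scores["Reward clear communication"].score)
# 0.7
```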

Tool Call Evaluation

messages = [
    {"role": "user", "content": "What's the weather in SF?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "San Francisco"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": '{"temp": 65, "condition": "sunny"}'
    },
    {"role": "assistant", "content": "It's 65°F and sunny in San Francisco!"}
]

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    }
}]

result = client.evaluate(
    messages=messages,
    tools=tools,
    criteria="Reward correct tool usage and accurate responses"
)

Non-blocking Evaluation

# Submit evaluation without waiting
response = client.evaluate(
    messages=[...],
    criteria="Your criterion",
    block=False
)

task_id = response["task_id"]
print(f"Task submitted with ID: {task_id}")
# Use task_id to check status later
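
For fire-and-forget batches, block=False lets you submit every conversation first and collect the task IDs for later retrieval. A sketch of the submission side only (the status/retrieval API is not covered on this page):

```python
def submit_batch(client, conversations, criterion):
    """Submit each conversation without blocking; return its task IDs.

    `client` is any object with an evaluate() method matching the
    signature documented above (a Composo client, or a stub in tests).
    """
    task_ids = []
    for messages in conversations:
        resp = client.evaluate(messages=messages, criteria=criterion, block=False)
        task_ids.append(resp["task_id"])
    return task_ids
```

Each call returns immediately with a dict containing task_id, per the Returns section above.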

evaluate_trace()

Evaluate multi-agent traces with full conversation history across multiple agents.

result = client.evaluate_trace(
    trace=trace_object,
    criteria="Your evaluation criterion",
    model_core=None,
    block=True
)

Parameters

trace
MultiAgentTrace
required
Multi-agent trace object containing agent interactions, initial input, and final output.
criteria
string | list[string]
required
Evaluation criterion or list of criteria for trace evaluation.
model_core
ModelCore
Optional model core identifier for trace evaluation.
block
boolean
default:"True"
If False, returns a dictionary with task_id instead of blocking for results.

Returns

result
MultiAgentTraceResponse | list[MultiAgentTraceResponse]
  • Returns single MultiAgentTraceResponse if one criterion provided
  • Returns list[MultiAgentTraceResponse] if multiple criteria provided
  • Returns dict with task_id if block=False

Response Schema

MultiAgentTraceResponse
agent_scores
dict
Per-agent evaluation scores mapping agent IDs to their individual scores.
overall_score
float
Overall trace score aggregated across all agents.
explanation
string
Detailed explanation of the trace evaluation.
criterion
string
The criterion that was evaluated.
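
Since agent_scores maps agent IDs to per-agent scores, finding the weakest links in a trace is a single dictionary pass. A sketch, assuming the mapped values are plain floats:

```python
def weakest_agents(agent_scores, n=1):
    """Return the n lowest-scoring (agent_id, score) pairs, ascending."""
    return sorted(agent_scores.items(), key=lambda kv: kv[1])[:n]

print(weakest_agents({"planner": 0.9, "researcher": 0.4, "writer": 0.7}))
# [('researcher', 0.4)]
```

Useful for deciding which agent in a multi-agent pipeline to iterate on first.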

Example

from composo import Composo, ComposoTracer, Instruments, AgentTracer
from openai import OpenAI

# Initialize tracing
ComposoTracer.init(instruments=Instruments.OPENAI)
openai_client = OpenAI()
composo_client = Composo()

# Use AgentTracer context manager to capture trace
with AgentTracer(name="research_agent") as tracer:
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Research: quantum computing"}]
    )
    result = response.choices[0].message.content

    # Get the trace object
    trace = tracer.trace

# Evaluate the captured trace
evaluation = composo_client.evaluate_trace(
    trace=trace,
    criteria="Reward thorough research and accurate information"
)

print(f"Overall Score: {evaluation.overall_score}")
print(f"Explanation: {evaluation.explanation}")

Context Manager Usage

The Composo client supports context managers for automatic resource cleanup:

with Composo() as client:
    result = client.evaluate(
        messages=[...],
        criteria="Your criterion"
    )
    print(result.score)
# Client automatically closed