How to define a custom evaluator
Key concepts
Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics. These functions can be passed directly into evaluate() / aevaluate().
Basic example
Python

from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)

TypeScript

import type { EvaluationResult } from "langsmith/evaluation";

const correct = async ({ outputs, referenceOutputs }: {
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
  const score = outputs?.answer === referenceOutputs?.answer;
  return { key: "correct", score };
};
Evaluator args
Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:
run: Run
: The full Run object generated by the application on the given example.
example: Example
: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
inputs: dict
: A dictionary of the inputs corresponding to a single example in a dataset.
outputs: dict
: A dictionary of the outputs generated by the application on the given inputs.
reference_outputs/referenceOutputs: dict
: A dictionary of the reference outputs associated with the example, if available.
For most use cases you'll only need inputs, outputs, and reference_outputs. run and example are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.
When using JS/TS these should all be passed in as part of a single object argument.
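To illustrate what "any subset of the following arguments" means in practice, here is a minimal sketch of name-based argument matching. The call_evaluator helper is hypothetical and is not LangSmith's implementation; it only assumes that a harness inspects each evaluator's signature and passes the values whose names match. The sample inputs and the restates_question and has_answer evaluators are likewise invented for illustration.

```python
import inspect

# Hypothetical stand-in for an evaluation harness (NOT LangSmith's code):
# it passes each evaluator only the arguments its signature asks for.
def call_evaluator(evaluator, **available):
    wanted = inspect.signature(evaluator).parameters
    unknown = set(wanted) - set(available)
    if unknown:
        # An argument name outside the supported set cannot be filled in.
        raise TypeError(f"unsupported evaluator argument(s): {sorted(unknown)}")
    return evaluator(**{name: available[name] for name in wanted})

# Three evaluators, each declaring a different subset of the argument names.
def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

def has_answer(outputs: dict) -> bool:
    return bool(outputs.get("answer"))

def restates_question(inputs: dict, outputs: dict) -> bool:
    return inputs["question"].lower() in outputs["answer"].lower()

# Illustrative values for a single dataset example.
available = {
    "inputs": {"question": "What is 2 + 2?"},
    "outputs": {"answer": "4"},
    "reference_outputs": {"answer": "4"},
}

for fn in (correct, has_answer, restates_question):
    print(fn.__name__, call_evaluator(fn, **available))
```

Because matching is by name, an evaluator that declares a differently named parameter (for example prediction instead of outputs) would not receive a value in this sketch, which is why the argument names listed above matter.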
Evaluator output
Custom evaluators are expected to return one of the following types:
Python and JS/TS
dict
: dicts of the form {"score" | "value": ..., "key": ...} allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.
Currently Python only
int | float | bool
: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
str
: this is interpreted as a categorical metric. The function name is used as the name of the metric.
list[dict]
: return multiple metrics using a single function.
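The return types above can be sketched as four small Python evaluators, one per type. The function names, metric names, and the 50-character threshold are illustrative assumptions. The metric-name field is written as "key" here, following the TypeScript example earlier on this page; check the exact field name against your installed SDK version.

```python
# bool: treated as a continuous metric named after the function ("exact_match").
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

# str: treated as a categorical metric named after the function ("length_bucket").
def length_bucket(outputs: dict) -> str:
    return "short" if len(outputs["answer"]) < 50 else "long"

# dict: using "value" instead of "score" logs an integer as a categorical metric.
def word_count(outputs: dict) -> dict:
    return {"key": "word_count", "value": len(outputs["answer"].split())}

# list[dict]: several metrics computed by a single function.
def length_stats(outputs: dict) -> list[dict]:
    answer = outputs["answer"]
    return [
        {"key": "num_chars", "score": len(answer)},
        {"key": "num_words", "score": len(answer.split())},
    ]

sample_outputs = {"answer": "hmm i'm not sure"}
print(exact_match(sample_outputs, {"answer": "4"}))
print(length_bucket(sample_outputs))
print(word_count(sample_outputs))
print(length_stats(sample_outputs))
```

Any of these functions could then be listed in the evaluators argument alongside one another, in the same way as correct in the basic example.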
Additional examples
Python

from langsmith import evaluate, wrappers
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# Compare actual and reference outputs
def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""

    instructions = """\
Given the following question, answer, and reasoning, determine if the reasoning for the \
answer is logically valid and consistent with question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct, concision, valid_reasoning]
)
Related
- Evaluate aggregate experiment results: Define summary evaluators, which compute metrics for an entire experiment.
- Run an evaluation comparing two experiments: Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.