Building an Evaluator
Evaluators are the components responsible for inspecting a target system’s output and determining
whether it contains signals of interest — such as a security vulnerability, a correct refusal, or
an unexpected behaviour. They are called during the evaluate() phase of a SET
and their findings can drive the final "passed", "failed", or "error" verdict for each
SET case.
Evaluators are intentionally kept modular and independent. Each evaluator encapsulates a single,
well-defined detection concern. A SET’s evaluate() method can compose several evaluators
together, and the combined findings are then passed to a verdict-determination step (such as
determine_test_status()) that decides the final outcome. This separation makes evaluators
easy to reuse across different SETs and target types.
Note
The evaluator system is not limited to any one detection strategy. While the language model
evaluators used as an example on this page use regex pattern matching for speed and transparency,
the detect() interface places no constraints on the logic inside. An evaluator can call an
external API, run a classifier, parse structured data, compare numeric values, or apply any
other mechanism appropriate for the type of output being evaluated. The examples in this guide
illustrate this range of possibilities.
Existing evaluators can be found at
avise/evaluators/. If none suit your needs, this guide will walk you through
creating a new one.
Overview: The Evaluator Contract
Every evaluator must satisfy a minimal contract defined by a BaseEvaluator. For example,
evaluators extending the BaseLMEvaluator, intended for evaluating language model outputs,
must adhere to these rules:
It must declare a
nameand adescriptionas class attributes.It must implement a
detect()method that accepts a response string and returns aTuple[bool, List[str]]— a detection flag and a list of human-readable findings.
That is the entire interface. What happens inside detect() is up to the implementor.
evaluator.detect(response)
│
▼
(detected: bool, findings: List[str])
│
▼
collected into EvaluationResult.detections{}
│
▼
determine_test_status() → "passed" / "failed" / "error"
The findings list is stored verbatim in the detections field of EvaluationResult
within a SET and appears in the final report, so its contents should be meaningful to a human reviewer.
1. The Base Class for Language Model Evaluators
BaseLMEvaluator is an abstract base class that defines the shared interface and provides an
optional regex-matching helper for evaluators that choose to use it. It declares three class
attributes and two methods.
The _find_pattern_matches() helper is a convenience provided for regex-based evaluators. It
is entirely optional — evaluators that use different detection strategies simply do not call it.
The patterns class attribute can be left as an empty list or omitted when not needed.
2. Writing a New Evaluator
The steps to create a new evaluator are always the same, regardless of the detection logic used:
Create a new
.pyfile underavise/evaluators/in the appropriate subdirectory for your target type (e.g.languagemodel/,multimodal/).Define a class that inherits from a Base Evaluator abstract class, register it with
@evaluator_registry.register(), and setnameanddescription.Implement
detect()with whatever logic suits your target output and detection goal.
The skeleton below shows the minimal structure every language model evaluator must have:
The @evaluator_registry.register() decorator takes the same string as name and makes
the evaluator discoverable throughout the framework without requiring manual imports elsewhere than
in the avise/evaluators/<AI_SYSTEM_TYPE>/__init__.py.
3. Detection Logic: Approaches and Examples
The following examples illustrate a range of detection strategies that can be used inside
detect(). They are not prescriptive — choose or combine whichever approach is appropriate
for the output type and the behaviour you want to detect.
Approach A: Regex pattern matching
The simplest approach for text based outputs. Define a patterns list of regex strings as a class attribute and
delegate to _find_pattern_matches() inside detect(). This is a good fit when the signals
of interest are expressible as surface-level text patterns — specific keywords, structural
phrases, or formatting signatures in a text response.
Tip
A few regex conventions used across the built-in evaluators are worth following for
consistency: prefix patterns with (?i) to make them case-insensitive; use \s+ rather
than a literal space to tolerate extra whitespace; and use bounded wildcards like .{1,100}
rather than .* when matching text between two anchors, to prevent false positives spanning
multiple sentences.
Approach B: Custom rule-based logic
When the detection condition cannot be expressed as a single regex — for example, when multiple
conditions must hold simultaneously, when numerical thresholds are involved, or when the finding
needs to carry a computed value — implement the logic directly in detect() and build the
findings list yourself. The patterns attribute can be left empty.
Approach C: Structured output validation
When the target system produces structured output — JSON, XML, a Python dict — rather than free
text, detect() can parse and validate that structure directly. This is well-suited to
evaluating function-calling models, agents that return tool invocations, or any system where the
output format is machine-readable and has a defined schema.
Approach D: External model or API call
For evaluations that require semantic understanding — such as determining whether a response is
factually correct, contextually appropriate, or ideologically biased — detect() can delegate
to an external classifier, a second language model, or any API. This is more expensive than the
previous approaches and should be used when simpler methods are insufficient.
Note
Evaluators that call external resources should handle failures gracefully. Consider catching
exceptions inside detect() and returning a descriptive finding string rather than letting
the exception propagate — this keeps the pipeline running and surfaces the failure clearly in
the report rather than crashing execution mid-run.
4. Using Evaluators in a SET
Evaluators are instantiated in the SET’s __init__() and called inside evaluate(). Their
findings are collected into the detections dictionary of each EvaluationResult, and a
verdict-determination helper then interprets the combined findings to produce the final status.
The verdict-determination logic (determine_test_status()) decides which evaluator signals
take priority. For example, a schema violation might immediately fail a SET case regardless of other
findings, while a high response length might only be treated as a failure when combined with
additional signals. The right priority ordering depends entirely on what your SET is measuring.
See building_set for a complete worked example of determine_test_status().
Contributing a New Evaluator
If you have written an evaluator that is used in a SET or could be applicable for some SETs, consider contributing it to the main repository so it can be used. For details on the contribution process, see Contributing to AVISE.