Building an Evaluator

Evaluators are the components responsible for inspecting a target system’s output and determining whether it contains signals of interest — such as a security vulnerability, a correct refusal, or an unexpected behaviour. They are called during the evaluate() phase of a SET and their findings can drive the final "passed", "failed", or "error" verdict for each SET case.

Evaluators are intentionally kept modular and independent. Each evaluator encapsulates a single, well-defined detection concern. A SET’s evaluate() method can compose several evaluators together, and the combined findings are then passed to a verdict-determination step (such as determine_test_status()) that decides the final outcome. This separation makes evaluators easy to reuse across different SETs and target types.

Note

The evaluator system is not limited to any one detection strategy. While the language model evaluators used as an example on this page use regex pattern matching for speed and transparency, the detect() interface places no constraints on the logic inside. An evaluator can call an external API, run a classifier, parse structured data, compare numeric values, or apply any other mechanism appropriate for the type of output being evaluated. The examples in this guide illustrate this range of possibilities.

Existing evaluators can be found at avise/evaluators/. If none suit your needs, this guide will walk you through creating a new one.

Overview: The Evaluator Contract

Every evaluator must satisfy a minimal contract defined by a BaseEvaluator. For example, evaluators extending the BaseLMEvaluator, intended for evaluating language model outputs, must adhere to these rules:

It must declare a name and a description as class attributes.
It must implement a detect() method that accepts a response string and returns a Tuple[bool, List[str]] — a detection flag and a list of human-readable findings.

That is the entire interface. What happens inside detect() is up to the implementor.

evaluator.detect(response)
      │
      ▼
(detected: bool, findings: List[str])
      │
      ▼
collected into EvaluationResult.detections{}
      │
      ▼
determine_test_status() → "passed" / "failed" / "error"

The findings list is stored verbatim in the detections field of EvaluationResult within a SET and appears in the final report, so its contents should be meaningful to a human reviewer.

1. The Base Class for Language Model Evaluators

BaseLMEvaluator is an abstract base class that defines the shared interface and provides an optional regex-matching helper for evaluators that choose to use it. It declares three class attributes and two methods.

The _find_pattern_matches() helper is a convenience provided for regex-based evaluators. It is entirely optional — evaluators that use different detection strategies simply do not call it. The patterns class attribute can be left as an empty list or omitted when not needed.

2. Writing a New Evaluator

The steps to create a new evaluator are always the same, regardless of the detection logic used:

Create a new .py file under avise/evaluators/ in the appropriate subdirectory for your target type (e.g. languagemodel/, multimodal/).
Define a class that inherits from a Base Evaluator abstract class, register it with @evaluator_registry.register(), and set name and description.
Implement detect() with whatever logic suits your target output and detection goal.

The skeleton below shows the minimal structure every language model evaluator must have:

The @evaluator_registry.register() decorator takes the same string as name and makes the evaluator discoverable throughout the framework without requiring manual imports elsewhere than in the avise/evaluators/<AI_SYSTEM_TYPE>/__init__.py.

3. Detection Logic: Approaches and Examples

The following examples illustrate a range of detection strategies that can be used inside detect(). They are not prescriptive — choose or combine whichever approach is appropriate for the output type and the behaviour you want to detect.

Approach A: Regex pattern matching

The simplest approach for text based outputs. Define a patterns list of regex strings as a class attribute and delegate to _find_pattern_matches() inside detect(). This is a good fit when the signals of interest are expressible as surface-level text patterns — specific keywords, structural phrases, or formatting signatures in a text response.

Tip

A few regex conventions used across the built-in evaluators are worth following for consistency: prefix patterns with (?i) to make them case-insensitive; use \s+ rather than a literal space to tolerate extra whitespace; and use bounded wildcards like .{1,100} rather than .* when matching text between two anchors, to prevent false positives spanning multiple sentences.

Approach B: Custom rule-based logic

When the detection condition cannot be expressed as a single regex — for example, when multiple conditions must hold simultaneously, when numerical thresholds are involved, or when the finding needs to carry a computed value — implement the logic directly in detect() and build the findings list yourself. The patterns attribute can be left empty.

Approach C: Structured output validation

When the target system produces structured output — JSON, XML, a Python dict — rather than free text, detect() can parse and validate that structure directly. This is well-suited to evaluating function-calling models, agents that return tool invocations, or any system where the output format is machine-readable and has a defined schema.

Approach D: External model or API call

For evaluations that require semantic understanding — such as determining whether a response is factually correct, contextually appropriate, or ideologically biased — detect() can delegate to an external classifier, a second language model, or any API. This is more expensive than the previous approaches and should be used when simpler methods are insufficient.

Note

Evaluators that call external resources should handle failures gracefully. Consider catching exceptions inside detect() and returning a descriptive finding string rather than letting the exception propagate — this keeps the pipeline running and surfaces the failure clearly in the report rather than crashing execution mid-run.

4. Using Evaluators in a SET

Evaluators are instantiated in the SET’s __init__() and called inside evaluate(). Their findings are collected into the detections dictionary of each EvaluationResult, and a verdict-determination helper then interprets the combined findings to produce the final status.

The verdict-determination logic (determine_test_status()) decides which evaluator signals take priority. For example, a schema violation might immediately fail a SET case regardless of other findings, while a high response length might only be treated as a failure when combined with additional signals. The right priority ordering depends entirely on what your SET is measuring. See building_set for a complete worked example of determine_test_status().

Contributing a New Evaluator

If you have written an evaluator that is used in a SET or could be applicable for some SETs, consider contributing it to the main repository so it can be used. For details on the contribution process, see Contributing to AVISE.