avise.pipelines.languagemodel

avise.pipelines.languagemodel.pipeline

Base class for all vulnerability framework SETs.

All SETs inherit from BaseSETPipeline and should implement all 4 phases: initialize() -> execute() -> evaluate() -> report()

class avise.pipelines.languagemodel.pipeline.BaseSETPipeline[source]

Bases: ABC

The base Pipeline class for Language Model Security Evaluation Tests.

Based on a 4-phase pipeline with defined data contracts:

Phase 1 - initialize(config_path) -> List[LanguageModelSETCase] (Get SET configuration from a configuration file and return a list of SET cases) Phase 2 - execute(connector, sets) -> OutputData (Execute the SET cases against the defined model and return dataobjects for evaluation) Phase 3 - evaluate(execution_data) -> List[EvaluationResult] (Evaluate the test results and return evaluation objects) Phase 4 - report(results, output_path, format) -> ReportData (Take the evaluation objects and form a final report in a desired format run() - Runs all phases

Data flow:

initialize() —> List[LanguageModelSETCase] —> execute() —> OutputData(List[ExecutionOutput, execution_time]) —> evaluate() —> List[EvaluationResult] —> report() —> ReportData

When new language model SETs are designed, override these methods according to your needs. New evaluators, connectors, loaders, reporters, configurations, and SETs can be written and used as long as they follow this pipeline structure.

SUPPORTED_FORMATS = [ReportFormat.JSON, ReportFormat.HTML, ReportFormat.MARKDOWN]
static calculate_passrates(results: List[EvaluationResult]) Dict[str, Any][source]

Calculate summary statistics (pass%, fail%, error%) based on results.

Helper for report phase. Can be overwritten.

static calculate_subcategory_runs(results: List[EvaluationResult], subcategory_field: str = 'vulnerability_subcategory') Dict[str, int][source]

Calculate number of runs per vulnerability subcategory.

Parameters:
  • results – List of evaluation results

  • subcategory_field – Metadata field name for subcategory (default: vulnerability_subcategory)

Returns:

Dict mapping subcategory name to number of runs

description: str = ''
abstractmethod evaluate(execution_data: OutputData) List[EvaluationResult][source]

Evaluate the SET outputs with evaluators.

Parameters:

execution_data – OutputData from execute()

Returns:

Evaluation of each SET

Return type:

List[EvaluationResult]

Requirements:
  • Must produce one EvaluationResult per ExecutionOutput

  • Status must be “passed”, “failed”, or “error” TODO: Something else?

  • Reason should explain the SET status. Why did the SET pass, fail or cause an error?

abstractmethod execute(connector: BaseLMConnector, sets: List[LanguageModelSETCase]) OutputData[source]

Run the SETs against the target.

Parameters:
  • connector – A connector instance

  • sets – List of SET cases from initialize()

Returns:

All SET outputs along with the execution time.

Return type:

OutputData

Requirements:
  • Must produce one ExecutionOutput per LanguageModelSETCase.

  • Metadata from LanguageModelSETCase should be carried through for final report.

  • Errors should be placed to ExecutionOutput.error for later inspection.

generate_ai_summary(results: List[EvaluationResult], summary_stats: Dict[str, Any], subcategory_runs: Dict[str, int] | None = None) Dict[str, Any] | None[source]

Generate an AI summary of the security evaluation test results.

This is an optional helper method that can be called in the report phase to generate an AI-powered summary of the test results.

Parameters:
  • results – List of EvaluationResult from evaluate()

  • summary_stats – Summary statistics from calculate_passrates()

  • subcategory_runs – Optional dict of subcategory -> number of runs

abstractmethod initialize(set_config_path: str) List[LanguageModelSETCase][source]

Load and return SET cases from configuration files.

Parameters:

set_config_path – Path to SET configuration file

Returns:

SET cases used in the run

Return type:

List[LanguageModelSETCase]

Requirements:
  • Each SET case must at least contain an ID and a prompt

  • Additional data related to the SETs go to the metadata

name: str = ''
abstractmethod report(results: List[EvaluationResult], output_path: str, report_format: ReportFormat = ReportFormat.JSON, generate_ai_summary: bool = True) ReportData[source]

Generate the final report in the desired format and save it to target location.

Parameters:
  • results – List[EvaluationResult] from evaluate()

  • output_path – Path for output file (../user/reports/..)

  • report_format – Report format (Json, Toml, Yaml…) Set to JSON as default.

  • generate_ai_summary – Whether to generate AI summary (optional)

Returns:

The final report with all the SET data

Return type:

ReportData

Requirements:
  • Must write a report in the requested format to output_path

run(connector: BaseLMConnector, set_config_path: str, output_path: str, report_format: ReportFormat = ReportFormat.JSON, connector_config_path: str | None = None, generate_ai_summary: bool = True, runs: int = 1) ReportData[source]

Orchestration method that executes the 4-phase pipeline. This method gets called by the execution engine.

Parameters:
  • connector – A connector instance

  • set_config_path – Path to the SET configuration

  • output_path – Path where the output report is written

  • report_format – Desired output format

  • connector_config_path – Path to model configuration (for report metadata)

  • generate_ai_summary – Whether to generate AI summary

  • runs – How many times to run the SET

Returns:

The final report with all the SET data

Return type:

ReportData

Requirements:

Return the final report Calls other class methods with appropriate arguments

class avise.pipelines.languagemodel.pipeline.ReportFormat(*values)[source]

Bases: Enum

Available file formats.

HTML = 'html'
JSON = 'json'
MARKDOWN = 'md'

avise.pipelines.languagemodel.schema

Dataclasses for avise/pipelines/language_model/pipeline.py

class avise.pipelines.languagemodel.schema.EvaluationResult(set_id: str, prompt: str, response: str, status: str, reason: str, detections: Dict[str, ~typing.Any]=<factory>, metadata: Dict[str, ~typing.Any]=<factory>, elm_evaluation: str | None = None)[source]

Bases: object

Evaluation result of a single test

Produced by evaluate() function for each ExecutionOutput.

detections: Dict[str, Any]
elm_evaluation: str | None = None
metadata: Dict[str, Any]
prompt: str
reason: str
response: str
set_id: str
status: str
to_dict() Dict[str, Any][source]

Convert to dictionary for serialization.

class avise.pipelines.languagemodel.schema.ExecutionOutput(set_id: str, prompt: str, response: str, metadata: Dict[str, ~typing.Any]=<factory>, error: str | None = None)[source]

Bases: object

Single test execution / output result.

Produced by execute() for each test case.

error: str | None = None
metadata: Dict[str, Any]
prompt: str
response: str
set_id: str
to_dict() Dict[str, Any][source]
class avise.pipelines.languagemodel.schema.LanguageModelSETCase(id: str, prompt: str, metadata: Dict[str, ~typing.Any]=<factory>)[source]

Bases: object

Contract: Output of initialize(), input to execute().

ID and prompt are required fields that every SET case must contain. Additional fields can be added to ‘metadata’.

id: str
metadata: Dict[str, Any]
prompt: str
to_dict() Dict[str, Any][source]
class avise.pipelines.languagemodel.schema.OutputData(outputs: List[ExecutionOutput], duration_seconds: float)[source]

Bases: object

Output of execute(), input to evaluate().

Contains all execution outputs and execution duration in seconds.

duration_seconds: float
outputs: List[ExecutionOutput]
to_dict() Dict[str, Any][source]
class avise.pipelines.languagemodel.schema.ReportData(set_name: str, timestamp: str, execution_time_seconds: float | None, summary: ~typing.Dict[str, ~typing.Any], results: ~typing.List[~avise.pipelines.languagemodel.schema.EvaluationResult], configuration: ~typing.Dict[str, ~typing.Any] = <factory>, ai_summary: ~typing.Dict[str, ~typing.Any] | None = <factory>, group_results: bool = True)[source]

Bases: object

Output of the report phase / function.

The final report structure that is serialized to the desired format based on the given command line argument.

ai_summary: Dict[str, Any] | None
configuration: Dict[str, Any]
execution_time_seconds: float | None
group_by_vulnerability() Dict[str, List[EvaluationResult]][source]

Group results by vulnerability_subcategory field in metadata.

Returns:

Dict mapping set_category to list of results

group_results: bool = True
results: List[EvaluationResult]
set_name: str
summary: Dict[str, Any]
timestamp: str
to_dict() Dict[str, Any][source]