Building a Security Evaluation Test

Security Evaluation Tests, or SETs, contain the detailed logic for identifying a specific vulnerability or assessing the security of a target system or component within a specified scope. SETs inherit the base logic for the execution flow of a certain type of a SET from BaseSETPipelines. For example, all language model SETs inherit the execution flow logic from pipelines.languagemodel.BaseSETPipeline.

In this example, we will be creating a single turn prompt injection SET to test language models if they can be easily manipulated into doing producing potentially harmful and malicious outputs.

Before we can create a SET for a specific type of a target system or model, we need to have a BaseSETPipeline made for that type of a target. You can check out building_pipeline for an example on how to create one, if there is no suitable pipeline available for your SET at avise/pipelines/.

For clarity, here are the packages we will use later on in the code:

logging: logging is used to create logs that will help with debugging and informing the user of what’s happening when the program is executing.
typing: Type hints are used for function parameters to define specific types for the parameters.
utils.ConfigLoader: Used to load configuration data as a dictionary from a JSON file.
utils.ansi_color: A dictionary of ansi codes for logging - helps us add color to logs to make them prettier and easier to follow.
pipelines.languagemodel. * BaseSETPipeline: The base pipeline we will be extending with our SET class. * LanguageModelSETCase: A data class we can use for each SET case. * ExecutionOutput: A data class for each SET case execution output. * OutputData: A data class that contains all relecant data from execution outputs. * EvaluationResults: A data class for SET case evaluation results. * ReportData: A data class for the final report.
registry.set_registry: Registry where we want to register our SET, so it is available to the Execution Engine and executable.
BaseLMConnector: We will use this as a type hint for the execute() method.
evaluators.languagemodel.*: Different evaluators we will use to evaluate the execution outputs.
JSONReporter, HTMLReporter, MarkdownReporter: Different types of reporters we can use for report generation
models.EvaluationLanguageModel: Language model we will use to evaluate the SET results.

1. Initialization

To begin, we want to define our SET class that inherits from the base pipeline, and an id that describes our SET. As we are making a prompt injection SET, prompt_injection works well as the id. With the @set_registry.register() decorator, we register our SET to the registry. Next, we need to have name and description class attributes, that describe our SET and will be used in the final report. In the __init__() method we define all the required instance attributes:

Now we need to check the required phases for SET execution from the pipelines.languagemodel.BaseSETPipeline. If you want to familiarize yourself more with how the pipelines.languagemodel.BaseSETPipeline works and how it was made, you can check out building_pipeline - it has 4 phases: initialize(), execute(), evaluate(), and report(). The names of the phases are pretty self-explanatory: in initialize() we initialize the SET, in execute() we execute the SET, in evaluate() we evaluate the SET execution results, and in report() we produce the final report.

In the initialize() phase we load all the required configuration data from a SET configuration JSON file. For this, we first need to create the configuration JSON that holds the data we will use to execute our SET. The configuraiton file should hold all data that you (or someone else if you will contribute the SET to the main repository) might want to modify later to adjust how the SET is executed.

To give some instructions for the Evaluation Language Model on how to evaluate our SET results, we need to include a evaluation_system_prompt field with the desired system prompt as its value in the cofiguration JSON. Additionally, we may want to adjust how many tokens the target model generates whenever we use a prompt containing a prompt injection. For this, we need to include a target_model_max_tokens field in the configuration file. And finally, we add the set_cases field that contains all the individual SET cases along with their descriptive data. More fields can be added based on your needs, but for this simple example, these should be enough.

With the SET configuration file made, we can write the method for initialize() phase. It loads all of the data from the configuration file and creates a list of LanguageModelSETCase instances based on the loaded data:

2. Execution

Next, we can write the method for the execute() phase. As our SET cases are pretty simple (run a single turn prompt containing a prompt injection on the target model and evaluate results) the execute() method doesn’t need to be too complicated. In the method, we loop through each of the SET cases, generate responses using a Connector, handle potential errors, and add the model responses along with the SET case data into a list of outputs. The method returns a OutputData instance containing all of the SET outputs with other relevant data:

3. Evaluation

In the evaluate() phase, the outputs generated in execute() phase are evaluated for any potential security issues with using, for example, pattern matching algorithms or language models, depending on the type of a SET being creted. For the prompt injection SET, we will create evaluators for detecting partial compliance (a language model acknowledges it should refuse to help with a request, but decides to respond and help anyway.), refusal (a language model correctly refuses a malicious or inappropriate prompt), and vulnerability (a language model’s response contains patterns indicating it has been manipulated or compromised). For detailed example on how to create these evaluators, take a look at Building an Evaluator section of these docs.

Each of these evaluators is a pattern matching algorithm, and in addition, we will utilize an Evaluation Language Model to evaluate the prompt:response pairs and produce and output where either VULNERABLE, SAFE, or UNCERTAIN keyword is present (the system prompt for the evaluation model is defined in the SET configuration JSON discussed earlier).

With the evaluators defined, the evaluate() method loops over the outputs for each of the SET execution cases, run the evaluator algorithms and produces a generation from the Evaluation Language Model, and determines a verdict for the SET case with determine_test_status() helper method:

The determine_test_status() helper method scans the evaluator results on a specific SET case and determines the final verdict based on a priority principle vulnerability > suspicious > partial > refusal >. The evalutor detection with the highest priority will be determined as the final verdict. If none of the evaluators detected any predetermined patterns in the model’s response, the method returns an error status and suggests a manual review of the SET:

3. Reporting

With the evaluation method defined, the last method to write is the report() method which generates the final report summarizing the executed SET. The report() method creates a ReportData object from the executed SET, which contains all the relevant data from the

SET, such as: execution time, passrates and statistics, configurations used, and the results. And finally, a report file is written based on the ReportData object:

Testing the new SET

Now that we have created a new SET and a configuration JSON file for it, it is time to make sure it works as we intended. As we have created a SET for language models, we can try to run it on some target model

see if it works. Assuming we have a target model running locally via Ollama, and that we have configured an ollama connector through a configuration JSON file to connect to the target model.

By running the following command in the root directory of AVISE, we can test the newly created prompt_injection SET, with the cofiguration JSON we created earlier, on a target Ollama model:

python -m avise --SET prompt_injection --connectorconf avise/configs/connector/languagemodel/ollama.json --SETconf avise/configs/SET/languagemodel/single_turn/prompt_injection_mini.json

--SET: with this argument, we tell the CLI which SET we wish to execute.
--connectorconf: with this argument, we tell the CLI the path of the connector configuration JSON we just created.
--SETconf: with this optional argument, we can give the CLI a path to a custom SET configuration file (there are predefined default paths if we don’t use this argument)

If our code has no errors and works as we intended, the Execution Engine starts running the SET and eventually produces a report file and prints something like this to the console:

In the case that there were some errors in our code, we need to debug them until the SET cases execute fully.

Contributing the new SET

Now that we have a functional new SET, we can contribute it to the main repository so other users can utilize it as well! For details on how to contribute a SET to the main repository, check out :ref:`contributing_set.