This article outlines methods for improving Policy Assessment and compliance rating results from Visual Agents.
For information on how to configure and improve the accuracy of Action triggers, see Setting up Actions for Visual Agents.
Run Policy Tests from the website
The quickest way to perform Policy Tests and observe your Policy in action is through the Visual Agents Tests page at https://ai.camio.com/tests. The Tests page provides custom inputs that override the configured Policy and Focus Areas of a Policy Test, allowing for quick iteration on Policy changes.
See How can I test policies with my own video samples? for more information on how to use the Tests page.
Creating a Golden Dataset and running Batch Tests
Note: The following features are currently in development and restricted to Visual Agents subscribers only.
For those looking to perform a more thorough testing process, we recommend creating a Golden Dataset and running tests with the Batch Tester.
While the Tests page only allows one Policy Assessment at a time, the Batch Tester runs multiple Policy Assessments across large quantities of data and provides precise metrics on the overall quality of results generated from the custom Policy.
The process has two parts: Creating a Golden Dataset from existing videos and images, then Running Batch Tests by feeding that dataset and the desired Policy into a conveniently packaged Docker service.
Prerequisites
The packaged services below require an installation of Docker, which can be downloaded from the official Docker website.
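As a quick sanity check before proceeding, you can confirm the Docker CLI is available on your system. This is a minimal sketch, not part of the packaged services:

```python
import shutil

# Quick check that the Docker CLI is installed and on PATH
# before attempting to run the packaged services.
if shutil.which("docker") is None:
    raise SystemExit("Docker CLI not found; install Docker first.")
print("Docker CLI found.")
```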
Creating a Golden Dataset
To confidently test large quantities of data against a Policy and confirm its performance, you will need to create a Golden Dataset of video data to test your Policy against.
A Golden Dataset is a carefully vetted and manually refined collection of data that serves as a source of truth against which the decision making of the Visual Agents is compared, in this case, the compliance of a given video. It allows Policy Assessment performance to be determined by comparing each compliance evaluation to the expected or desired outcome.
Because the Golden Dataset is the sole basis for determining whether Policy Assessments are correct, it is important that the labels provided are as accurate as possible; inaccurate labels will skew the performance metrics calculated for your custom Policy.
Camio currently provides a method for labeling Camio Events for Golden Dataset generation: Camio's Event Labeler.
Following the steps in the article, manually label any Event you find with its actual compliance state, either "compliant" or "noncompliant". These manually confirmed Events can be fetched in bulk for use as a Golden Dataset with the Batch Tester.
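For illustration only, a single labeled entry in a Golden Dataset might conceptually look like the sketch below. The field names here are assumptions; the actual schema is defined by the Batch Tester package.

```python
# Hypothetical shape of one labeled Golden Dataset entry.
# Field names are illustrative assumptions only; consult the Batch
# Tester's README.md for the actual schema.
golden_entry = {
    "event_id": "example-event-123",   # the manually labeled Camio Event
    "camera": "loading-dock-1",        # source camera, if applicable
    "label": "noncompliant",           # confirmed compliance state:
                                       # "compliant" or "noncompliant"
}
```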
Running Batch Tests
Any Golden Dataset you have created with the above method can be downloaded locally and tested against any custom Policy using our Batch Tester tool.
The Batch Tester is publicly downloadable from here.
The Batch Tester package comes as a compressed ZIP that includes instructions for both the Golden Dataset download service and the main Batch Tester service.
At a minimum, the services require the following:
- An Access Token from a Camio account with Can View permission on the Event data labeled for Golden Dataset usage, which can be generated from the Integrations page in the Camio account Settings.
- A Visual Agents key from a paid Visual Agents account, which can be generated from the Visual Agents account's Keys page.
For instructions on how to use the Batch Tester, please refer to the README.md file included in the main directory of the ZIP.
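Before launching the services, it can help to confirm both credentials are in place. The environment-variable names below are illustrative assumptions; match them to whatever the README.md actually specifies:

```python
import os
import sys

# Hypothetical variable names -- check the package's README.md for
# the names the services actually expect.
REQUIRED = ("CAMIO_ACCESS_TOKEN", "VISUAL_AGENTS_KEY")

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit("Missing required credentials: " + ", ".join(missing))
print("Both credentials found.")
```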
Understanding the Results of a Batch Test
After successful completion, the Batch Test will provide the following items:
- The original JSON results for each Policy Test run
- A CSV of the links to view the result of each Policy Test run from the Visual Agents website
- A CSV of the Confusion Matrices, by camera if applicable
- A CSV of the Performance Metrics, by camera if applicable
While the individual JSON results are included to allow manual inspection of each run, the Confusion Matrices and Performance Metrics summarize the overall evaluation of the test results for easier viewing and analysis.
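For a quick look at the summarized outputs, a short script like the following can print the rows of one of the CSVs. The file name used here is a placeholder; substitute the actual output names from your Batch Test run:

```python
import csv

# Placeholder file name -- substitute the actual Performance Metrics
# CSV produced by your Batch Test run.
with open("performance_metrics.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one dict per camera (or overall), as applicable
```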
Confusion Matrices
A Confusion Matrix is a simple two-by-two table representing the overall performance of a system based on whether its judgements were correct or incorrect.
In the case of Policy Testing, we provide a Confusion Matrix indicating whether Visual Agents was able to correctly flag an assessment for a Policy violation.
The Positive or Negative part indicates whether Visual Agents flagged a Policy violation: Positive means a violation was flagged, i.e. the Event was assessed as Noncompliant.
The True or False part indicates whether the judgement Visual Agents made was correct. For example, if an Event that is actually Noncompliant is marked as Compliant, the result is a False Negative.
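As a minimal sketch of how the four outcomes are tallied, assuming simple "compliant"/"noncompliant" labels (this is generic logic, not the Batch Tester's implementation):

```python
from collections import Counter

# Each pair is (actual, predicted); "noncompliant" is the positive
# class, i.e. a flagged Policy violation.
results = [
    ("noncompliant", "noncompliant"),  # True Positive
    ("compliant",    "compliant"),     # True Negative
    ("compliant",    "noncompliant"),  # False Positive
    ("noncompliant", "compliant"),     # False Negative
]

def cell(actual, predicted):
    correct = "True" if actual == predicted else "False"
    flagged = "Positive" if predicted == "noncompliant" else "Negative"
    return f"{correct} {flagged}"

matrix = Counter(cell(a, p) for a, p in results)
print(matrix)  # counts for each of the four Confusion Matrix cells
```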
Performance Metrics
The Performance Metrics provided are based on calculations from the Confusion Matrices, and represent different ratios of correct and incorrect evaluations based on the Policy. The provided Performance Metrics include Accuracy, Specificity, Precision, and Recall.
Because it is important to understand what these metrics represent in order to determine which are most relevant to your desired Policy improvements, basic definitions of each metric as it relates to the Batch Tester are provided below, along with their relevance to common Policy improvement goals; a worked example follows the list:
- Accuracy is the ratio of all correctly assessed (True) results over all results, best likened to the overall grade on the Batch Test. Accuracy reflects the general ability to correctly assess results under the provided Policy and is not specific to the type of result.
- Specificity is the ratio of correctly assessed Compliant (True Negative) results over all actually Compliant results. This metric specifically indicates how accurately Visual Agents uses the Policy to determine that an Event is Compliant, and may be more relevant if the goal of testing the Policy is to decrease the sensitivity or rate of flagging violations.
- Precision is the ratio of correctly flagged Noncompliant (True Positive) results over all results Visual Agents determined to be Noncompliant, both correctly and incorrectly. This metric indicates the overall quality of flagged violations, and is relevant if the goal is to improve the quality of flagged violations or decrease the number of false flags.
- Recall is the ratio of correctly flagged Noncompliant (True Positive) results over all actually Noncompliant results. If Precision represents the quality of flagged violations, then Recall best represents the quantity of violations correctly caught, and is relevant if the goal is to increase the sensitivity or rate of flagging violations.
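Expressed as formulas over the four Confusion Matrix counts, a generic worked example (not the Batch Tester's own code) looks like this:

```python
# Example counts from a hypothetical Confusion Matrix.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # all correct over all results
specificity = tn / (tn + fp)   # correct Compliant over all actual Compliant
precision   = tp / (tp + fp)   # correct flags over everything flagged
recall      = tp / (tp + fn)   # correct flags over all actual Noncompliant

print(f"Accuracy={accuracy:.2f}  Specificity={specificity:.2f}  "
      f"Precision={precision:.2f}  Recall={recall:.2f}")
```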