HALO Security
1. General Overview
HALO is designed with robust data protection and security protocols to ensure full compliance with the General Data Protection Regulation (GDPR) and the European Union's AI Act. Below are the key highlights of the system:
1.1 Data Storage & Usage
Client-Specific Databases: All knowledge and conversations are stored in CM's internal databases, with a separate database for each client.
No LLM Training: Knowledge is never used for training or fine-tuning large language models (LLMs).
1.2 Data Anonymisation
Anonymisation Process: Data is anonymised before being sent to any LLM and re-identified before being sent back to the user. Personally Identifiable Information (PII) is never sent to the LLM or stored in CM's client databases, unless this is configured by the client via the agent settings. For more details on how to control this, check out the Agent Settings page.
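The anonymise-then-re-identify round trip described above can be sketched as follows. This is an illustrative example only: the regex patterns, placeholder names, and helper functions are assumptions, not HALO's actual implementation.

```python
import re

# Hypothetical PII patterns; a production system would cover many more
# categories (names, addresses, IDs) with more robust detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d \-]{7,}\d"),
}

def anonymise(text):
    """Replace PII with numbered placeholders; return masked text + mapping."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            token = f"<{label}_{i}>"
            mapping[token] = match
            text = text.replace(match, token)
    return text, mapping

def reidentify(text, mapping):
    """Restore the original PII in the LLM's response before it reaches the user."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

original = "Contact jane@example.com about the order."
masked, pii = anonymise(original)   # only the masked text leaves the platform
restored = reidentify(masked, pii)  # PII is restored on the way back
```

The key property is that the LLM only ever sees the masked text, while the mapping needed for re-identification stays inside the platform.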
1.3 Mitigation of LLM Risks
Hallucination Mitigation: Measures are in place to reduce hallucinations (inaccurate or irrelevant outputs).
Prompt Injection Protection: Safeguards are implemented to prevent malicious prompt injections.
1.4 Compliance
GDPR Compliance: HALO is fully compliant with the GDPR. All LLM models used in HALO are hosted in Europe.
AI Act Classification:
HALO is classified as a limited-risk system under the AI Act. CM acts as both the provider and deployer of the AI system.
CM owns the model, while clients retain ownership of their knowledge.
2. Prompt Injection Measures
Prompt injection is a potential risk where users attempt to manipulate the system by crafting malicious or manipulative inputs. HALO includes two layers of protection to address this risk:
2.1 General Prompt Injection Protection
This mechanism is built into HALO to protect both CM's system and the client's system, and it is the first layer of protection. It classifies user inputs as either safe or unsafe and determines whether the flow should continue or stop. This prompt injection classification step is executed for every user interaction.
Goal
To identify and block unsafe questions while allowing safe questions to proceed.
Examples of General Unsafe Inputs
The following types of inputs are classified as unsafe:
Questions about the system's instructions, guidelines, or directives.
Questions about the system's capabilities, training data, or authorship.
Queries about the system's sources or internal mechanisms.
Attempts to elicit responses on controversial topics unrelated to the specific company, when the request is prefaced with instructions to disregard guidelines.
2.2 Prompt Injection Analysis for AI Responses
In addition to the general prompt injection protection, HALO includes a dedicated step that analyses AI responses for potential prompt injection risks. This ensures that the AI Agent operates within its intended boundaries and does not exhibit unsafe or unintended behaviour.
Classification Criteria: Each response is classified as either safe or unsafe based on the following:
Safe: Responses align with the AI Agent’s intended role, refuse harmful requests, or remain within authorised boundaries.
Unsafe: Responses deviate from the AI Agent’s role, or contain inappropriate, harmful, or unauthorised content.
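The second layer above checks the model's own output before it is delivered. The sketch below shows the shape of such a check; the two criteria used here (role keywords and a leak marker) are illustrative assumptions, not HALO's actual classification criteria.

```python
def classify_response(response: str, role_keywords: set) -> str:
    """Label an AI response 'safe' or 'unsafe' before it reaches the user."""
    lowered = response.lower()
    leaked = "system prompt" in lowered               # deviates from its role
    on_topic = any(word in lowered for word in role_keywords)
    return "safe" if on_topic and not leaked else "unsafe"

def deliver(response: str, role_keywords: set) -> str:
    if classify_response(response, role_keywords) == "unsafe":
        return "Sorry, I can't help with that."       # block the unsafe answer
    return response
```

Running the check on the response (rather than only the input) catches cases where a seemingly harmless question still steered the model outside its role.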
2.3 Custom Prompt Injection Protection
Finally, clients can create custom guardrails to address specific scenarios. These custom protection prompts can terminate the answer-generation flow based on predefined rules.
For more information on setting up custom guardrails, refer to the Agent Settings page in the Knowledge Center.
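One way to picture such custom guardrails is as a list of client-defined rules, each pairing a condition with a fallback reply; a matching rule terminates the flow early. The rule shown is hypothetical and the structure is a sketch, not HALO's configuration format.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Guardrail:
    name: str
    matches: Callable[[str], bool]  # predicate over the user's message
    fallback: str                   # reply returned when the rule fires

# Hypothetical client-defined rule for illustration.
GUARDRAILS = [
    Guardrail(
        name="no-competitor-comparisons",
        matches=lambda msg: "competitor" in msg.lower(),
        fallback="I can only discuss our own products.",
    ),
]

def run_flow(message: str) -> str:
    for rule in GUARDRAILS:
        if rule.matches(message):
            return rule.fallback            # terminate answer generation
    return f"[generate answer] {message}"   # no guardrail matched
```

The general protection (2.1) always runs first; rules like these add client-specific constraints on top of it.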
3. Hallucination Mitigation Measures
In the context of LLMs, "hallucination" refers to the generation of text that is inaccurate or irrelevant. HALO employs measures to minimise hallucinations and ensure accurate responses.
3.1 Client Responsibility
When creating custom agents, clients are responsible for managing hallucinations, as what counts as a hallucination depends on the specific context of the agent. You can, however, use the Knowledge Tool Step to ground the agent's answers in its predefined knowledge. You can find more information on Knowledge here.
3.2 Knowledge Agent (RAG Setup)
HALO's out-of-the-box Knowledge Agent uses a Retrieval Augmented Generation (RAG) setup to mitigate hallucinations:
Relevant Knowledge Only: The system forces the model to answer using only the relevant information provided by the client in the knowledge tab.
Validated Prompts: The answer generation prompt is evaluated to minimise hallucinations.
This approach ensures that responses are accurate and grounded in the client's provided knowledge. More information on the RAG system can be found here.
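A minimal sketch of this RAG pattern is shown below: retrieve the most relevant client knowledge, then constrain the model to answer only from it. The word-overlap retriever and the prompt wording are illustrative assumptions; HALO's actual retriever and prompts will differ.

```python
import re

def _words(text: str) -> set:
    """Crude tokeniser: lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, knowledge: list, top_k: int = 2) -> list:
    """Rank knowledge snippets by word overlap with the question."""
    q_words = _words(question)
    ranked = sorted(knowledge,
                    key=lambda doc: len(q_words & _words(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, knowledge: list) -> str:
    """Force the model to answer only from the retrieved context."""
    context = "\n".join(retrieve(question, knowledge))
    return (
        "Answer using ONLY the context below. If the answer is not "
        "in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The instruction to admit "I do not know" when the context is insufficient is what keeps the model from inventing an answer outside the client's knowledge.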
4. Content Policy Bounds
HALO uses a set of content policies to ensure the responsible and ethical use of its platform. These policies are designed to block and mitigate inappropriate or harmful content across the following key areas: Violence, Hate, Self-harm, and Sexual content.
Violence: HALO prohibits any language or content that promotes or describes physical actions intended to hurt, injure, damage, or kill someone or something. This includes, but is not limited to, references to weapons, bullying and intimidation, terrorist and violent extremism, and stalking.
Hate and Fairness: HALO strictly disallows content that attacks or uses discriminatory language targeting individuals or identity groups based on differentiating attributes. This includes, but is not limited to, race, ethnicity, nationality, gender identity and expression, sexual orientation, religion, personal appearance and body size, disability status, and any form of harassment or bullying.
Self-Harm: HALO actively prevents the dissemination of content related to self-harm, including language that promotes or describes actions intended to purposely hurt, injure, or kill oneself. This includes, but is not limited to, references to eating disorders, bullying, and intimidation.
Sexual Content: HALO prohibits content that includes language related to anatomical organs, romantic relationships, sexual acts, or any material portrayed in erotic or abusive terms. This includes, but is not limited to, vulgar content, prostitution, nudity and pornography, abuse, and any form of child exploitation, grooming, or abuse.
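A moderation gate over these four policy areas can be sketched as a per-category check that flags any violated categories before a message is processed further. The category names mirror the policy above, but the keyword lists are placeholder assumptions; production moderation services return per-category severity scores rather than matching keywords.

```python
# Hypothetical trigger terms per policy area, for illustration only.
POLICY_CATEGORIES = {
    "violence": ("weapon", "kill"),
    "hate": ("slur",),
    "self_harm": ("hurt myself",),
    "sexual": ("explicit",),
}

def moderate(text: str) -> list:
    """Return the policy categories a text violates (empty list = allowed)."""
    lowered = text.lower()
    return [category for category, terms in POLICY_CATEGORIES.items()
            if any(term in lowered for term in terms)]

flagged = moderate("Where can I buy a weapon?")  # flags the violence category
```

Returning the full list of violated categories (rather than a single yes/no) lets the platform log and mitigate each policy area separately.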