Guard Model

A guard model is a lightweight probabilistic model (often a smaller LLM or classifier) that evaluates inputs and outputs for adversarial intent. It performs pre-inference checks on sanitized inputs (e.g., computing embedding similarity to known adversarial examples using cosine distance, with a threshold of 0.8) and post-inference checks on generated outputs (e.g., scanning for secret patterns or anomalous behaviors). The guard model uses techniques like few-shot prompting or fine-tuned classification to assign a risk score, e.g., risk = σ(w · e), where w is a learned weight vector, e is the embedding vector, and σ is the sigmoid function. If the risk exceeds a threshold (e.g., 0.5), the query is rejected or the output is redacted.
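
A minimal sketch of this pipeline in Python, under a few assumptions not fixed by the text above: embeddings are precomputed vectors (no embed() step is shown), the 0.8 threshold is read as a cosine-similarity cutoff against known adversarial embeddings, w is a learned weight vector, and the secret pattern is a hypothetical API-key format used only for illustration:

```python
import re

import numpy as np

# Thresholds from the text above; 0.8 is treated here as a
# cosine-similarity cutoff (an assumption about its direction).
SIMILARITY_THRESHOLD = 0.8
RISK_THRESHOLD = 0.5

# Illustrative secret pattern (hypothetical API-key format).
SECRET_PATTERNS = [re.compile(r"sk-[A-Za-z0-9]{20,}")]


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def risk_score(e: np.ndarray, w: np.ndarray) -> float:
    """risk = σ(w · e), with w a learned weight vector."""
    return 1.0 / (1.0 + np.exp(-float(np.dot(w, e))))


def pre_inference_check(e: np.ndarray, adversarial: list[np.ndarray],
                        w: np.ndarray) -> bool:
    """Allow the query only if it is dissimilar to all known adversarial
    examples AND its risk score stays below the rejection threshold."""
    too_similar = any(cosine_similarity(e, adv) >= SIMILARITY_THRESHOLD
                      for adv in adversarial)
    return not too_similar and risk_score(e, w) < RISK_THRESHOLD


def post_inference_check(output: str) -> str:
    """Redact substrings of the generated output that match known
    secret patterns."""
    for pat in SECRET_PATTERNS:
        output = pat.sub("[REDACTED]", output)
    return output
```

For example, a query embedding that lies close to a stored adversarial embedding is rejected before inference, while a leaked key in the output is redacted afterward:

```python
w = np.zeros(4)  # untrained weights, demo only
e = np.array([1.0, 0.0, 0.0, 0.0])
adversarial = [np.array([0.9, 0.1, 0.0, 0.0])]

print(pre_inference_check(e, adversarial, w))         # False: cosine sim ≈ 0.99 >= 0.8
print(post_inference_check("token: sk-" + "a" * 24))  # token: [REDACTED]
```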
