Llama Guard 3 Offers Protection With 1B, 8B, And 11B-Vision

November 4, 2024

329

Introduction

There are now three varieties of Llama Guard available: Llama Guard 3 1B, Llama Guard 3 8B, and Llama Guard 3 11B-Vision. The third model provides the same vision understanding features as the base Llama 3.2 11B-Vision model, while the other two models are text-only. For text-only prompts, all of the models are multilingual and adhere to the classifications established by the ML Commons project. For further information about each model and its capabilities, consult the corresponding model cards.

Llama 3.2 Update

By including a multimodal model (11B) for evaluating image and text input and a smaller text-only model (1B) for on-device and cloud safety evaluations, this update expands on the features first presented in Llama Guard 3. While a new special token has been added to accommodate picture input, the prompt format remains similar with the current one.

Image Support

To classify the prompt, the multimodal model assesses the image and the prompt text jointly. Image-only classification is not its intended function. Additionally, the prompt’s text component should be in English since the model has been tuned for English-language text. It is expected of developers of other languages to make sure that their deployments are tested and carried out in a responsible and safe manner.

You should use the Llama Guard 3 1B or Llama Guard 3 8B models (which came with Llama 3.1) for text-only classification.

The format (quality and aspect ratio) of the photos you provide to the Llama 3.2 multimodal models should match that of the images you submit for evaluation. Additionally, take note that the model is unable to evaluate photos that were produced using generative AI technology.

Images can be evaluated in multi-turn conversations, however the turn in which the image appears must have the image token added . However, the model only analyzes one image every question, thus multi-turn support here does not equate to multi-image support.

Use Llama Guard 3 8B for S14 Code Interpreter Abuse

The category S14 Code Interpreter Abuse was not optimized for the new Llama Guard 3 1B model. In combination with the Llama 3.1 release, you should use the 8B model that was introduced in Llama Guard 3 if you need to screen for this category.

Prompt format

Tokens	Description
<\|begin_of_text\|>	Specifies the start of the prompt
<\|start_header_id\|> <\|end_header_id\|>	These tokens enclose the role for a particular message. The possible roles can be user and assistant
<\|eom_id\|>	End of turn. Represents when the LLM determines it finished interacting with the user message that initiated its response. This is used at the end of interaction with the model.
<\|image\|>	Denotes that an image will be sent to the model for evaluation. Do not use with text-only inference, such as when using Llama Guard 3 1B.

Note: A well-designed Llama Guard prompt has several sections, separated by tags like and . The model can correctly understand the prompt because these are regular text in the prompt rather than special tokens.

There are two distinct prompts: one for user input and one for agent output because the guardrails can be applied to both the model’s input and output. User and Agent are two possible choices for the role placeholder; the former denotes the input and the latter the output. The agent answer must be absent from the conversation when assessing the user input. Both the user input and the agent response must be present in the conversation at the same time in order to evaluate the agent response; in this instance, the user input offers crucial context for the evaluation.

The llama recipes repository includes an example of inference and a helper function that demonstrates how to style the prompt correctly using the given categories. Custom categories for the prompt can be made using this as a template. The llama-stack GitHub repository, which contains reference implementations for input and output guardrails, is another option.

Note: An image will be provided to the model for assessment when the <|image|> token is present. Take this unique token out of the prompt for text-only inference, like when using Llama Guard 3 1B.

The variables to replace in this prompt template are:

{{ role }}: It can have the values: User or Agent. Note that the capitalization here differs from that used in the prompt format for the Llama 3.1 model itself.
{{ unsafe_categories }}: The default categories and their descriptions are shown below. These can be customized for zero-shot or few-shot prompting.
{{ user_message }}: input message from the user.
{{ model_answer }}: output from the model.

As an alternative, the question can also include the complete description for every category. This allows you to modify these descriptions in order to modify the behavior of the model for your particular use cases:

Model Specifications

A pretrained Llama-3.1-8B model, Llama Guard 3 has been optimized for content safety classification. It is capable of classifying material in both LLM responses (response classification) and LLM inputs (prompt classification), just as earlier iterations. In its output, it produces text that shows whether a prompt or response is safe or dangerous, and if unsafe, it also identifies the content categories that were broken. It functions similarly to an LLM.

Llama Guard 3 was created to support Llama 3.1 capabilities and was matched to protect against the MLCommons specified risks taxonomy. In particular, it offers content filtering in eight languages and was designed to enable code interpretation tool calls and safety and security.