Friday, February 7, 2025

Evaluating Agents with the Vertex AI Gen AI Evaluation Service

Building the next generation of trustworthy AI requires thorough agent evaluation. Merely verifying outputs is not enough: you must also understand the "why" behind an agent's behaviour, including its logic, its decision-making process, and the route it takes to arrive at a solution.

Google Cloud has announced that the Vertex AI Gen AI evaluation service is now available in public preview. With this new functionality, developers can thoroughly evaluate and understand their AI agents. It offers a robust set of evaluation metrics tailored to agents built with various frameworks, along with native agent inference capabilities to speed up the evaluation process.

This post explains how the evaluation metrics work and shows an example of how you can use them with your own agents.

Evaluating agents using the Vertex AI Gen AI evaluation service

Google Cloud's evaluation metrics fall into two categories: final response evaluation and trajectory evaluation.

Final response evaluation asks a straightforward question: does your agent achieve its objectives? You can create custom final response criteria to gauge success against your specific requirements. For instance, you might check whether a retail chatbot offers correct product information, or whether a research agent summarises findings with the proper tone and style.

To dig deeper, Google Cloud provides trajectory evaluation, which examines the agent's decision-making process. Trajectory evaluation is key to understanding your agent's logic, spotting mistakes or inefficiencies, and ultimately improving performance. To help you answer these questions, Google Cloud provides six trajectory evaluation metrics (a small worked example follows the list):

Exact match

Requires the agent's trajectory, its sequence of actions, to match the ideal solution exactly.

In-order match

The agent's trajectory must contain all required actions in the right order, but it may also include extraneous steps. Think of following a recipe exactly while adding a few extra spices along the way.

Any-order match

This metric is even more flexible: it only checks that the agent's trajectory contains all required actions, regardless of their order. It is like reaching your destination no matter which route you take.

Precision

This metric focuses on the correctness of the agent's actions. It computes the proportion of actions in the predicted trajectory that also appear in the reference trajectory. High precision indicates that the agent mostly takes relevant actions.

Recall

This metric assesses the agent's ability to capture every essential action. It computes the proportion of actions in the reference trajectory that also appear in the predicted trajectory. An agent with high recall is unlikely to miss important steps.

Single tool use

This metric checks whether a specific action is present in the agent's trajectory. It is useful for determining whether an agent has learned to use a particular tool or capability.
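To make the exact match, precision, and recall definitions concrete, here is a minimal, illustrative Python sketch that computes them over a pair of trajectories represented as lists of tool-call names. This is only a simplified illustration under that assumption, not the service's actual implementation, which computes these metrics for you from your evaluation dataset.

# Illustrative only: simplified trajectory metrics over lists of tool-call names.
def trajectory_exact_match(predicted, reference):
    # 1.0 only if the predicted actions match the reference exactly, in order.
    return 1.0 if predicted == reference else 0.0

def trajectory_precision(predicted, reference):
    # Share of predicted actions that also appear in the reference trajectory.
    if not predicted:
        return 0.0
    return sum(1 for action in predicted if action in reference) / len(predicted)

def trajectory_recall(predicted, reference):
    # Share of reference actions that also appear in the predicted trajectory.
    if not reference:
        return 0.0
    return sum(1 for action in reference if action in predicted) / len(reference)

reference = ["lookup_order", "check_inventory", "send_reply"]
predicted = ["lookup_order", "send_reply", "ask_clarification"]

print(trajectory_exact_match(predicted, reference))  # 0.0, trajectories differ
print(trajectory_precision(predicted, reference))    # 2 of 3 predicted actions are in the reference
print(trajectory_recall(predicted, reference))       # 2 of 3 reference actions were predicted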

Compatibility meets flexibility

The Vertex AI Gen AI evaluation service supports numerous agent architectures.

With this release, you can evaluate agents built with Reasoning Engine (LangChain on Vertex AI), the managed runtime for your Vertex AI agentic applications. Google Cloud also supports agents built with open-source frameworks such as LangChain, LangGraph, and CrewAI, and plans to support future Google Cloud offerings for agent development.

For maximum flexibility, you can evaluate agents through a custom function that takes prompts and returns responses. To streamline your evaluation process, Google Cloud provides native agent inference and automatically logs all results in Vertex AI Experiments.
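As a rough sketch of what such a custom function might look like, the snippet below wraps a hypothetical agent and returns both the final answer and the tool calls taken to produce it. The function name, the agent call, and the output keys (response, predicted_trajectory) are illustrative assumptions; check the official documentation for the exact contract the evaluation service expects.

# Hypothetical wrapper around your own agent; the agent call and the output
# keys are illustrative assumptions, not a documented contract.
def run_my_agent(prompt: str) -> dict:
    # Invoke your agent however your framework exposes it (LangGraph, CrewAI, ...).
    final_answer, tool_calls = my_agent_app.run(prompt)  # placeholder call

    # Return the final answer plus the sequence of tool calls taken to produce it,
    # so both final-response and trajectory metrics can be computed.
    return {
        "response": final_answer,
        "predicted_trajectory": [
            {"tool_name": name, "tool_input": args} for name, args in tool_calls
        ],
    }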

Evaluating agents in action

Consider the following LangGraph customer support agent. Your goal is to evaluate both the responses the agent provides and the sequence of steps (or "trajectory") it takes to generate those responses.

LangGraph
Image credit to Google Cloud

To evaluate an agent with the Vertex AI Gen AI evaluation service, you begin by creating an evaluation dataset. Ideally, this dataset contains the following components (a minimal example follows the list):

User's prompt

This is the input the user provides to the agent.

Reference trajectory

The expected sequence of steps the agent should follow to produce the correct answer.

Generated trajectory

The actual sequence of steps the agent took to produce its response to the user query.

Response

The answer the agent produced, given the sequence of actions it took.
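As a minimal illustration, such a dataset can be assembled as a pandas DataFrame. The column names below (prompt, reference_trajectory, predicted_trajectory, response) mirror the components above, but treat the exact schema as an assumption and confirm it against the documentation for your SDK version.

import pandas as pd

# A one-row example dataset; the column names are assumptions based on the
# components described above.
byod_eval_sample_dataset = pd.DataFrame({
    "prompt": ["Where is my order #1234?"],
    "reference_trajectory": [[
        {"tool_name": "lookup_order", "tool_input": {"order_id": "1234"}},
        {"tool_name": "send_reply", "tool_input": {}},
    ]],
    "predicted_trajectory": [[
        {"tool_name": "lookup_order", "tool_input": {"order_id": "1234"}},
        {"tool_name": "send_reply", "tool_input": {}},
    ]],
    "response": ["Your order #1234 shipped yesterday and should arrive Friday."],
})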

Once you have gathered your evaluation dataset, choose the metrics you want to use to assess the agent. See Evaluate Gen AI agents for the full set of metrics and their definitions. Here are a few metrics you might define:

response_tool_metrics = [
    "trajectory_exact_match", "trajectory_in_order_match", "safety", response_follows_trajectory_metric
]

Note that response_follows_trajectory_metric is passed as a custom metric that you define yourself to assess your agent.

Standard text-generation metrics such as coherence may not be adequate when evaluating AI agents that interact with an environment, because they focus mainly on text structure. Agent responses should instead be judged on their effectiveness within that environment. With the Vertex AI Gen AI evaluation service, you can create custom metrics such as response_follows_trajectory_metric to determine whether the agent's response makes sense given its tool choices. Please see the official notebook for further details on these metrics.
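As a rough sketch of how such a custom metric could be defined before it is added to the response_tool_metrics list above, the snippet below uses a model-based pointwise metric with a judge prompt. The prompt text and the rating criteria are illustrative assumptions, not the official notebook's definition.

from vertexai.preview.evaluation import PointwiseMetric

# Illustrative judge prompt; the official notebook defines its own template.
response_follows_trajectory_prompt = """You are evaluating an AI agent.
Given the user prompt, the tool calls the agent made (predicted trajectory),
and the agent's final response, rate from 0 to 1 whether the response is
consistent with, and follows logically from, the trajectory.

User prompt: {prompt}
Predicted trajectory: {predicted_trajectory}
Response: {response}
"""

response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt,
)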

With your evaluation dataset and metrics defined, you can now launch your first agent evaluation job on Vertex AI. Please refer to the code example that follows.

# Import libraries
import vertexai
from vertexai.preview.evaluation import EvalTask

# Initiate Vertex AI session
vertexai.init(
    project="my-project-id",
    location="my-location",
    experiment="evaluate-langgraph-agent",
)

# Define an EvalTask
response_eval_tool_task = EvalTask(
    dataset=byod_eval_sample_dataset,
    metrics=response_tool_metrics,
)

# Run evaluation
response_eval_tool_result = response_eval_tool_task.evaluate(
    experiment_run_name="response-over-tools"
)

To begin the evaluation, create an EvalTask with the predefined dataset and metrics, then call its evaluate method to run the evaluation job. The evaluation is tracked as an experiment run inside Vertex AI Experiments, Vertex AI's managed experiment-tracking service. The evaluation results are displayed both in the notebook and in the Vertex AI Experiments UI. If you are using Colab Enterprise, the results can also be shown in the Experiment side panel, as seen below.

Experiment side panel
Image credit to Google Cloud

The Vertex AI Gen AI evaluation service provides comprehensive insights into agent performance through summary and metrics tables. These include aggregate results across the dataset as well as individual results for every user input and trajectory pair, for all specified metrics.
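A minimal way to inspect those results in the notebook is sketched below; it assumes the returned evaluation result exposes a summary_metrics dictionary and a metrics_table DataFrame, as in the preview SDK.

# Aggregate scores across the whole dataset (metric name -> value).
print(response_eval_tool_result.summary_metrics)

# Per-row results: one row per prompt/trajectory pair, with a column per metric.
metrics_df = response_eval_tool_result.metrics_table
print(metrics_df.head())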

Because you have access to these detailed evaluation results, you can produce insightful visualisations of agent performance, such as bar and radar charts like the one below:

Radar charts
Image credit to Google Cloud
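As a minimal sketch of such a visualisation, the snippet below draws a bar chart of the aggregate scores with matplotlib; which metric names appear depends on what summary_metrics actually contains for your run.

import matplotlib.pyplot as plt

# Plot the aggregate score for each metric as a bar chart.
summary = response_eval_tool_result.summary_metrics
names = list(summary.keys())
values = [summary[name] for name in names]

plt.figure(figsize=(8, 4))
plt.bar(names, values)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Score")
plt.title("Agent evaluation summary metrics")
plt.tight_layout()
plt.show()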

Start now

Unlock the full potential of your agentic applications by exploring the Vertex AI Gen AI evaluation service in public preview.
