Building the next generation of trustworthy AI requires thorough agent evaluation. You need to understand the “why” behind an agent’s behavior, including its logic, its decision-making process, and the route it takes to arrive at a solution. Merely verifying the outputs is insufficient.
Google Cloud has announced that the Vertex AI Gen AI evaluation service is now available in public preview. With this new functionality, developers can thoroughly evaluate and understand their AI agents. It offers native agent inference capabilities to speed up the evaluation process, along with a robust set of evaluation metrics tailored for agents built with a variety of frameworks.
In this post, we’ll walk through how these evaluation metrics work and show an example of how you can use them with your own agents.
Evaluating agents with the Vertex AI Gen AI evaluation service
Google Cloud’s evaluation metrics fall into two categories: final response evaluation and trajectory evaluation.
Final response evaluation asks a straightforward question: does your agent achieve its goals? You can define custom final response criteria to measure success against your specific requirements. For instance, you might check whether a retail chatbot provides correct product information, or whether a research agent summarizes its findings in the proper tone and style.
To dig deeper, trajectory evaluation examines the agent’s decision-making process. Understanding your agent’s logic, spotting mistakes or inefficiencies, and ultimately improving performance all depend on it. To help you answer these questions, Google Cloud provides six trajectory evaluation metrics:
Exact match
Requires the agent to produce a trajectory (a sequence of actions) that exactly matches the ideal solution.
In-order match
The agent’s trajectory must contain all the required actions in the correct order, but it may also include extra steps. Think of following a recipe exactly while adding a few extra spices along the way.
Any-order match
This metric is even more flexible: it only checks that the agent’s trajectory contains all the required actions, regardless of their order. It’s like reaching your destination no matter which route you take.
Precision
This metric focuses on how relevant the agent’s actions are. It computes the percentage of actions in the predicted trajectory that are also present in the reference trajectory. High precision indicates that the agent mostly takes pertinent actions (a short sketch after this list illustrates the calculation).
Recall
This metric assesses the agent’s ability to capture every crucial action. It computes the percentage of actions in the reference trajectory that are also present in the predicted trajectory. An agent with high recall is unlikely to miss important steps.
Single tool use
This metric checks whether a specific action is present in the agent’s trajectory. It’s useful for determining whether an agent has learned to use a particular tool or capability.
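To make the precision and recall metrics concrete, here is a minimal sketch of the underlying arithmetic for a single predicted trajectory compared against a reference. The tool names are made up for illustration, and this is not the service’s implementation.

# Toy illustration of trajectory precision and recall (not the service's implementation).
reference_trajectory = ["lookup_order", "check_inventory", "issue_refund"]
predicted_trajectory = ["lookup_order", "greet_user", "issue_refund"]

# Precision: share of predicted actions that also appear in the reference.
precision = sum(a in reference_trajectory for a in predicted_trajectory) / len(predicted_trajectory)

# Recall: share of reference actions that also appear in the prediction.
recall = sum(a in predicted_trajectory for a in reference_trajectory) / len(reference_trajectory)

print(f"precision={precision:.2f}, recall={recall:.2f}")  # both 2/3 ≈ 0.67 in this example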
Compatibility meets flexibility
The Vertex AI Gen AI evaluation service supports a wide range of agent architectures.
With this release, you can evaluate agents built with Reasoning Engine (LangChain on Vertex AI), the managed runtime for your Vertex AI agentic applications. The service also supports agents built with open-source frameworks such as CrewAI, LangChain, and LangGraph, and Google Cloud plans to support upcoming Google Cloud offerings for agent development.
For maximum flexibility, you can evaluate any agent by providing a custom function that takes a prompt and returns a response. To streamline the process, Google Cloud provides native agent inference and automatically logs all results in Vertex AI Experiments.
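As a rough illustration, such a custom function could look like the sketch below. The agent object, the way it is invoked, and the returned field names are assumptions made for this example, not a prescribed interface.

# Hypothetical sketch: wrap your own agent behind a simple callable for evaluation.
def run_my_agent(prompt: str) -> dict:
    """Run a locally defined agent (assumed here) and return its response and trajectory."""
    result = my_agent.invoke({"input": prompt})  # my_agent is your own agent object
    return {
        "response": result["output"],                           # final answer shown to the user
        "predicted_trajectory": result["intermediate_steps"],   # tool calls the agent made
    }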
Evaluating agents in action
Consider the following LangGraph customer support agent. Your goal is to evaluate both the responses the agent generates and the sequence of steps (its “trajectory”) it takes to produce them.

To evaluate an agent with the Vertex AI Gen AI evaluation service, you start by building an evaluation dataset. Ideally, this dataset contains the following components (a sketch of such a dataset follows the list):
User’s prompt
An example of the input that the user gives the agent.
Reference trajectory
The expected sequence of steps the agent should follow to produce the correct answer.
Generated trajectory
The actual sequence of steps the agent took to respond to the user’s query.
Response
The answer the agent produced, given its sequence of actions.
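Here is a minimal, hypothetical example of what such a dataset could look like as a pandas DataFrame. The column names and the tool-call structure are illustrative and should be adapted to the format the service expects for your metrics.

import pandas as pd

# Illustrative evaluation dataset: one row per user prompt.
byod_eval_sample_dataset = pd.DataFrame({
    "prompt": ["Where is my order #12345?"],
    "reference_trajectory": [[{"tool_name": "lookup_order", "tool_input": {"order_id": "12345"}}]],
    "predicted_trajectory": [[{"tool_name": "lookup_order", "tool_input": {"order_id": "12345"}}]],
    "response": ["Your order #12345 is out for delivery and should arrive tomorrow."],
})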
Once you have assembled your evaluation dataset, choose the metrics you want to use to assess the agent. See Evaluate Gen AI agents for the full set of metrics and their definitions. Here are a few metrics you might define:
response_tool_metrics = [
    "trajectory_exact_match", "trajectory_in_order_match", "safety", response_follows_trajectory_metric
]
Notice that you can define response_follows_trajectory_metric as a custom metric to evaluate your agent.
When evaluating AI agents that interact with an environment, standard text-generation metrics such as coherence may not be adequate, because they focus mostly on text structure. Agent responses should instead be evaluated by their effectiveness in the environment. With the Vertex AI Gen AI evaluation service, you can create custom metrics such as response_follows_trajectory_metric to determine whether the agent’s response makes sense given its tool choices. See the official notebook for further details on these metrics.
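As an illustration, a custom metric along these lines might be defined as a model-based pointwise metric. The prompt template below is a simplified sketch that assumes the PointwiseMetric class from the preview SDK; it is not the exact definition used in the official notebook.

from vertexai.preview.evaluation import PointwiseMetric

# Sketch of a model-based custom metric (simplified; see the official notebook for the full version).
response_follows_trajectory_prompt = """Evaluate whether the agent's response is consistent
with the tool calls in its trajectory and logically follows from them.

User prompt: {prompt}
Predicted trajectory: {predicted_trajectory}
Agent response: {response}

Score 1 if the response follows the trajectory, 0 otherwise, and explain briefly."""

response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt,
)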
With your evaluation dataset and metrics defined, you can launch your first agent evaluation job on Vertex AI. The following code example shows how.
# Import libraries
import vertexai
from vertexai.preview.evaluation import EvalTask

# Initialize the Vertex AI session
vertexai.init(project="my-project-id", location="my-location", experiment="evaluate-langgraph-agent")

# Define an EvalTask with your dataset and metrics
response_eval_tool_task = EvalTask(
    dataset=byod_eval_sample_dataset,
    metrics=response_tool_metrics,
)

# Run the evaluation
response_eval_tool_result = response_eval_tool_task.evaluate(
    experiment_run_name="response-over-tools"
)
To begin the evaluation, create an EvalTask with the predefined dataset and metrics, then call its evaluate method to run the evaluation job. The run is tracked as an experiment within Vertex AI Experiments, Vertex AI’s managed experiment tracking service. The evaluation results are displayed both in the notebook and in the Vertex AI Experiments UI. If you’re using Colab Enterprise, the results can also be shown in the Experiment side panel, as seen below.

The Vertex AI Gen AI evaluation service provides comprehensive insights into agent performance through summary and metrics tables. These include individual results for every user input and trajectory pair across all specified metrics, as well as aggregate results.
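For example, once the run completes you can inspect the aggregate scores and the per-row table programmatically. This short snippet assumes the summary_metrics and metrics_table fields exposed by the evaluation result object in the preview SDK.

# Aggregate scores across the whole dataset (e.g., mean trajectory_exact_match).
print(response_eval_tool_result.summary_metrics)

# Per-row results as a pandas DataFrame: one row per prompt/trajectory pair.
metrics_df = response_eval_tool_result.metrics_table
print(metrics_df.head())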
With access to these detailed evaluation results, you can produce insightful visualizations of agent performance, such as bar and radar charts like the one below:

Start now
Unlock the full potential of your agentic applications by exploring the Vertex AI Gen AI evaluation service in public preview.