Friday, March 28, 2025

Vertex AI and LLM Comparator For Gen AI Model Evaluation

Discover how Vertex AI Evaluation Service and LLM Comparator help you assess and compare generative AI models for performance, accuracy, and efficiency.

Pairwise model evaluation to assess performance 

In pairwise model evaluation, the relative performance of two models on a given task is assessed by comparing their outputs directly against one another; a minimal sketch of the idea follows the list below. Pairwise evaluation of LLMs has three primary advantages:

Make informed decisions

With the number and variety of LLMs growing rapidly, you must carefully choose the model best suited to your task, weighing each option’s strengths and weaknesses.

Define “better” quantitatively

Natural-language text and images produced by generative AI models are typically unstructured, long, and challenging to assess automatically without human assistance. Combined with human review, pairwise evaluation helps define a “better” response as the one that comes closest to a human answer to the same prompt.

Monitor continuously

LLMs should be retrained and tuned regularly with fresh data so that they improve on their earlier iterations and keep pace with other state-of-the-art models.
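To make the idea concrete, here is a minimal Python sketch that tallies win rates from a set of pairwise judge verdicts; the verdict data and the “A”/“B”/“TIE” labels are purely illustrative.

```python
from collections import Counter

# Hypothetical judge verdicts for the same prompts sent to model A and model B.
# Each entry is "A", "B", or "TIE", as a pairwise judge might return.
verdicts = ["A", "B", "A", "TIE", "A", "B", "A", "A", "TIE", "B"]

counts = Counter(verdicts)
total = len(verdicts)

# Win rate = fraction of prompts where a model's response was preferred.
win_rate_a = counts["A"] / total
win_rate_b = counts["B"] / total
tie_rate = counts["TIE"] / total

print(f"Model A win rate: {win_rate_a:.0%}")
print(f"Model B win rate: {win_rate_b:.0%}")
print(f"Ties:             {tie_rate:.0%}")
```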

The proposed evaluation process for LLMs
Image credit to Google Cloud

Vertex AI evaluation service 

Any generative model or application can be evaluated with Vertex AI’s Gen AI evaluation service, which lets you define your own evaluation criteria and check the results against your own judgement. It helps with:

  • Choosing among multiple models for a particular use case
  • Optimising model configuration with different model parameters
  • Prompt engineering to achieve the desired responses and behaviour
  • Fine-tuning LLMs to improve safety, fairness, and correctness
  • Optimising RAG architectures
  • Migrating between versions of a model
  • Controlling translation quality across languages
  • Assessing agents
  • Analysing images and videos

Additionally, it enables computation-based metrics using ground-truth datasets of input and output pairs, as well as model-based metrics for both pointwise and pairwise evaluations.

How to use Vertex AI evaluation service

The Vertex AI evaluation service helps you evaluate your generative AI models thoroughly. Using pre-built templates or your own expertise, you can create custom metrics that accurately gauge performance against your predetermined objectives. For common NLP tasks, the service offers computation-based metrics such as ROUGE-L for summarisation, BLEU for translation, and F1 scores for classification.
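As a rough sketch of how this can look with the Vertex AI Python SDK: the vertexai.evaluation module, the metric names, and the dataset columns below follow my understanding of current SDK versions and may differ in yours; the project ID and data are made up.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # assumed project ID

# A tiny ground-truth dataset: model responses paired with reference answers.
eval_dataset = pd.DataFrame(
    {
        "response": ["The cat sat on the mat.", "Paris is the capital of France."],
        "reference": ["A cat sat on a mat.", "The capital of France is Paris."],
    }
)

# Computation-based metrics such as ROUGE-L and BLEU score responses against
# the references without calling a judge model.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum", "bleu", "exact_match"],
)

result = eval_task.evaluate()
print(result.summary_metrics)
```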

For direct model comparison, pairwise evaluations let you measure which model performs better. Judge models provide justifications for their scoring choices, and metrics such as candidate_model_win_rate and baseline_model_win_rate are computed automatically. Pairwise comparisons against ground-truth data can also be made with computation-based metrics.
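A hedged sketch of a pairwise comparison with a judge model might look like the following; the PairwiseMetric class, the prompt-template helper, and the column names reflect my reading of the vertexai.evaluation API and should be adjusted to your SDK version, and the prompts and responses are invented.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PairwiseMetric

vertexai.init(project="my-project", location="us-central1")  # assumed project ID

# Responses from a baseline model and a candidate model for the same prompts.
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Summarise this article ...", "Summarise this meeting transcript ..."],
        "baseline_model_response": ["Summary from the baseline ...", "Summary from the baseline ..."],
        "response": ["Summary from the candidate ...", "Summary from the candidate ..."],
    }
)

# A pre-built pairwise metric: a judge model picks the better response and
# explains its choice.
pairwise_quality = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
        "pairwise_summarization_quality"
    ),
)

result = EvalTask(dataset=eval_dataset, metrics=[pairwise_quality]).evaluate()

# Win rates such as candidate_model_win_rate and baseline_model_win_rate
# appear in the summary metrics.
print(result.summary_metrics)
```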

In addition to the pre-built metrics, you can define your own, either with mathematical formulas or with prompts that guide judge models aligned with the context of the metrics you have defined. Semantic similarity can also be assessed with embedding-based metrics.
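For example, a custom model-based metric can be defined by supplying your own judge prompt. The sketch below assumes the PointwiseMetric class of vertexai.evaluation accepts a free-form prompt template with {prompt} and {response} placeholders; the rubric and data are purely illustrative.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric

vertexai.init(project="my-project", location="us-central1")  # assumed project ID

# A custom model-based metric: the prompt template tells the judge model
# exactly what to score (this rubric is made up for illustration).
conciseness = PointwiseMetric(
    metric="concise_and_grounded",
    metric_prompt_template=(
        "Score the RESPONSE from 1 to 5 for conciseness and for sticking to "
        "facts stated in the PROMPT. Explain your score in one sentence.\n\n"
        "PROMPT: {prompt}\nRESPONSE: {response}"
    ),
)

eval_dataset = pd.DataFrame(
    {
        "prompt": ["List three uses of Vertex AI."],
        "response": ["Model training, tuning, and evaluation."],
    }
)

print(EvalTask(dataset=eval_dataset, metrics=[conciseness]).evaluate().summary_metrics)
```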

Through a smooth integration with the evaluation service, Vertex AI Experiments and Metadata organise and track your datasets, results, and models. Using the Python SDK or REST API, you can quickly start evaluation jobs and export the findings to Cloud Storage for further analysis and visualisation.
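Continuing from the sketches above, exporting results might look like this; it assumes the EvalResult object exposes a pandas metrics_table, as in current SDK versions, and the bucket name and path are placeholders.

```python
from google.cloud import storage


def export_eval_result(result, bucket_name: str, blob_path: str) -> None:
    """Save an EvalResult's per-example metrics table and upload it to Cloud Storage."""
    # metrics_table is a pandas DataFrame with one row per evaluated example.
    local_path = "eval_results.csv"
    result.metrics_table.to_csv(local_path, index=False)

    # Upload the findings for further analysis and visualisation.
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    blob.upload_from_filename(local_path)


# e.g. export_eval_result(result, "my-eval-bucket", "experiments/run-001/eval_results.csv")
```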

In essence, the Vertex AI evaluation service provides a comprehensive framework for:

  • Measuring model performance: with both custom and standard metrics.
  • Comparing models directly: through pairwise analyses of model results.
  • Adapting assessments: to your specific requirements.
  • Simplifying your workflow: with straightforward API access and integrated tracking.

Additionally, it offers templates and guidance to help you define your own metrics, either from scratch or starting from those templates, based on your experience with generative AI and prompt engineering.

LLM Comparator: An open-source tool for human-in-the-loop LLM evaluation

LLM Comparator is an evaluation tool created by PAIR (People + AI Research) at Google and is currently a research project.

With its very user-friendly interface for side-by-side comparisons of model outputs, LLM Comparator is a great tool for supplementing automatic LLM evaluation with human-in-the-loop procedures. It offers helpful features for comparing the responses of two LLMs side by side across a variety of useful measures, including each model’s win rate per prompt category, and a feature called Custom Functions makes it easy to add user-defined metrics to the tool.

The ‘Score Distribution’ and ‘Metrics by Prompt Category’ visualisations let you compare the performance of Model A and Model B across a number of metrics and prompt categories. In addition, by graphically summarising the main arguments behind the evaluation findings, the “Rationale Summary” visualisation sheds light on why one model performs better than another.

The “Rationale Summary” panel visually explains why one model’s responses are determined to be better
Image credit to Google Cloud

LLM Comparator is available as a Python package on PyPI and can be installed locally. Using the supplied libraries, pairwise evaluation results from the Vertex AI Evaluation Service can also be imported into LLM Comparator.

Thanks to features such as the Rationale Cluster visualisation and Custom Functions, LLM Comparator can be a very useful tool in the final stages of LLM evaluation, when human-in-the-loop procedures are required to guarantee overall quality.

Feedback from the field: How LLM Comparator adds value to Vertex AI evaluation service

By giving human assessors readily usable, simple visualisations and automatically derived performance measures, LLM Comparator spares ML developers much of the work of building their own visualisation and quality-monitoring tools. Because of LLM Comparator’s JSON data format and schema, the Vertex AI evaluation service and LLM Comparator can be integrated without a significant amount of programming work.
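For illustration, the following hedged sketch converts a simplified pairwise results table into the JSON layout LLM Comparator reads; the field names and the score convention (positive favouring model A, negative favouring model B) reflect my reading of the tool’s schema and should be checked against its documentation, and the table itself is made up.

```python
import json

import pandas as pd

# A toy pairwise results table in roughly the shape produced by a Vertex AI
# pairwise evaluation (column names simplified for illustration).
results = pd.DataFrame(
    {
        "prompt": ["Summarise this article ...", "Summarise this transcript ..."],
        "baseline_model_response": ["Baseline summary ...", "Baseline summary ..."],
        "response": ["Candidate summary ...", "Candidate summary ..."],
        "pairwise_choice": ["CANDIDATE", "BASELINE"],
    }
)

# Assumed LLM Comparator input schema: a list of the two models plus one
# example per prompt, with a score favouring model A (positive) or B (negative).
comparator_data = {
    "models": [{"name": "baseline-model"}, {"name": "candidate-model"}],
    "examples": [
        {
            "input_text": row["prompt"],
            "tags": [],
            "output_text_a": row["baseline_model_response"],
            "output_text_b": row["response"],
            "score": -1.0 if row["pairwise_choice"] == "CANDIDATE" else 1.0,
        }
        for _, row in results.iterrows()
    ],
}

with open("comparator_input.json", "w") as f:
    json.dump(comparator_data, f, indent=2)
```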

IT teams have told us that the “Rationale Summary” visualisation is the most helpful aspect of LLM Comparator. It acts as a kind of explainable AI (XAI) tool, highly useful for determining why, in the judge model’s opinion, one particular model is superior to the other. The “Rationale Summary” visualisation also helps you understand how one language model differs from another, which can be a crucial aid in determining why a particular model is better suited to a given task.

One drawback of LLM Comparator is that it only supports pairwise evaluation; it cannot evaluate more than two models at once. Nevertheless, since LLM Comparator already includes the fundamental building blocks for comparative LLM evaluation, adding support for simultaneous multi-model evaluation might not present many technical challenges. Contributing to the LLM Comparator project could be a great opportunity for you.

Conclusion 

This post explained how to structure the LLM evaluation process using Vertex AI and LLM Comparator, an open-source LLM evaluation tool developed by PAIR. By combining the Vertex AI Evaluation Service with LLM Comparator, Google Cloud offers a semi-automated method for systematically assessing and comparing the performance of several LLMs on Google Cloud.
