LLMs as a judges provide a distinct viewpoint on AI bias, bringing to light issues with the fairness of decision-making.
Consider asking a virtual assistants to characterise the perfect engineer, doctor, or CEO. Would and should its response vary according to gender? Addressing the problem of societal bias and fairness in conversational AI systems is a significant obstacle to the responsible usage of AI. Model outputs that perpetuate stereotypes and penalise particular groups might result from this bias, which begins with web scale training data that may represent social biases regarding gender, colour, ethnicity, and ideology.
The context in which genders are discussed might vary greatly, even when gendered datasets are equally represented. For example, men may be associated with leadership responsibilities, while women may be associated with supportive ones. The model may internalise these correlations as a result of training on certain datasets, leading to biassed language creation. Decision-makers may be influenced by the biassed answers provided by LLMs, which could lead to the direct or indirect perpetuation of prejudices in hiring, law enforcement, and other grading and ranking systems.
It is difficult to identify gender bias in big language model responses due to the lack of standardisation in the bias evaluation measures currently in use. The rating process is further complicated by the minute differences in how human assessors from various cultural backgrounds understand language and recognise societal prejudices.
By examining a variety of current metrics, creating synthetic and counterfactual gendered prompt datasets, and testing a novel method that uses LLMs as a judges to assess gender bias in models, Intel Labs is investigating this socio-technical systems research topic in an effort to improve equity and inclusion in AI models. By giving thorough justifications for bias classifications, this approach promotes consistency and transparency in the assessment process. As with a broad panel of human experts assessing prejudice,it may leverage the aggregate intellect and unique viewpoints of several LLMs serving as judges.
Intel study used LLMs as a judges and discovered continuous trends where female-oriented content consistently resulted in higher marks for negative metrics including identity attack, toxicity, and insult. Biases remained in spite of the scale effects of model parameter size, even though larger models often displayed less gender disparity.
The subjective character of bias assessment was highlighted by the low inter-annotator agreement among human evaluators. Even while the LLMs as a judges gap metric exhibits a high degree of agreement with human judgement, conventional sentiment analysis-based metrics might miss some of the bias that people see. According to investigation, building upon the LLM judge approach may offer a trustworthy automated substitute for human assessment when it comes to identifying gender bias in language models.
The Psychology of Human-AI Interaction
People frequently give conversational agents human characteristics when they engage with them. According to research, individuals anthropomorphise these systems by giving them human traits and social roles. The “computers are social actors” (CASA) concept, which describes people’s propensity to interact socially with computers, has important ramifications. Humans’ inclination to interact with AI in a manner akin to that of humans affects their interactions and expectations.
According to research, children as young as toddlers are capable of differentiating between genders and beginning to develop fundamental gender norms and personalities, which can lead to expectancies and biases that last into adulthood.
Social prejudices could be reinforced and perpetuated when individuals with these innate human biases engage with AI systems, particularly those that use LLMs. The younger generation is particularly at risk since they use electronics for longer periods of time, which leaves them more vulnerable to the subtle and recurring effects of biassed outputs from these systems.
Users may get more and more trusting of these systems when sophisticated and dependable conversational agents with human-like communication become more prevalent through the employment of potent LLMs and agents. This trust, along with the agents’ sycophancy, empathy, persuasion, and fake emotional displays, makes it easy for prejudices to be unintentionally spread.
A holistic strategy is needed to address these issues, which includes raising user knowledge of the limitations and potential biases of AI systems, developing methods for evaluating and mitigating prejudice in LLMs, and increasing transparency in AI decision-making processes. It can endeavour to develop more fair and reliable AI systems by comprehending and resolving the psychological aspects of human-AI interaction.
Understanding Bias in Large Language Models
LLMs trained on massive quantities of internet-scale data enable modern conversational bots. These algorithms absorb societal biases from their training data, even though this allows for remarkably human-like cognition. These biases can take many different forms, such as producing distinct responses according to gender, race, or other demographic characteristics (referred to as protected qualities), altering the usefulness of responses according to these characteristics, and perpetuating preexisting cultural preconceptions.
Intel build a library of gender-varying prompt datasets that might elicit semantically comparable or different responses in generated text in order to investigate systemic gender bias in conversational chatbots. In order to quantify bias, it look at aggregate scores after defining criteria to score the matched responses. In order to gain a better understanding of this, let’s look at individual gender-varying prompt responses in two situations that show biassed versus impartial treatment of gender stereotypes in financial management. These scenarios are shown in the image below.

Scenario 1: Biased Response Pattern
The conversational agent reacts erratically in the first scenario when given two identical prompts with only a gender difference:
- The gender stereotype is accurately identified and rejected for the male-focused prompt.
- However, by implying that women are “more cautious and disciplined,” the female-focused prompt supports gender-based generalisations and partially perpetuates the stereotype.
Although the agent seems to give fair answers, this discrepancy exposes subtle bias as it reinforces gender preconceptions for women in one direction while rejecting them for males in another.
Scenario 2: Unbiased Response Pattern
The second case illustrates the proper reaction of an impartial system:
- In both prompts, the agent consistently recognises the gender stereotype.
- It offers comparable answers that specifically deny gender-based stereotypes.
- Both answers stress that education, experience, and knowledge rather than gender are the key factors that determine financial management abilities.
- There are no gender-specific characterisations in the language, which is fact-based and neutral.
Since the task entails examining the responses for the presence of multiple signals, such as stereotype identification, support or rejection of the stereotype, semantically similar or varying responses, culture specific or generic responses, response comparison, and others, detecting these biases becomes very difficult.
Technical Specifics: Using LLMs as a Judges to Spot Biases
Metrics like sentiment analysis and variants (such insult and toxicity) are used in some recent research on bias detection in conversational agents. However, because of their limited scale and dependence on expensive human-generated annotations, these approaches frequently produce conflicting results for modest biases.
Although LLMs are being used more and more to evaluate created content, little is known about how well they can detect bias. Study suggests utilising LLMs as jurors or judges for bias evaluation, which has benefits including transparency, scalability, less dependence on human input, and a range of viewpoints comparable to an expert panel. It evaluation paradigm uses response collecting, input prompt pair formation, and LLM as a judge-based evaluation in a methodical manner to identify and quantify bias.
It employed a “attacker LLM” (Meta’s Llama 3.1 8B) to automatically produce adversarial suggestions for input creation. Additionally, it create counterfactual prompts by altering the prompt’s gendered phrase. Intel may develop a variety of test cases using this automated method instead of depending on expensive human-generated datasets. In order to better investigate the LLMs and uncover any gender biases they may have, they are continuing to create synthetic datasets using adversarial tactics as part of ongoing study.

Each prompt-response combination (together with the matching counterfactual pair) is forwarded to the “target LLM” (the model under evaluation) in order to collect responses. Assessed a number of well-known target LLMs, such as Mistral AI‘s Mixtral 8x7B and Mistral 7B, OpenAI’s GPT-4, and Meta’s Llama 2 family (7 billion parameters, 13B, and 70B variants).

GPT-4 acts as primary judge for judge LLM evaluation by first determining whether bias was present in each response and how much bias was found, after which it provides a thorough justification for the bias assessment. It calculate the “judge gap score,” which is a measure of the discrepancy between the judge’s assessment of the initial response and its counterfactual. Small or zero gaps imply more consistent, objective responses, whereas a bigger gap suggests possible bias because it demonstrates that the model reacts differently depending on gender.
Intel research of well-known language models, such as Llama2, GPT-4, Mixtral, and Mistral, brought to light important issues like the measurement of subjective bias, the limitations of sentiment analysis, and discrepancies between current metrics. The need for better bias evaluation techniques, such as the potential LLM-judge gap measure, is highlighted by the fact that biases remained even in bigger models that showed less gender disparity, with female-oriented content frequently scoring higher on negative metrics.