FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge

AI-generated keywords: LLM-as-a-Judge systems

AI-generated Key Points

- Limitations in LLM-as-a-Judge systems:
  - Limited adaptivity to task- and domain-specific evaluation criteria
  - Biases driven by non-semantic cues such as position, length, format, and model provenance
  - Inconsistent judgments across evaluation modes (e.g., pointwise versus pairwise)
- FairJudge approach:
  - Treats judging behavior as a learnable policy
  - Constructs a high-information-density judging dataset with aligned supervision signals
  - Adopts a curriculum-style training paradigm to enhance rubric adherence, mitigate biases, and ensure consistency across evaluation modes
- Success of FairJudge:
  - Not solely attributable to model scale
  - Improves consistently in the 2B, 4B, and 8B settings, whereas larger models such as Qwen2.5-72B and DeepSeek-V3-671B show varying judging performance across datasets
- Comparative analysis:
  - FairJudge achieves higher scores across various benchmarks than existing judge-oriented models such as PandaLM and JudgeLM
- Features of FairJudge:
  - Maintains competitive accuracy against strong baselines in multimodal evaluations
  - Balances judgment quality, behavioral consistency, and inference efficiency
- Impact of FairJudge:
  - Aims to improve fairness and reproducibility in machine learning model assessment without introducing new content generation capabilities or posing significant societal risks


Authors: Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Xiao Xu, Shijian Li

arXiv: 2602.06625v1 (cs.CL)

License: CC BY 4.0

Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released after acceptance to facilitate future research.
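The cross-mode inconsistency described in the abstract can be made concrete with a small check. The following is a minimal sketch, not the paper's metric: it assumes a judge has already produced a pointwise score for each of two responses and, separately, a pairwise verdict for the same pair, and it measures how often the two modes agree. The function names and the "A" / "B" / "tie" labels are hypothetical.

```python
from typing import Sequence, Tuple


def preference_from_scores(score_a: float, score_b: float) -> str:
    """Return the preference implied by two pointwise scores."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def cross_mode_consistency(
    pointwise: Sequence[Tuple[float, float]],
    pairwise: Sequence[str],
) -> float:
    """Fraction of items where the pairwise verdict matches the
    preference implied by the pointwise scores (hypothetical metric)."""
    if len(pointwise) != len(pairwise):
        raise ValueError("pointwise and pairwise must have the same length")
    if not pointwise:
        raise ValueError("at least one item is required")
    matches = sum(
        preference_from_scores(a, b) == verdict
        for (a, b), verdict in zip(pointwise, pairwise)
    )
    return matches / len(pointwise)


# Illustrative data only: (score for A, score for B) per item,
# and the same judge's pairwise verdicts for those items.
scores = [(8, 6), (5, 7), (4, 4), (9, 3)]
verdicts = ["A", "A", "tie", "A"]
rate = cross_mode_consistency(scores, verdicts)
```

In this toy data the second item is contradictory: the pointwise scores favor B while the pairwise verdict favors A, which is the kind of disagreement the abstract calls evaluation inconsistency.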

Submitted to arXiv on 06 Feb. 2026


Results of the summarizing process for the arXiv paper: 2602.06625v1


Comprehensive Summary

LLM-as-a-Judge systems have significant limitations: limited adaptivity, biases driven by non-semantic cues, and inconsistent evaluations. To address these issues, the paper proposes FairJudge. Unlike prior methods that treat the judge as a static evaluator, FairJudge treats judging behavior as a learnable policy. It constructs a high-information-density judging dataset with aligned supervision signals and adopts a curriculum-style training paradigm to enhance rubric adherence, mitigate biases, and ensure consistency across evaluation modes.

The success of FairJudge cannot be attributed solely to model scale. While larger models such as Qwen2.5-72B and DeepSeek-V3-671B show varying judging performance across datasets, FairJudge improves consistently in the 2B, 4B, and 8B settings. This suggests that explicitly modeling judging behavior matters more than simply increasing parameter count. Comparison with existing judge-oriented models such as PandaLM and JudgeLM shows that FairJudge consistently achieves higher scores across various benchmarks, indicating robust generalization. Ablation studies highlight the importance of consistency-oriented rewards in learning stable judgment behavior. In multimodal evaluations, FairJudge maintains competitive accuracy against strong baselines. The framework balances judgment quality, behavioral consistency, and inference efficiency.

In conclusion, FairJudge presents a unified framework for LLM-as-a-Judge systems that addresses biases and enhances cross-mode consistency through structured data construction and staged training. Experimental results indicate reliable automatic judgments across diverse benchmarks, together with strong multimodal generalization and efficient inference.

The impact statement emphasizes the potential of FairJudge to improve fairness and reproducibility in machine learning model assessment without introducing new content generation capabilities or posing significant societal risks. Overall, the work aims to promote more reliable evaluation practices and responsible use of machine learning systems.
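Position bias, one of the non-semantic cues named in the abstract, is commonly probed by showing the judge the same pair in both orders. The sketch below is an illustration under stated assumptions, not FairJudge's evaluation code: the `Judge` callable signature, the "first" / "second" return labels, and the two toy judges are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (prompt, first_response, second_response)
# -> "first" or "second", naming the slot of the preferred response.
Judge = Callable[[str, str, str], str]


def position_consistency(
    judge: Judge, items: Sequence[Tuple[str, str, str]]
) -> float:
    """Fraction of (prompt, a, b) items where the judge picks the same
    response regardless of the order in which the pair is shown."""
    if not items:
        raise ValueError("at least one item is required")
    consistent = 0
    for prompt, a, b in items:
        verdict_ab = judge(prompt, a, b)  # a shown first
        verdict_ba = judge(prompt, b, a)  # b shown first
        winner_ab = a if verdict_ab == "first" else b
        winner_ba = b if verdict_ba == "first" else a
        consistent += winner_ab == winner_ba
    return consistent / len(items)


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure length bias (ties go to the first slot)."""
    return "first" if len(first) >= len(second) else "second"


items = [
    ("q1", "short", "a much longer answer"),
    ("q2", "xx", "y"),
]
print(position_consistency(always_first, items))
print(position_consistency(longer_wins, items))
```

Note that the length-biased toy judge is perfectly order-consistent on these items, so a swap test alone detects position bias but not the other cues (length, format, provenance); those would need their own perturbation tests.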


Layman's Summary

Summary
- Some computer systems that act as judges have limitations, such as not adapting well to new tasks and being swayed by signals that have nothing to do with meaning.
- FairJudge is a new approach that treats judging behavior as something that can be learned. It builds an information-rich dataset for judging and uses a staged training method to improve fairness and consistency in evaluations.
- FairJudge improves consistently at small model sizes (2B, 4B, and 8B), while much bigger models like Qwen2.5-72B and DeepSeek-V3-671B perform unevenly from one dataset to another.
- When compared to other judge-oriented models, FairJudge performs better on different tests.
- FairJudge can evaluate inputs that mix several types of information, keeps pace with strong existing systems, and balances judgment quality, consistent behavior, and efficient decision-making.

Definitions
- Limitations: Things that hold back or restrict what something can do.
- Adaptivity: The ability to change or adjust based on the situation.
- Biases: Unfair preferences or opinions that affect decisions.
- Inconsistencies: Differences or variations that make things not always the same.
- Supervision signals: Guidance or instructions given during training to help a model learn better.

Blog article

Introduction
Large language models (LLMs) are now widely used across natural language processing tasks, including text generation and classification. As these models are integrated into real-world applications, concerns have been raised about their fairness and reliability. In particular, LLMs used as judges to evaluate other models' outputs may suffer from limited adaptivity, biases driven by non-semantic cues, and inconsistent evaluations. To address these issues, the authors propose FairJudge. This article outlines the FairJudge framework and its reported effectiveness in improving fairness and consistency in LLM-as-a-Judge systems.

Overview of FairJudge
Prior approaches treat the judge as a static evaluator. Such a judge can be swayed by non-semantic cues such as the position, length, or format of a response, or the model that produced it. In contrast, FairJudge treats judging behavior as a learnable and regularized policy that can be trained to improve rubric adherence and mitigate biases.

The key idea behind FairJudge is to construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, the authors adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency while avoiding catastrophic forgetting.

Experimental Results
The researchers evaluated FairJudge at three model scales (2B, 4B, and 8B) on multiple internal and public benchmarks. They compared it with existing judge-oriented models such as PandaLM and JudgeLM, and with much larger instruction-tuned LLMs such as Qwen2.5-72B and DeepSeek-V3-671B. FairJudge consistently scored higher than the judge-oriented baselines, and while the larger models showed varying judging performance across datasets, FairJudge improved consistently at all three scales. This suggests that modeling judging behavior explicitly is more crucial than simply increasing parameter count when it comes to improving fairness and consistency in LLM-as-a-Judge systems.

Ablation studies were also conducted to analyze the impact of different components of FairJudge. The results showed that consistency-oriented rewards played a crucial role in learning stable judgment behavior. In multimodal evaluations, FairJudge achieved competitive accuracy against strong baselines while maintaining efficient inference. This highlights the framework's ability to balance judgment quality, behavioral consistency, and inference efficiency.

Impact Statement
The researchers emphasize the potential of FairJudge to promote more reliable evaluation practices. By addressing biases and enhancing cross-mode consistency through structured data construction and staged training, FairJudge can improve fairness and reproducibility in LLM-as-a-Judge systems without introducing new content generation capabilities or posing significant societal risks.

Conclusion
FairJudge presents a unified framework for LLM-as-a-Judge systems that addresses biases and enhances cross-mode consistency through structured data construction and staged training. Experimental results indicate reliable automatic judgments across diverse benchmarks, together with strong multimodal generalization and efficient inference.
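The abstract reports results in terms of agreement and F1. As a rough illustration of how such numbers are typically computed for a judge, the sketch below compares judge verdicts with reference labels. It assumes categorical verdicts and macro-averaged F1; the source does not state which F1 averaging the paper uses, and the function names and data are hypothetical.

```python
from typing import Sequence


def agreement(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of items where the judge verdict equals the reference."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 (macro averaging is an assumption)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only, not results from the paper.
reference = ["A", "B", "A", "tie"]
judge_out = ["A", "B", "B", "tie"]
print(agreement(reference, judge_out), macro_f1(reference, judge_out))
```

Agreement treats every item equally, while macro F1 weights each verdict class equally, so the two can diverge when one class (for example, ties) is rare.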

Created on 26 Mar. 2026

