FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge

AI-generated keywords: LLM-as-a-Judge systems

AI-generated Key Points

  • Limitations in LLM-as-a-Judge systems:
      • Limited adaptivity to task- and domain-specific evaluation criteria
      • Systematic biases driven by non-semantic cues (position, length, format, model provenance)
      • Inconsistent judgments across evaluation modes (e.g., pointwise versus pairwise)
  • FairJudge approach:
      • Treats judging behavior as a learnable, regularized policy
      • Constructs a high-information-density judging dataset with aligned supervision signals
      • Adopts a curriculum-style SFT-DPO-GRPO training paradigm to improve rubric adherence, mitigate biases, and keep judgments consistent across evaluation modes
  • Success of FairJudge:
      • Not attributable to model scale alone
      • Improves consistently in the 2B, 4B, and 8B settings, whereas larger models such as Qwen2.5-72B and DeepSeek-V3-671B show varying judging performance across datasets
  • Comparative analysis:
      • FairJudge scores higher across benchmarks than existing judge-oriented models such as PandaLM and JudgeLM
  • Features of FairJudge:
      • Maintains competitive accuracy against strong baselines in multimodal evaluations
      • Balances judgment quality, behavioral consistency, and inference efficiency
  • Impact of FairJudge:
      • Improves fairness and reproducibility in machine learning model assessment without introducing new content generation capabilities or posing significant societal risks

Authors: Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Xiao Xu, Shijian Li

License: CC BY 4.0

Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released after acceptance to facilitate future research.
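
To make the position bias named in the abstract concrete, here is a minimal sketch of one common way to probe it: ask a pairwise judge the same question twice with the answer order swapped and count how often the preferred answer changes. This is an illustration only, not the paper's evaluation protocol. The judge signature, the "A"/"B" verdict labels, and the names `position_flip_rate`, `always_first`, and `longer_wins` are all assumptions introduced here.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# (prompt, answer_in_slot_A, answer_in_slot_B) -> "A" or "B"
PairwiseJudge = Callable[[str, str, str], str]


def position_flip_rate(
    judge: PairwiseJudge, items: Sequence[Tuple[str, str, str]]
) -> float:
    """Fraction of items whose preferred answer changes when the order is swapped."""
    if not items:
        raise ValueError("items must not be empty")
    flips = 0
    for prompt, ans1, ans2 in items:
        first = judge(prompt, ans1, ans2)   # ans1 shown in slot A
        second = judge(prompt, ans2, ans1)  # ans1 shown in slot B
        # Map each slot verdict back to the underlying answer text.
        winner_first = ans1 if first == "A" else ans2
        winner_second = ans2 if second == "A" else ans1
        if winner_first != winner_second:
            flips += 1
    return flips / len(items)


# Two toy judges, for illustration only.
def always_first(prompt: str, a: str, b: str) -> str:
    return "A"  # purely position-driven


def longer_wins(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"  # length-driven, not position-driven


toy_items = [
    ("q1", "short", "a much longer answer"),
    ("q2", "detailed reply here", "ok"),
]
```

On this toy data the position-driven judge flips on every item, while the length-driven judge never flips even though it is clearly biased. A swap test isolates position bias only; length, format, and provenance biases need their own probes.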

Submitted to arXiv on 06 Feb. 2026


Results of the summarizing process for the arXiv paper: 2602.06625v1

LLM-as-a-Judge systems have three significant limitations: limited adaptivity, biases driven by non-semantic cues, and inconsistent evaluations. FairJudge is proposed to address them. Unlike prior methods that treat the judge as a static evaluator, FairJudge treats judging behavior as a learnable policy. It constructs a high-information-density judging dataset with aligned supervision signals and adopts a curriculum-style training paradigm to improve rubric adherence, mitigate biases, and keep judgments consistent across evaluation modes.

The gains cannot be attributed to model scale alone. Larger models such as Qwen2.5-72B and DeepSeek-V3-671B show varying judging performance across datasets, whereas FairJudge improves consistently in the 2B, 4B, and 8B settings. This suggests that modeling judging behavior explicitly matters more than increasing parameter count. Compared with existing judge-oriented models such as PandaLM and JudgeLM, FairJudge achieves higher scores across the benchmarks, which indicates robust generalization. Ablation studies highlight the importance of consistency-oriented rewards for learning stable judgment behavior. FairJudge also maintains competitive accuracy against strong baselines in multimodal evaluations, and the framework balances judgment quality, behavioral consistency, and inference efficiency.

In conclusion, FairJudge is a unified framework for LLM-as-a-Judge systems that addresses biases and improves cross-mode consistency through structured data construction and staged training. Experimental results show reliable automatic judgments across diverse benchmarks, together with strong multimodal generalization and efficient inference.

The impact statement emphasizes the potential of FairJudge to improve fairness and reproducibility in machine learning model assessment without introducing new content generation capabilities or posing significant societal risks. Overall, the work supports more reliable and responsible evaluation practices for machine learning systems.
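
The cross-mode consistency discussed in the summary can be illustrated with a small check: score each answer pointwise, derive the preference those scores imply, and compare it with the pairwise verdict on the same pair. This is a sketch under stated assumptions, not the paper's metric or reward. The two judge interfaces, the "A"/"B" labels, and the names `cross_mode_agreement`, `word_count_score`, `more_words_wins`, and `always_a` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
PointwiseJudge = Callable[[str, str], float]     # (prompt, answer) -> score
PairwiseJudge = Callable[[str, str, str], str]   # (prompt, a, b) -> "A" or "B"


def cross_mode_agreement(
    pointwise: PointwiseJudge,
    pairwise: PairwiseJudge,
    items: Sequence[Tuple[str, str, str]],
) -> float:
    """Share of pairs where the pairwise verdict matches the pointwise ranking."""
    agree = 0
    counted = 0
    for prompt, a, b in items:
        score_a = pointwise(prompt, a)
        score_b = pointwise(prompt, b)
        if score_a == score_b:
            continue  # a pointwise tie implies no preference, so skip the pair
        implied = "A" if score_a > score_b else "B"
        counted += 1
        if pairwise(prompt, a, b) == implied:
            agree += 1
    if counted == 0:
        raise ValueError("no pairs with a strict pointwise preference")
    return agree / counted


# Toy judges, for illustration only.
def word_count_score(prompt: str, answer: str) -> float:
    return float(len(answer.split()))


def more_words_wins(prompt: str, a: str, b: str) -> str:
    return "A" if len(a.split()) >= len(b.split()) else "B"


def always_a(prompt: str, a: str, b: str) -> str:
    return "A"


toy_pairs = [
    ("q1", "one two three", "one"),
    ("q2", "one", "one two"),
]
```

A pairwise judge that ranks by the same criterion as the pointwise scorer agrees on every toy pair, while the always-"A" judge agrees on only half. Skipping pointwise ties is a design choice made here for simplicity; a stricter check could count ties as disagreements whenever the pairwise judge still expresses a preference.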
Created on 26 Mar. 2026

