The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring paradigm across a wide spectrum of visual generation tasks.
Our analysis reveals that this paradigm suffers from stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that adopts a pairwise comparison protocol to ensure stable, human-aligned evaluation.
Crucially, our experiments uncover a striking finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models as judges. Our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods.
Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
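For readers who want to reproduce this kind of rank-agreement check, the snippet below is a minimal sketch of computing a Spearman correlation between two leaderboards with `scipy.stats.spearmanr`; the model names and Elo values are made up for illustration and are not GenArena or LMArena data.

```python
from scipy.stats import spearmanr

# Hypothetical Elo scores for the same five models under two ranking sources
# (illustrative numbers only, not actual GenArena or LMArena data).
genarena_elo = {"model_a": 1210, "model_b": 1105, "model_c": 1050, "model_d": 980, "model_e": 890}
lmarena_elo  = {"model_a": 1190, "model_b": 1120, "model_c": 1010, "model_d": 995, "model_e": 905}

models = sorted(genarena_elo)                  # fix a common model order
ours = [genarena_elo[m] for m in models]
reference = [lmarena_elo[m] for m in models]

rho, p_value = spearmanr(ours, reference)      # rank correlation between the two leaderboards
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```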
Elo-based rankings of state-of-the-art models across Basic, Reasoning, and Multi-Reference editing tasks.
Snapshot from January 2026. For the latest rankings, visit the live leaderboard.
| Models | Basic Elo | Basic Rank | Reasoning Elo | Reasoning Rank | MultiRef Elo | MultiRef Rank |
|---|---|---|---|---|---|---|
| GPT Image 1.5 [High] | 1162 | 🥇 | 1204 | 🥇 | 1259 | 🥇 |
| Qwen-Image-Edit-2511 | 1065 | 🥈 | 1005 | #4 | 793 | #7 |
| Nano Banana | 1056 | 🥉 | 1130 | 🥈 | 1048 | 🥉 |
| Flux.2 [klein] 9B | 1046 | #4 | 962 | #6 | 1018 | #4 |
| LongCat-Image-Edit | 1037 | #5 | 944 | #8 | -- | -- |
| Qwen-Image-Edit-2509 | 1020 | #6 | 962 | #7 | 705 | #10 |
| GPT Image 1 [High] | 1004 | #7 | 1095 | 🥉 | 1066 | 🥈 |
| Flux.2 [dev] | 997 | #8 | 968 | #5 | 948 | #6 |
| Flux.2 [klein] 4B | 987 | #9 | 928 | #9 | 967 | #5 |
| Qwen-Image-Edit | 979 | #10 | 920 | #10 | -- | -- |
| Flux.1 Kontext [dev] | 860 | #11 | 849 | #12 | 713 | #9 |
| Bagel | 773 | #12 | 823 | #13 | 649 | #11 |
| Step1X-Edit | 739 | #13 | 744 | #14 | -- | -- |
| DreamOmni2 | 718 | #14 | 858 | #11 | 777 | #8 |
Rankings are established via pairwise battles judged by Qwen3-VL-32B Instruct (FP8), with battle outcomes aggregated into Elo scores.
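As a rough illustration of how such Elo-scale scores can be obtained from battle records (stage 3 of the pipeline described below), here is a minimal Bradley-Terry fit using the classic minorization-maximization updates; the battle log and model names are fabricated for the example, and this is not the exact GenArena fitting code.

```python
import math
from collections import defaultdict

# Hypothetical battle log (model_a, model_b, winner) over three made-up models;
# illustrative only, not actual GenArena judgments.
battles = [
    ("gpt-image", "qwen-edit", "gpt-image"),
    ("qwen-edit", "gpt-image", "qwen-edit"),
    ("gpt-image", "flux-dev", "gpt-image"),
    ("flux-dev", "qwen-edit", "flux-dev"),
    ("qwen-edit", "flux-dev", "qwen-edit"),
    ("flux-dev", "gpt-image", "gpt-image"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
wins = defaultdict(float)        # total wins per model
pair_games = defaultdict(float)  # games played per unordered model pair
for a, b, winner in battles:
    wins[winner] += 1
    pair_games[frozenset((a, b))] += 1

# Bradley-Terry strengths via minorization-maximization iterations.
strength = {m: 1.0 for m in models}
for _ in range(200):
    updated = {}
    for i in models:
        denom = 0.0
        for pair, n_games in pair_games.items():
            if i in pair:
                (j,) = pair - {i}  # the opponent in this pairing
                denom += n_games / (strength[i] + strength[j])
        updated[i] = wins[i] / denom
    total = sum(updated.values())
    strength = {m: s / total for m, s in updated.items()}

# Map strengths onto an Elo-like scale, anchored so the mean rating is 1000.
raw = {m: 400 * math.log10(strength[m]) for m in models}
shift = 1000 - sum(raw.values()) / len(raw)
for m, rating in sorted(raw.items(), key=lambda kv: -kv[1]):
    print(f"{m:>10s}  Elo ~ {rating + shift:6.1f}")
```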
GenArena orchestrates a scalable tournament among generative models in three stages:
1. Curate diverse instructions and sample outputs from the candidate models, which enter the tournament as competitors.
2. Have VLM judges evaluate each model pair with a bi-directional consistency check and a forced-choice mechanism (a sketch of this judging loop follows the list).
3. Transform the pairwise outcomes into a continuous leaderboard using the Bradley-Terry statistical model.
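A minimal sketch of the stage-2 judging loop is shown below. The `judge` callable is a hypothetical stand-in for the actual VLM request (e.g., a prompt asking the judge to pick the better of two images with no tie option); the function only illustrates the bi-directional consistency check, which keeps a battle only when both presentation orders agree.

```python
import random
from typing import Callable, Optional

# A judge takes (instruction, first_image, second_image) and must answer
# "first" or "second" (forced choice, no ties). This signature is an assumption
# for illustration, not GenArena's actual API.
Judge = Callable[[str, str, str], str]

def pairwise_verdict(judge: Judge, instruction: str, image_a: str, image_b: str) -> Optional[str]:
    """Bi-directional consistency check over a forced-choice judge.

    The pair is judged twice, once per presentation order; the battle counts
    only if the two verdicts agree, otherwise it is discarded as order-biased.
    """
    forward = judge(instruction, image_a, image_b)    # image A presented first
    backward = judge(instruction, image_b, image_a)   # image B presented first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return None  # inconsistent across orders: drop (or re-sample) this battle

# Dummy judge for demonstration only: flips a coin instead of calling a real VLM.
def coin_flip_judge(instruction: str, first_image: str, second_image: str) -> str:
    return random.choice(["first", "second"])

print(pairwise_verdict(coin_flip_judge, "Make the sky a sunset orange.", "a.png", "b.png"))
```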
@misc{li2026genarenaachievehumanalignedevaluation,
title={GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?},
author={Ruihang Li and Leigang Qu and Jingxu Zhang and Dongnan Gui and Mengde Xu and Xiaosong Zhang and Han Hu and Wenjie Wang and Jiaqi Wang},
year={2026},
eprint={2602.06013},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.06013},
}