The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring paradigm across a wide spectrum of visual generation tasks.
Our analysis reveals that this paradigm suffers from stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that adopts a pairwise comparison protocol to ensure stable, human-aligned evaluation.
Crucially, our experiments uncover a striking finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models as judges. Our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods.
Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
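For readers who want to reproduce this kind of rank-agreement check, the snippet below is a minimal sketch of computing a Spearman correlation between two leaderboards with `scipy.stats.spearmanr`; the model names and Elo values are made up for illustration and are not GenArena or LMArena data.

```python
from scipy.stats import spearmanr

# Hypothetical Elo scores for the same five models under two ranking sources
# (illustrative numbers only, not actual GenArena or LMArena data).
genarena_elo = {"model_a": 1210, "model_b": 1105, "model_c": 1050, "model_d": 980, "model_e": 890}
lmarena_elo  = {"model_a": 1190, "model_b": 1120, "model_c": 1010, "model_d": 995, "model_e": 905}

models = sorted(genarena_elo)                  # fix a common model order
ours = [genarena_elo[m] for m in models]
reference = [lmarena_elo[m] for m in models]

rho, p_value = spearmanr(ours, reference)      # rank correlation between the two leaderboards
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```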
Elo-based rankings of state-of-the-art models across Basic, Reasoning, and Multi-Reference editing tasks.
Snapshot from January 2026. For the latest rankings, visit the live leaderboard.
| Models | Basic Elo | Basic Rank | Reasoning Elo | Reasoning Rank | MultiRef Elo | MultiRef Rank |
|---|---|---|---|---|---|---|
| GPT Image 1.5 [High] | 1162 | 🥇 | 1204 | 🥇 | 1259 | 🥇 |
| Qwen-Image-Edit-2511 | 1065 | 🥈 | 1005 | #4 | 793 | #7 |
| Nano Banana | 1056 | 🥉 | 1130 | 🥈 | 1048 | 🥉 |
| Flux.2 [klein] 9B | 1046 | #4 | 962 | #6 | 1018 | #4 |
| LongCat-Image-Edit | 1037 | #5 | 944 | #8 | -- | -- |
| Qwen-Image-Edit-2509 | 1020 | #6 | 962 | #7 | 705 | #10 |
| GPT Image 1 [High] | 1004 | #7 | 1095 | 🥉 | 1066 | 🥈 |
| Flux.2 [dev] | 997 | #8 | 968 | #5 | 948 | #6 |
| Flux.2 [klein] 4B | 987 | #9 | 928 | #9 | 967 | #5 |
| Qwen-Image-Edit | 979 | #10 | 920 | #10 | -- | -- |
| Flux.1 Kontext [dev] | 860 | #11 | 849 | #12 | 713 | #9 |
| Bagel | 773 | #12 | 823 | #13 | 649 | #11 |
| Step1X-Edit | 739 | #13 | 744 | #14 | -- | -- |
| DreamOmni2 | 718 | #14 | 858 | #11 | 777 | #8 |
Rankings are established via pairwise battles judged by Qwen3-VL-32B Instruct (FP8), with battle outcomes aggregated into Elo scores.
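As a rough illustration of how such Elo-scale scores can be obtained from battle records (stage 3 of the pipeline described below), here is a minimal Bradley-Terry fit using the classic minorization-maximization updates; the battle log and model names are fabricated for the example, and this is not the exact GenArena fitting code.

```python
import math
from collections import defaultdict

# Hypothetical battle log (model_a, model_b, winner) over three made-up models;
# illustrative only, not actual GenArena judgments.
battles = [
    ("gpt-image", "qwen-edit", "gpt-image"),
    ("qwen-edit", "gpt-image", "qwen-edit"),
    ("gpt-image", "flux-dev", "gpt-image"),
    ("flux-dev", "qwen-edit", "flux-dev"),
    ("qwen-edit", "flux-dev", "qwen-edit"),
    ("flux-dev", "gpt-image", "gpt-image"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
wins = defaultdict(float)        # total wins per model
pair_games = defaultdict(float)  # games played per unordered model pair
for a, b, winner in battles:
    wins[winner] += 1
    pair_games[frozenset((a, b))] += 1

# Bradley-Terry strengths via minorization-maximization iterations.
strength = {m: 1.0 for m in models}
for _ in range(200):
    updated = {}
    for i in models:
        denom = 0.0
        for pair, n_games in pair_games.items():
            if i in pair:
                (j,) = pair - {i}  # the opponent in this pairing
                denom += n_games / (strength[i] + strength[j])
        updated[i] = wins[i] / denom
    total = sum(updated.values())
    strength = {m: s / total for m, s in updated.items()}

# Map strengths onto an Elo-like scale, anchored so the mean rating is 1000.
raw = {m: 400 * math.log10(strength[m]) for m in models}
shift = 1000 - sum(raw.values()) / len(raw)
for m, rating in sorted(raw.items(), key=lambda kv: -kv[1]):
    print(f"{m:>10s}  Elo ~ {rating + shift:6.1f}")
```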
GenArena orchestrates a scalable tournament among generative models in three stages:
1. Curate diverse instructions and sample outputs from the candidate models, which enter the tournament as competitors.
2. Have VLM judges evaluate each model pair with a bi-directional consistency check and a forced-choice mechanism (a sketch of this judging loop follows the list).
3. Transform the pairwise outcomes into a continuous leaderboard using the Bradley-Terry statistical model.
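A minimal sketch of the stage-2 judging loop is shown below. The `judge` callable is a hypothetical stand-in for the actual VLM request (e.g., a prompt asking the judge to pick the better of two images with no tie option); the function only illustrates the bi-directional consistency check, which keeps a battle only when both presentation orders agree.

```python
import random
from typing import Callable, Optional

# A judge takes (instruction, first_image, second_image) and must answer
# "first" or "second" (forced choice, no ties). This signature is an assumption
# for illustration, not GenArena's actual API.
Judge = Callable[[str, str, str], str]

def pairwise_verdict(judge: Judge, instruction: str, image_a: str, image_b: str) -> Optional[str]:
    """Bi-directional consistency check over a forced-choice judge.

    The pair is judged twice, once per presentation order; the battle counts
    only if the two verdicts agree, otherwise it is discarded as order-biased.
    """
    forward = judge(instruction, image_a, image_b)    # image A presented first
    backward = judge(instruction, image_b, image_a)   # image B presented first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return None  # inconsistent across orders: drop (or re-sample) this battle

# Dummy judge for demonstration only: flips a coin instead of calling a real VLM.
def coin_flip_judge(instruction: str, first_image: str, second_image: str) -> str:
    return random.choice(["first", "second"])

print(pairwise_verdict(coin_flip_judge, "Make the sky a sunset orange.", "a.png", "b.png"))
```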
@misc{li2026genarenaachievehumanalignedevaluation,
title={GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?},
author={Ruihang Li and Leigang Qu and Jingxu Zhang and Dongnan Gui and Mengde Xu and Xiaosong Zhang and Han Hu and Wenjie Wang and Jiaqi Wang},
year={2026},
eprint={2602.06013},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.06013},
}