GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

Ruihang Li1,2,3, Leigang Qu4, Jingxu Zhang1, Dongnan Gui1,
Mengde Xu3, Xiaosong Zhang3, Han Hu3, Wenjie Wang1†, Jiaqi Wang2†
1University of Science and Technology of China   2Shanghai Innovation Institute   3Tencent   4National University of Singapore
GenArena Framework

GenArena leverages pairwise comparison to achieve stable and human-aligned evaluation for visual generation tasks.

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring paradigm across a wide spectrum of visual generation tasks.

Our analysis reveals that this paradigm is fundamentally limited by stochastic inconsistency and poor alignment with human perception. To address these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation.

Crucially, our experiments uncover a transformative finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models as evaluation judges. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods.

Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.

GenArena Leaderboard

Elo-based rankings of state-of-the-art models across Basic, Reasoning, and Multi-Reference editing tasks.

Snapshot from January 2026. For the latest rankings, visit the live leaderboard.

| Model | Basic Elo (Rank) | Reasoning Elo (Rank) | MultiRef Elo (Rank) |
|---|---|---|---|
| GPT Image 1.5 [High] | 1162 (🥇) | 1204 (🥇) | 1259 (🥇) |
| Qwen-Image-Edit-2511 | 1065 (🥈) | 1005 (#4) | 793 (#7) |
| Nano Banana | 1056 (🥉) | 1130 (🥈) | 1048 (🥉) |
| Flux.2 [klein] 9B | 1046 (#4) | 962 (#6) | 1018 (#4) |
| LongCat-Image-Edit | 1037 (#5) | 944 (#8) | -- |
| Qwen-Image-Edit-2509 | 1020 (#6) | 962 (#7) | 705 (#10) |
| GPT Image 1 [High] | 1004 (#7) | 1095 (🥉) | 1066 (🥈) |
| Flux.2 [dev] | 997 (#8) | 968 (#5) | 948 (#6) |
| Flux.2 [klein] 4B | 987 (#9) | 928 (#9) | 967 (#5) |
| Qwen-Image-Edit | 979 (#10) | 920 (#10) | -- |
| Flux.1 Kontext [dev] | 860 (#11) | 849 (#12) | 713 (#9) |
| Bagel | 773 (#12) | 823 (#13) | 649 (#11) |
| Step1X-Edit | 739 (#13) | 744 (#14) | -- |
| DreamOmni2 | 718 (#14) | 858 (#11) | 777 (#8) |

Rankings are established via pairwise battles judged by Qwen3-VL-32B-Instruct (FP8) and aggregated into Elo scores.

Method Overview

GenArena orchestrates a scalable tournament among generative models through three stages.

Stage 1: Competitive Sampling

Curate diverse instructions and sample outputs from candidate models as tournament competitors.
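
As an illustration, the sketch below assembles a round-robin battle schedule; the prompts, model names, and `generate` helper are hypothetical placeholders, not part of GenArena's actual pipeline.

```python
# Minimal sketch of a round-robin battle schedule.
# Prompts, model names, and generate() are hypothetical placeholders.
from itertools import combinations

prompts = ["Add a red scarf to the dog", "Replace the sky with a sunset"]  # curated instructions
models = ["model_a", "model_b", "model_c"]                                 # candidate generators

def generate(model: str, prompt: str) -> str:
    """Placeholder: run the model on the prompt and return a path to its output image."""
    return f"outputs/{model}/{abs(hash(prompt)) % 10_000:04d}.png"

# Every unordered pair of candidates competes on every instruction.
battles = [
    {"prompt": p, "model_x": a, "model_y": b,
     "image_x": generate(a, p), "image_y": generate(b, p)}
    for p in prompts
    for a, b in combinations(models, 2)
]
print(f"{len(battles)} battles scheduled")  # 2 prompts x 3 model pairs = 6 battles
```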

Stage 2: Pairwise Judging

VLM judges evaluate each model pair with a bi-directional consistency check and a forced-choice mechanism.
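
A minimal sketch of this judging step is shown below, assuming a `judge` callable that wraps the VLM and must answer "first" or "second" under a forced-choice prompt; the actual prompt template is not reproduced here.

```python
# Sketch of bi-directional, forced-choice pairwise judging.
# `judge` is an assumed wrapper around the VLM: given the instruction and two
# images, it returns "first" or "second" (no tie option is offered).
from typing import Callable, Optional

def consistent_verdict(judge: Callable[[str, str, str], str],
                       prompt: str, image_x: str, image_y: str) -> Optional[str]:
    """Return 'x', 'y', or None if the verdict flips with presentation order."""
    pass_1 = judge(prompt, image_x, image_y)   # x shown first
    pass_2 = judge(prompt, image_y, image_x)   # order swapped: y shown first
    if pass_1 == "first" and pass_2 == "second":
        return "x"     # x preferred in both orders
    if pass_1 == "second" and pass_2 == "first":
        return "y"     # y preferred in both orders
    return None        # order-dependent verdict; discard or re-sample this battle
```

In this sketch, only order-consistent verdicts are kept as battle outcomes before aggregation.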

Stage 3: Elo Aggregation

Transform pairwise outcomes into a continuous leaderboard using the Bradley-Terry statistical model.
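
A minimal sketch of this aggregation is given below, assuming battles are recorded as (winner, loser) pairs; the Elo-style scale (400·log10) and the 1000-point anchor are conventional choices rather than values taken from the paper.

```python
# Sketch of Bradley-Terry aggregation of pairwise outcomes into Elo-style scores.
import math
from collections import defaultdict

def bradley_terry_elo(battles, iters=200):
    """battles: list of (winner, loser) pairs. Returns {model: elo_score}."""
    wins = defaultdict(float)     # total wins per model
    games = defaultdict(float)    # games played per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    # Minorization-maximization updates of Bradley-Terry strengths.
    # Note: models with zero wins would need a small pseudo-count in practice.
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = sum(games[frozenset((m, o))] / (strength[m] + strength[o])
                        for o in models if o != m)
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        mean = sum(new.values()) / len(new)      # renormalize to keep the scale stable
        strength = {m: s / mean for m, s in new.items()}

    # Map strengths onto an Elo-like scale, anchored so the average score is 1000.
    raw = {m: 400.0 * math.log10(s) for m, s in strength.items()}
    offset = 1000.0 - sum(raw.values()) / len(raw)
    return {m: round(r + offset) for m, r in raw.items()}

# Toy usage with hypothetical outcomes.
print(bradley_terry_elo([("model_a", "model_b"), ("model_b", "model_c"),
                         ("model_c", "model_a"), ("model_a", "model_b")]))
```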

Qualitative Results

Image Editing Results

More Cases of Multi-Reference Image Generation

Quantitative Results

BibTeX

@misc{li2026genarenaachievehumanalignedevaluation,
      title={GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?},
      author={Ruihang Li and Leigang Qu and Jingxu Zhang and Dongnan Gui and Mengde Xu and Xiaosong Zhang and Han Hu and Wenjie Wang and Jiaqi Wang},
      year={2026},
      eprint={2602.06013},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.06013},
}