Towards Open-ended Visual Quality Comparison

1Nanyang Technological University 2City University of Hong Kong 3Shanghai Jiao Tong University 4SenseTime Research
*Equal Contribution.

Abstract

Comparative settings (e.g., pairwise choice, listwise ranking) have been adopted by a wide range of subjective studies for image quality assessment (IQA), as they inherently standardize the evaluation criteria across different observers and offer more clear-cut responses. In this work, we extend the edge of emerging large multi-modality models (LMMs) to further advance visual quality comparison into open-ended settings, such that the model 1) can respond to open-range questions on quality comparison; and 2) can provide detailed reasoning beyond direct answers. To this end, we propose Co-Instruct. To train this first-of-its-kind open-source open-ended visual quality comparer, we collect the Co-Instruct-562K dataset from two sources: (a) LMM-merged single-image quality descriptions, and (b) GPT-4V "teacher" responses on unlabeled data. Furthermore, to better evaluate this setting, we propose MICBench, the first benchmark on multi-image comparison for LMMs. We demonstrate that Co-Instruct not only achieves 30% higher superior accuracy than state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher) on both existing related benchmarks and the proposed MICBench.

Motivation

The motivation of open-ended visual quality comparison: comparative settings effectively avoid the ambiguity of absolute evaluations on single images, and provide clearer-cut judgements that can serve as downstream guidance.

Dataset: Co-Instruct-562K

We adopt two imperfect supervisors: (1) Merge2Compare: starting from single-image human quality descriptions of 19K images in Q-Pathway, we randomly match them into 100K groups and remove the groups with the most similar descriptions using a text embedding model (sketched below). Similar to the construction of LLaVA-150K, we then prompt a single-modal LLM to compare the human descriptions within each group and merge them into 100K pseudo comparisons. (2) Teach2Compare: observing that GPT-4V has especially high accuracy on pairwise settings among existing LMMs, we leverage GPT-4V responses to expand our dataset. We collect 9K unlabeled images, match them into 30K image groups (2-4 images per group), and obtain GPT-4V responses for both caption-like general comparisons and comparison question-answer pairs. By integrating Q-Instruct-200K (on single images), Merge2Compare, and Teach2Compare, we construct Co-Instruct-562K, the first instruction-tuning dataset designed for open-ended multi-image quality comparison.
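For illustration, below is a minimal sketch of the description-similarity filter used when forming Merge2Compare-style groups. The embedding model and threshold are assumptions for illustration, not the exact choices used to build the dataset.

  # Illustrative sketch: drop groups whose quality descriptions are near-duplicates,
  # since comparing images of nearly identical described quality is uninformative.
  from itertools import combinations

  from sentence_transformers import SentenceTransformer, util

  embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model; this one is an assumption

  def keep_group(descriptions, max_similarity=0.85):
      """Keep an image group only if no two of its descriptions are too similar."""
      embeddings = embedder.encode(descriptions, convert_to_tensor=True, normalize_embeddings=True)
      for i, j in combinations(range(len(descriptions)), 2):
          if util.cos_sim(embeddings[i], embeddings[j]).item() > max_similarity:
              return False
      return True

  # groups = [["description of image A", "description of image B"], ...]  # randomly matched 2-4 image groups
  # filtered = [g for g in groups if keep_group(g)]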

Structure

To correctly refer to each image during conversation, we define an image-text interleaved format for multi-image cases, as follows:

  User: The first image: <img0> The second image: <img1> ... <query>
  Assistant: <response>
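As an illustration, the following small helper (a hypothetical name, not the authors' code) assembles a prompt in this format, where each <imgN> placeholder is later replaced by the corresponding image's visual tokens:

  # Illustrative helper that builds the image-text interleaved prompt shown above.
  ORDINALS = ["first", "second", "third", "fourth"]

  def build_prompt(num_images: int, query: str) -> str:
      parts = [f"The {ORDINALS[i]} image: <img{i}>" for i in range(num_images)]
      return "User: " + " ".join(parts) + " " + query + " Assistant:"

  # Example: a pairwise quality question
  # build_prompt(2, "Which image has better clarity?")
  # -> "User: The first image: <img0> The second image: <img1> Which image has better clarity? Assistant:"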
Moreover, as we need to feed multiple images together during instruction tuning, adopting the most popular LLaVA structure, which linearly projects visual embeddings, would exceed the context window of the language model and cause errors. Hence, we adopt an alternative visual abstractor structure that first reduces the visual token length (from 1025 to 65 tokens per image), and then concatenates the reduced visual tokens with text embeddings before passing them to the language decoder.
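Below is a minimal PyTorch sketch of the token-reduction idea behind such a visual abstractor: a fixed set of learnable queries cross-attends to the full visual token sequence, so each image contributes far fewer tokens. The single attention layer and dimensions are illustrative assumptions, not the exact architecture.

  import torch
  import torch.nn as nn

  class VisualAbstractor(nn.Module):
      """Reduce per-image visual tokens (e.g., 1025 -> 65) via learnable query cross-attention."""
      def __init__(self, dim=1024, num_queries=65, num_heads=8):
          super().__init__()
          self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
          self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
          self.norm = nn.LayerNorm(dim)

      def forward(self, visual_tokens):                       # (batch, 1025, dim)
          q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
          out, _ = self.attn(q, visual_tokens, visual_tokens)  # queries attend to all visual tokens
          return self.norm(out)                               # (batch, 65, dim)

  # abstractor = VisualAbstractor()
  # reduced = abstractor(torch.randn(2, 1025, 1024))  # two images -> (2, 65, 1024), then concatenated with text embeddings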

MICBench

We build MICBench to cover open-ended evaluation on groups of three or four images, as a complement to existing evaluation settings. It contains 2,000 open-range multi-choice questions (MCQs), each equipped with multiple candidate answers, and is constructed in two steps: sourcing diverse image groups and annotating the multi-choice questions.
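As an illustration, accuracy on such MCQs could be computed as sketched below; the item format and answer-matching rule are hypothetical, not the benchmark's official protocol.

  # Illustrative scoring sketch for an MICBench-style multi-choice question.
  def evaluate_mcq(model_answer: str, candidates: list[str], correct_index: int) -> bool:
      """Return True if the model's answer selects the labeled correct candidate."""
      answer = model_answer.strip().lower()
      letters = [chr(ord("a") + i) for i in range(len(candidates))]
      # Accept either the option letter ("A", "B", ...) or the candidate text itself.
      if answer[:1] in letters:
          return answer[:1] == letters[correct_index]
      return candidates[correct_index].lower() in answer

  # item = {"question": "Which image is the noisiest?",
  #         "candidates": ["The first image", "The second image", "The third image"],
  #         "correct_index": 2}
  # correct = evaluate_mcq(model_response, item["candidates"], item["correct_index"])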

BibTeX

@misc{wu2024openended,
      title={Towards Open-ended Visual Quality Comparison}, 
      author={Haoning Wu and Hanwei Zhu and Zicheng Zhang and Erli Zhang and Chaofeng Chen and Liang Liao and Chunyi Li and Annan Wang and Wenxiu Sun and Qiong Yan and Xiaohong Liu and Guangtao Zhai and Shiqi Wang and Weisi Lin},
      year={2024},
      eprint={2402.16641},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}