Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment

Reasoning Across Different Visual Contents

Predicted Score: 89.0585

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear well-managed, with a balanced distribution of light across the scene. The ladybug and flower are neither underexposed nor overexposed. The contrast is adequate, providing sufficient differentiation between the bright yellow petals and the dark green background. However, there seems to be a slight softness in the overall sharpness and detail preservation, possibly due to motion blur or camera shake, which slightly reduces the clarity of fine details like the texture of the ladybug's shell and the flower's petals.

Predicted Score: 78.1170

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear well-managed, with a balanced distribution of light across the scene. There is neither significant underexposure nor overexposure, allowing for clear visibility of the leopard's fur pattern and the tree bark texture. The contrast is adequate, providing sufficient differentiation between the lighter and darker areas of the image. However, some areas exhibit slight blurring, possibly indicating motion or focus issues rather than inherent noise.

Predicted Score: 20.3562

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear to be adequate, with no significant underexposure or overexposure. However, the extreme saturation distorts the natural color balance, making the reds overly intense and the blues unnaturally vibrant. This affects the perception of exposure since it alters the true tonal values. Contrast is exaggerated due to the high saturation, creating a stark difference between light and dark areas that may lead to a loss of detail in some regions.

Predicted Score: 69.4656

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear adequate, with no significant underexposure or overexposure. The horse and rider are well-lit, and details in both the subject and background are visible. However, there is a slight loss of detail in some brighter areas, possibly due to highlight clipping. Global sharpness and detail preservation are compromised, likely due to quantization artifacts, which manifest as blocky transitions and less defined edges.

Predicted Score: 38.9313

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear to be adequate, with no significant underexposure or overexposure. Details are visible in both the subject and the background, suggesting a balanced exposure level. However, the global sharpness and detail preservation are compromised due to pixelation, which affects the clarity of fine details. This pixelation likely stems from compression artifacts or low resolution, impacting the overall perception of sharpness.

Predicted Score: 84.5715

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear well-managed, with a balanced distribution of light across the scene. There is neither significant underexposure nor overexposure, ensuring that details in both the subject and the background are visible. The contrast is adequate, providing clear differentiation between light and dark areas without appearing flat or excessively harsh. Global sharpness and detail preservation are satisfactory; however, some fine details might be slightly softened.

Predicted Score: 79.5832

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear balanced, with a warm color palette dominating the scene. The lighting seems intentional, highlighting the tree and the train while maintaining some details in the darker areas. There is no significant underexposure or overexposure, suggesting a deliberate choice to create a mood rather than lose detail. Global sharpness and detail preservation are decent, with the tree branches and leaves having defined edges.

Predicted Score: 75.0843

🔍 Low-level Attribute Analysis

The image's brightness and exposure are fairly balanced, with sufficient illumination on the road and buildings, ensuring that essential details are visible. There is no significant underexposure or overexposure, although the bright sky slightly washes out some details in the upper part of the image. Global sharpness is decent, with clear outlines of the vehicles, road markings, and buildings. However, the fisheye lens introduces a notable distortion.

Predicted Score: 38.0097

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear to be heavily influenced by atmospheric conditions, resulting in a hazy appearance that obscures details throughout the scene. The overall brightness seems adequate, but the haze causes a significant loss of detail in both near and far objects. Contrast is very low due to the uniformity of the foggy conditions, leading to a flat image where differentiation between light and dark areas is difficult.

Predicted Score: 54.7504

🔍 Low-level Attribute Analysis

The image's brightness and exposure appear to be heavily influenced by the weather condition, resulting in a significant amount of rainfall. The overall scene is relatively dim, suggesting possible underexposure due to the heavy clouds and rain, which leads to a loss of detail in the darker areas. The contrast is low, with a lack of differentiation between light and dark areas, contributing to a flat appearance.

Abstract

Recent studies demonstrate that multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. However, existing approaches typically treat quality scoring and reasoning descriptions as separate tasks with disjoint optimization objectives, which leads to a trade-off: models adept at reasoning descriptions struggle with precise score regression, while score-focused models lack interpretability. This limitation keeps MLLMs from reaching their full potential in visual quality assessment, where accuracy and interpretability should be mutually reinforcing. To address this, we propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. In the first stage, we distill high-quality data from a teacher model through expert-designed prompts and initialize reasoning capabilities via cross-entropy loss supervision. In the second stage, we introduce a novel reward with Group Relative Policy Optimization (GRPO) to jointly optimize scoring accuracy and reasoning consistency. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder, respectively. Extensive experiments show that Q-Ponder achieves state-of-the-art (SOTA) performance on quality score regression benchmarks, delivering up to 6.5% higher SRCC on cross-domain datasets. Furthermore, Q-Ponder significantly outperforms description-based SOTA models, including its teacher model Qwen-2.5-VL-72B, particularly in description accuracy and reasonableness, demonstrating its potential to generalize across diverse tasks.
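
To make the second stage more concrete, the sketch below illustrates the group-relative advantage step that GRPO is built around: several responses are sampled for the same image, each receives a scalar reward, and rewards are normalized within the group. The reward function `toy_reward`, its `scale` and `w_consistency` parameters, and the use of a population standard deviation are illustrative assumptions; this page does not specify Q-Ponder's actual reward, so this should not be read as the authors' implementation.

```python
from statistics import mean, pstdev


def toy_reward(pred_score, mos, consistent, scale=100.0, w_consistency=0.5):
    """Hypothetical reward: an accuracy term plus a bonus for consistent reasoning.

    accuracy is 1.0 for an exact score match and decays linearly to 0.0
    as the absolute error approaches `scale` (an assumed score range).
    """
    accuracy = max(0.0, 1.0 - abs(pred_score - mos) / scale)
    return accuracy + (w_consistency if consistent else 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one image with an assumed ground-truth score of 80.
group = [(82.0, True), (60.0, True), (79.0, False), (30.0, False)]
rewards = [toy_reward(pred, 80.0, ok) for pred, ok in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced; those below receive negative ones. Because the baseline is the group mean, no separate value network is needed in this formulation.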

Experimental Results

Performance Comparison Across Dataset Categories

Each cell shows PLCC / SRCC. Top-1 and Top-2 results are highlighted.

Methods	KonIQ	SPAQ	LiveW	KADID	CSIQ	AGIQA	AVG.
Handcrafted Methods
NIQE	0.533 / 0.530	0.679 / 0.664	0.493 / 0.449	0.468 / 0.405	0.718 / 0.628	0.560 / 0.533	0.575 / 0.535
BRISQUE	0.225 / 0.226	0.490 / 0.406	0.361 / 0.313	0.429 / 0.356	0.740 / 0.556	0.541 / 0.497	0.464 / 0.392
Deep-learning Methods
NIMA	0.896 / 0.859	0.838 / 0.856	0.814 / 0.711	0.532 / 0.535	0.695 / 0.649	0.715 / 0.654	0.748 / 0.711
HyperIQA	0.917 / 0.906	0.791 / 0.788	0.772 / 0.701	0.506 / 0.468	0.752 / 0.717	0.702 / 0.640	0.740 / 0.703
DBCNN	0.884 / 0.875	0.812 / 0.806	0.773 / 0.755	0.497 / 0.484	0.586 / 0.572	0.730 / 0.641	0.714 / 0.689
MUSIQ	0.924 / 0.929	0.868 / 0.863	0.789 / 0.830	0.575 / 0.556	0.771 / 0.710	0.722 / 0.630	0.775 / 0.753
CLIP-IQA+	0.909 / 0.895	0.866 / 0.854	0.832 / 0.805	0.653 / 0.642	0.772 / 0.719	0.736 / 0.685	0.795 / 0.767
ManIQA	0.849 / 0.834	0.768 / 0.758	0.849 / 0.832	0.499 / 0.465	0.623 / 0.627	0.723 / 0.636	0.719 / 0.692
MLLM-based Methods
C2Score	0.923 / 0.910	0.867 / 0.860	0.786 / 0.772	0.500 / 0.453	0.735 / 0.705	0.777 / 0.671	0.765 / 0.729
Q-Align	0.941 / 0.940	0.886 / 0.887	0.853 / 0.860	0.674 / 0.684	0.671 / 0.737	0.772 / 0.735	0.799 / 0.807
DeQA	0.953 / 0.941	0.895 / 0.896	0.892 / 0.879	0.694 / 0.687	0.787 / 0.744	0.809 / 0.729	0.838 / 0.813
Q-Insight	0.918 / 0.895	0.903 / 0.899	0.870 / 0.839	0.702 / 0.702	0.685 / 0.640	0.816 / 0.766	0.816 / 0.790
Q-Ponder	0.937 / 0.926	0.904 / 0.906	0.882 / 0.848	0.693 / 0.701	0.832 / 0.792	0.821 / 0.755	0.845 / 0.821
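
For readers unfamiliar with the two metrics in the table, the following is a minimal pure-Python sketch of PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation, computed as the Pearson correlation of average ranks). The function names and the toy data are illustrative assumptions, not the evaluation code behind the reported numbers.

```python
from math import sqrt


def pearson(x, y):
    """PLCC: Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation applied to the average ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: predictions that are monotonic in the labels but not linear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
labels = [1.0, 4.0, 9.0, 16.0, 25.0]
plcc = pearson(predicted, labels)
srcc = spearman(predicted, labels)
```

In this toy case SRCC is 1 because the ordering is preserved exactly, while PLCC is slightly lower because the relationship is not linear. This is why the two values in a cell can differ.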

Reasoning Description Quality Comparison

Quantitative comparison across six datasets. Each cell shows Completeness / Accuracy / Reasonableness; the AVG. column gives a single overall value.

Methods	CIDIQ	CSIQ	KADID	LIVE	TID2008	TID2013	AVG.
Co-Instruct	2.026 / 1.288 / 2.192	2.446 / 1.772 / 2.594	2.198 / 1.530 / 2.266	2.479 / 1.864 / 2.628	2.280 / 1.648 / 2.346	2.240 / 1.562 / 2.364	2.096
Q-Instruct	2.164 / 1.510 / 2.478	2.424 / 2.110 / 2.904	2.018 / 1.502 / 2.428	2.481 / 2.158 / 2.999	2.153 / 1.722 / 2.630	2.084 / 1.594 / 2.520	2.216
DepictQA	2.394 / 1.798 / 2.914	2.510 / 2.342 / 3.046	2.156 / 1.590 / 2.588	2.668 / 2.256 / 3.240	2.614 / 2.008 / 3.304	2.458 / 1.876 / 3.078	2.454
Q-Ponder	4.100 / 3.140 / 4.392	4.440 / 4.160 / 4.744	4.175 / 3.402 / 4.283	4.457 / 3.794 / 4.630	4.328 / 3.931 / 4.617	4.311 / 3.697 / 4.561	4.182
