UniPercept
Towards Unified Perceptual-Level Image Understanding across
Aesthetics, Quality, Structure and Texture
📊
UniPercept-Bench. We propose UniPercept-Bench, the first comprehensive perceptual-level MLLM benchmark spanning Image Aesthetics Assessment (IAA), Image Quality Assessment (IQA), and Image Structure & Texture Assessment (ISTA) across Visual Rating (VR) and Visual Question Answering (VQA) tasks.
🔍
Strong Baseline: UniPercept. We develop a strong baseline MLLM trained via Domain-Adaptive Pre-Training and Task-Aligned RL for perceptual-level image understanding.
🔧
Downstream Applications. UniPercept enables plug-and-play reward modeling for text-to-image generation and provides unified perceptual metrics for evaluation.
🖼️
Building Unified Profiles for Every Image. UniPercept constructs a comprehensive profile for each image, spanning the IAA, IQA, and ISTA domains and providing fine-grained, multi-dimensional outputs.
Shuo Cao*, Jiayang Li*, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu†
University of Science and Technology of China; Shanghai AI Laboratory; Peking University
* Equal contribution  † Corresponding author
Perceptual-Level Image Understanding

Perceptual-level image understanding focuses on how an image looks and feels: its aesthetics, quality degradations, structural regularity, and surface texture. These fine-grained perceptual cues differ fundamentally from semantic recognition, yet remain underexplored in MLLMs.
To address this, we introduce UniPercept, a unified framework that defines, evaluates, and improves perceptual-level visual understanding across the IAA, IQA, and ISTA domains.

UniPercept-Bench
Benchmark Examples
Q: What visual element is most prominent due to hierarchical emphasis?
A. Floral design above the circle
B. Text below the circle
C. Cultural attire within the circle
D. Historical architecture background
IAA
Composition & Design
Hierarchical Emphasis
Q: What is your assessment of the Emotion & Viewer Response quality in this picture?
A. Low
B. Medium
C. High
IAA
Emotion & Viewer Response
Level Prediction
Q: Why do the line dynamics enhance the monkey's playful expression and pose?
A. Lines create movement and flow
B. Lines emphasize facial details
C. Lines add structural complexity
D. Lines increase color contrast
IAA
Visual Elements & Structure
Line Dynamics
Q: Why does the artist use layered brushstrokes on the peony petals?
A. Simulate natural petal surfaces
B. Emphasize the golden background
C. Obscure imperfections
D. To reduce the visual prominence
IAA
Technical Execution
Material Proficiency
Q: How does the lighting affect texture visibility in the foreground of the image?
A. Enhances stone texture clarity
B. Causes noticeable blurring
C. Creates strong shadow contrasts
D. Reduces texture detail visibility
IQA
Distortion Location
Location Description
Q: Which distortion type is evident in the image, affecting color realism?
A. Gaussian YCbCr noise
B. JPEG compression artifacts
C. Saturate strengthen YCrCb distortion
IQA
Distortion Types Present
Which
Q: Overall, how would you rate the severity of distortions in this image?
A. None (no visible distortion)
B. Slight (barely noticeable but present)
C. Obvious (clearly visible and significantly impacts perception)
IQA
Distortion Severity
Severity Level
Q: What specific distortion is most noticeable on the lantern’s surface?
A. Overexposure causing loss of detail
B. Blurring obscuring texture details
C. High contrast creating harsh edges
IQA
Distortion Location
Location Description
Q: What is the primary 2D contour shape visible in the honeycomb structure?
A. Square
B. Hexagon
C. Circle
D. Pentagon
ISTA
Geometric Composition
2D Contour
Q: What term best describes the surface texture of the sandy scene?
A. Grooved
B. Pitted
C. Crystalline
D. Braided
ISTA
Physical Structure
Base Morphology
Q: Which stylistic classification best describes the overall visual theme of the limousine interior?
A. Futuristic Minimalism
B. Modern Luxury
C. Art Deco
D. Cyberpunk
ISTA
Semantic Perception
Stylistic Classification
Q: Which component exhibits the highest glossiness in surface properties?
A. Salmon
B. Rice
C. Stuffed Pepper
D. Spinach Leaves
ISTA
Material Representation
Surface Behavior
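The multiple-choice examples above share a common shape: a question, lettered options, and three tags (domain, category, sub-category). The following is a minimal sketch of how such an item could be represented and scored. The class name, field names, and the `accuracy` helper are illustrative assumptions, not the benchmark's actual schema, and the answer key used here (B, Hexagon) is assumed for the example because this page does not list ground-truth answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VQAItem:
    """One multiple-choice item. Field names are hypothetical."""
    domain: str               # "IAA", "IQA", or "ISTA"
    category: str             # e.g. "Geometric Composition"
    subcategory: str          # e.g. "2D Contour"
    question: str
    options: dict             # option letter -> option text
    answer: str               # ground-truth letter (assumed for this sketch)


def accuracy(items, predictions):
    """Fraction of items whose predicted letter matches the answer key."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        pred.strip().upper() == item.answer
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


honeycomb = VQAItem(
    domain="ISTA",
    category="Geometric Composition",
    subcategory="2D Contour",
    question="What is the primary 2D contour shape visible in the honeycomb structure?",
    options={"A": "Square", "B": "Hexagon", "C": "Circle", "D": "Pentagon"},
    answer="B",  # assumed key, not taken from the benchmark
)

print(accuracy([honeycomb], ["b"]))
```

Keeping the tags on each item makes it straightforward to group accuracy by category or by question template, which is how the VQA leaderboard columns are organized.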
Q: Rate the aesthetics score of this image as a score out of 100.
Aesthetics score : 85.
IAA
Aesthetics Score
Q: Please provide a quantitative aesthetic assessment for this image on a scale from 0 to 100.
Aesthetics score : 60.
IAA
Aesthetics Score
Q: Evaluate the aesthetics of this image with a score out of 100.
Aesthetics score : 52.
IAA
Aesthetics Score
Q: Assign an aesthetics score out of 100 to this image.
Aesthetics score : 15.
IAA
Aesthetics Score
Q: Provide an overall quality assessment score for this image (0-100).
Quality score : 80.
IQA
Quality Score
Q: Assign an overall quality assessment score to this image (0-100).
Quality score : 71.
IQA
Quality Score
Q: Give an overall quality assessment score for this image on a scale of 0-100.
Quality score : 59.
IQA
Quality Score
Q: Rate this image with an overall quality assessment score from 0 to 100.
Quality score : 36.
IQA
Quality Score
Q: Rate the overall structure & texture richness of this image on a scale of 0 to 100.
Structure & texture richness score : 80.
ISTA
Structure & Texture Richness Score
Q: Assign an overall structure & texture richness score to this image (0-100).
Structure & texture richness score : 68.
ISTA
Structure & Texture Richness Score
Q: Provide an overall structure & texture richness score for this image on a scale of 0-100.
Structure & texture richness score : 37.
ISTA
Structure & Texture Richness Score
Q: Give an overall structure & texture richness score for this image from 0 to 100.
Structure & texture richness score : 11.
ISTA
Structure & Texture Richness Score
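The Visual Rating examples above all end with a line of the form "<name> score : <number>." on a 0 to 100 scale. As a rough sketch, a response in that form could be turned into a number as follows. The function name `parse_score` and the regular expression are assumptions made for illustration, and this is not presented as the evaluation code used by the authors.

```python
import re

# Matches "... score : 85." style answers; the pattern is an assumption
# based only on the example responses shown on this page.
_SCORE_RE = re.compile(r"score\s*:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_score(response: str) -> float:
    """Extract a 0-100 score from a Visual Rating response string."""
    match = _SCORE_RE.search(response)
    if match is None:
        raise ValueError(f"no score found in: {response!r}")
    score = float(match.group(1))
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return score


for text in (
    "Aesthetics score : 85.",
    "Quality score : 36.",
    "Structure & texture richness score : 11.",
):
    print(parse_score(text))
```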
Leaderboard
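The leaderboard covers three perceptual domains, Aesthetics (IAA), Quality (IQA), and Structure & Texture (ISTA), and two tasks, Visual Rating (VR) and Visual Question Answering (VQA).
The first three tables below report VR results for IAA, IQA, and ISTA; the following three report VQA results for ISTA, IAA, and IQA.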
The dark blue and medium blue values represent the best and second-best performance, respectively.
Models ArtiMuse-10K AVA TAD66K FLICKR-AES Avg
🥇 UniPercept (Ours) 0.746/0.738 0.589/0.577 0.336/0.346 0.688/0.681 0.590/0.586
🥈 GPT-4o 0.333/0.276 0.509/0.485 0.278/0.282 0.605/0.597 0.431/0.410
🥉 GLM-4.5-V-106B 0.346/0.249 0.464/0.420 0.289/0.278 0.651/0.597 0.438/0.386
ArtiMuse 0.614/0.627 0.397/0.385 0.230/0.232 0.349/0.334 0.398/0.395
QwenVL-2.5-72B 0.233/0.197 0.408/0.387 0.232/0.235 0.626/0.589 0.375/0.352
LLaVA-OneVision-1.5-8B 0.274/0.212 0.381/0.378 0.213/0.224 0.586/0.541 0.364/0.339
Q-Insight* 0.228/0.175 0.405/0.376 0.212/0.217 0.617/0.537 0.366/0.326
InternVL3.5-38B 0.219/0.175 0.359/0.357 0.201/0.208 0.559/0.529 0.334/0.317
InternVL3-8B 0.245/0.211 0.372/0.344 0.205/0.191 0.547/0.476 0.342/0.306
QwenVL-2.5-7B 0.223/0.143 0.359/0.324 0.208/0.195 0.588/0.520 0.345/0.296
Q-Align* 0.551/0.573 0.398/0.386 0.194/0.197 0.137/0.123 0.320/0.320
InternVL3-78B 0.223/0.206 0.385/0.344 0.221/0.220 0.518/0.433 0.337/0.301
Llama-4-Scout 0.204/0.147 0.345/0.329 0.236/0.210 0.548/0.506 0.333/0.298
QwenVL-3-32B 0.227/0.130 0.353/0.198 0.200/0.095 0.572/0.413 0.338/0.209
InternVL3.5-8B 0.135/0.104 0.308/0.295 0.180/0.182 0.519/0.448 0.286/0.257
QwenVL-3-8B 0.156/0.094 0.280/0.170 0.191/0.121 0.507/0.388 0.283/0.193
Gemini-2.5-pro 0.187/0.035 0.248/0.100 0.143/0.037 0.357/0.206 0.234/0.095
Claude-Sonnet-4.5-Think 0.066/0.103 0.018/0.019 0.026/0.039 -/- 0.037/0.054
Claude-Sonnet-4.5 0.041/0.027 0.003/0.013 0.040/0.047 0.037/0.049 0.030/0.034
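In the table above (ArtiMuse-10K, AVA, TAD66K, FLICKR-AES), the Avg column appears to be the arithmetic mean of the per-dataset pairs, skipping "-/-" entries. This is an observation from the numbers in a few rows (for example UniPercept, GPT-4o, and Claude-Sonnet-4.5-Think), not a definition stated on the page, and it does not hold for every row of the other rating tables. A small sketch for checking it, with hypothetical helper names `parse_cell` and `row_mean`:

```python
def parse_cell(cell):
    """Split an 'a/b' cell into two floats; '-/-' (no result) gives None."""
    if cell == "-/-":
        return None
    first, second = cell.split("/")
    return float(first), float(second)


def row_mean(cells):
    """Mean of each value in the pair over the cells that have results."""
    pairs = [p for p in map(parse_cell, cells) if p is not None]
    if not pairs:
        return None
    n = len(pairs)
    return (sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n)


# UniPercept row, copied from the table above.
unipercept = ["0.746/0.738", "0.589/0.577", "0.336/0.346", "0.688/0.681"]
mean_first, mean_second = row_mean(unipercept)
print(mean_first, mean_second)  # compare with the reported Avg of 0.590/0.586
```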
Models KonIQ-10K SPAQ KADID PIPAL Overall
🥇 UniPercept (Ours) 0.940/0.949 0.904/0.895 0.872/0.870 0.581/0.594 0.824/0.827
🥈 Q-Insight 0.933/0.916 0.907/0.905 0.742/0.736 0.486/0.474 0.767/0.758
🥉 DeQA 0.953/0.941 0.895/0.896 0.694/0.687 0.472/0.478 0.753/0.750
Q-Align* 0.941/0.940 0.886/0.887 0.674/0.684 0.403/0.419 0.726/0.733
GPT-4o 0.695/0.744 0.874/0.881 0.677/0.646 0.325/0.349 0.643/0.655
QwenVL-3-32B 0.796/0.838 0.690/0.657 0.673/0.682 0.414/0.402 0.643/0.644
Q-Insight* 0.733/0.750 0.800/0.938 0.580/0.548 0.369/0.368 0.621/0.651
QwenVL-3-8B 0.761/0.822 0.612/0.604 0.723/0.696 0.434/0.427 0.633/0.637
InternVL3-78B 0.635/0.676 0.849/0.852 0.579/0.553 0.415/0.457 0.619/0.634
InternVL3.5-38B 0.578/0.652 0.840/0.831 0.568/0.537 0.448/0.457 0.608/0.619
QwenVL-2.5-72B 0.762/0.820 -/- 0.606/0.570 0.381/0.407 0.583/0.599
InternVL3-8B 0.574/0.646 0.828/0.800 0.496/0.475 0.435/0.459 0.583/0.595
InternVL3.5-8B 0.663/0.660 0.783/0.777 0.541/0.478 0.351/0.386 0.585/0.575
LLaVA-OneVision-1.5-8B 0.639/0.744 -/- 0.505/0.534 0.417/0.407 0.520/0.562
QwenVL-2.5-7B 0.708/0.762 -/- 0.521/0.517 0.350/0.361 0.526/0.547
Gemini-2.5-pro 0.582/0.316 0.087/0.212 0.436/0.274 0.225/-0.019 0.333/0.196
GLM-4.5-V-106B 0.721/0.765 -/- -0.142/-0.128 0.013/0.020 0.138/0.155
Llama-4-Scout 0.503/0.653 -/- -/- -/- 0.089/0.170
Claude-Sonnet-4.5 -/- 0.036/0.085 0.223/0.273 -0.131/-0.088 0.023/0.057
Claude-Sonnet-4.5-Think -/- -/- -/- -/- -/-
Models ISTA-10K
🥇 UniPercept (Ours) 0.778/0.767
🥈 InternVL3.5-38B 0.262/0.345
🥉 QwenVL-2.5-72B 0.091/0.148
Claude-Sonnet-4.5 0.125/0.089
Q-Insight* 0.060/0.152
GLM-4.5-V-106B 0.083/0.117
QwenVL-3-32B 0.084/0.106
GPT-4o -0.003/0.116
QwenVL-3-8B 0.033/0.044
QwenVL-2.5-7B -0.046/0.076
Llama-4-Scout -0.025/0.047
LLaVA-OneVision-1.5-8B -0.094/0.027
InternVL3-8B -0.127/0.046
InternVL3.5-8B -0.096/-0.025
Gemini-2.5-pro -0.230/-0.118
Claude-Sonnet-4.5-Think -/-
InternVL3-78B -/-
ArtiMuse -/-
DeQA -/-
Q-Align* -/-
Q-Insight -/-
Models ISTA Categories QA Templates Overall
Scene. Phys. Mat. Geo. Sem. How What Which Why Yes-No
🥇 UniPercept (Ours) 89.74% 85.71% 82.44% 93.94% 78.51% 82.69% 89.24% 78.54% 83.12% 85.51% 84.23%
🥈 LLaVA-OneVision-1.5-8B 78.63% 85.16% 82.44% 72.73% 80.17% 83.33% 81.40% 75.30% 84.42% 88.41% 81.13%
🥉 InternVL3-78B 79.06% 85.16% 77.42% 69.70% 78.51% 81.41% 79.65% 73.68% 84.42% 81.16% 79.28%
Gemini-2.5-pro 76.50% 82.42% 77.06% 66.67% 77.69% 78.21% 78.20% 75.71% 82.47% 71.01% 77.73%
Claude-Sonnet-4.5 76.92% 78.57% 74.91% 90.91% 77.69% 76.92% 77.03% 74.49% 81.82% 79.71% 77.32%
GLM-4.5-V-106BA12B 81.20% 79.67% 74.55% 72.73% 75.21% 80.77% 76.74% 73.68% 79.87% 78.26% 77.22%
Claude-Sonnet-4.5-Think 77.35% 78.02% 73.12% 87.88% 75.21% 76.28% 74.71% 74.09% 81.82% 76.81% 76.08%
GPT-4o 75.64% 79.12% 73.48% 33.33% 77.27% 71.79% 78.78% 69.23% 77.92% 72.46% 74.64%
InternVL3-8B 75.64% 79.12% 73.48% 33.33% 77.27% 71.79% 78.78% 69.23% 77.92% 72.46% 74.64%
QwenVL-2.5-Instruct-7B 74.79% 72.53% 74.91% 51.52% 73.55% 73.72% 77.33% 66.80% 74.03% 73.91% 73.30%
Llama-4-Scout 73.50% 75.27% 71.68% 72.73% 67.77% 75.64% 69.77% 69.64% 77.27% 69.57% 71.86%
InternVL3.5-38B 50.00% 55.49% 61.29% 30.30% 35.95% 50.64% 59.30% 42.91% 37.01% 57.97% 50.10%
InternVL3.5-8B 54.27% 50.55% 58.42% 39.39% 36.36% 46.79% 56.69% 48.58% 29.87% 71.01% 49.79%
QwenVL-3-Instruct-8B 27.78% 32.42% 25.45% 39.39% 24.79% 14.74% 23.26% 28.34% 25.32% 81.16% 27.63%
QwenVL-3-Instruct-32B 26.50% 24.73% 19.00% 15.15% 18.60% 11.54% 18.31% 22.67% 17.53% 66.67% 21.65%
QwenVL-2.5-Instruct-72B 14.10% 29.12% 19.71% 12.12% 18.60% 20.51% 12.21% 14.57% 31.17% 46.38% 19.59%
Models IAA Categories QA Templates Overall
Comp. VisStr. Tech. Creat. Theme. Emo. Gest. CompEv. Lv.Pred How What Which Why Yes-No
🥇 UniPercept (Ours) 80.00% 77.54% 69.70% 80.56% 79.26% 80.95% 67.53% 69.77% 63.71% 92.20% 81.88% 75.32% 86.67% 84.62% 76.55%
🥈 InternVL3-78B 71.79% 73.26% 61.21% 73.15% 74.81% 74.29% 53.25% 37.21% 45.14% 85.82% 81.16% 72.15% 86.00% 75.64% 68.28%
🥉 Gemini-2.5-pro 71.79% 68.45% 61.59% 76.85% 67.41% 63.81% 61.84% 37.21% 45.98% 78.72% 73.91% 67.72% 84.67% 84.62% 66.44%
ArtiMuse 67.69% 68.45% 64.85% 74.07% 71.85% 64.76% 61.04% 32.56% 39.14% 88.65% 76.81% 72.78% 85.33% 79.49% 66.31%
Claude-Sonnet-4.5 70.26% 70.05% 62.20% 71.30% 64.44% 67.62% 50.00% 46.51% 46.84% 77.30% 76.09% 65.19% 86.00% 69.23% 65.45%
Claude-Sonnet-4.5-Think 71.28% 69.52% 61.21% 68.52% 62.22% 66.67% 53.25% 41.86% 44.57% 75.89% 77.54% 67.09% 86.00% 66.67% 64.73%
GLM-4.5-V-106B 67.18% 65.78% 60.98% 75.00% 64.44% 68.57% 51.32% 46.51% 45.40% 71.63% 78.26% 65.82% 84.67% 70.51% 64.46%
QwenVL-2.5-7B 67.18% 70.74% 56.36% 66.67% 68.89% 63.81% 48.05% 37.21% 38.86% 76.76% 75.36% 67.09% 87.33% 71.79% 63.19%
LLaVA-OneVision-1.5-8B 67.18% 68.62% 61.21% 62.96% 67.41% 62.86% 53.25% 20.93% 34.86% 85.21% 79.71% 65.82% 83.33% 69.23% 62.60%
InternVL3-8B 65.64% 67.55% 59.39% 67.59% 69.63% 62.86% 50.65% 25.58% 36.00% 81.69% 73.91% 67.72% 86.00% 71.79% 62.60%
Llama-4-Scout 62.56% 68.45% 59.76% 61.11% 57.78% 70.48% 48.68% 32.56% 43.97% 70.92% 69.57% 61.39% 77.33% 70.51% 60.91%
GPT-4o 64.62% 59.57% 57.58% 60.19% 65.19% 67.62% 51.95% 30.23% 38.86% 78.17% 72.46% 62.66% 72.67% 70.51% 60.04%
InternVL3.5-38B 37.44% 40.11% 27.88% 39.81% 34.81% 38.10% 45.45% 6.98% 34.00% 47.52% 26.09% 28.48% 37.33% 50.00% 35.67%
InternVL3.5-8B 32.31% 29.41% 30.30% 26.85% 28.89% 26.67% 23.38% 9.30% 17.14% 41.13% 26.81% 19.62% 36.00% 58.97% 28.18%
QwenVL-2.5-72B 22.05% 24.60% 25.45% 29.63% 30.37% 18.10% 19.48% 6.98% 14.00% 19.86% 17.39% 24.05% 41.33% 51.28% 23.74%
Models IQA Categories QA Templates Overall
Loc. Sev. Type. Lv.Pred How What Which Why Yes-No
🥇 UniPercept (Ours) 77.43% 79.60% 90.98% 79.60% 87.03% 80.86% 75.60% 83.42% 79.31% 81.07%
🥈 LLaVA-OneVision-1.5-Instruct-8B 76.51% 59.87% 77.46% 59.87% 91.35% 70.37% 61.31% 82.35% 75.86% 72.15%
🥉 InternVL3-78B 75.41% 51.84% 81.56% 51.84% 93.51% 66.67% 63.10% 88.24% 66.67% 70.31%
GPT-4o 71.74% 53.18% 70.49% 53.18% 83.78% 59.26% 61.31% 80.21% 67.82% 66.36%
Claude-Sonnet-4.5-Think 71.19% 55.52% 66.80% 55.52% 89.19% 50.00% 51.79% 82.89% 72.41% 65.90%
QwenVL-2.5-Instruct-7B 74.13% 48.83% 66.39% 48.83% 88.65% 60.49% 53.57% 78.61% 77.01% 65.44%
Claude-Sonnet-4.5 71.19% 51.51% 66.80% 51.51% 90.81% 50.00% 50.60% 82.89% 71.26% 64.80%
InternVL3-8B 71.56% 52.84% 59.43% 52.84% 87.03% 59.88% 48.81% 71.12% 71.26% 63.69%
Llama-4-Scout 60.18% 58.19% 52.05% 58.19% 82.16% 37.04% 38.69% 66.31% 62.07% 57.81%
GLM-4.5-V-106BA12B 70.09% 35.79% 54.51% 35.79% 88.11% 48.77% 44.05% 74.33% 68.97% 57.17%
InternVL3.5-38B 38.90% 49.83% 45.08% 49.83% 46.49% 41.36% 31.55% 33.16% 62.07% 43.29%
Gemini-2.5-pro 32.84% 52.84% 40.98% 52.84% 40.54% 32.72% 29.17% 41.18% 28.74% 40.17%
InternVL3.5-8B 38.17% 44.82% 38.11% 44.82% 35.14% 41.98% 30.36% 36.36% 56.32% 39.98%
QwenVL-3-Instruct-8B 34.68% 55.18% 16.39% 55.18% 20.54% 18.52% 27.38% 25.67% 77.01% 36.21%
QwenVL-3-Instruct-32B 29.54% 14.38% 16.80% 14.38% 11.89% 18.52% 25.60% 22.46% 74.71% 22.52%
QwenVL-2.5-Instruct-72B 31.01% 4.68% 16.39% 4.68% 35.14% 14.81% 11.31% 22.99% 66.67% 20.50%
UniPercept As Reward
Quantitative Results
The dark blue and medium blue values represent the best and second-best performance, respectively.
Column groups: Preference Score (PickScore, HPSv3); Image Quality (DeQA); Image Aesthetics (LAION-Aes, ArtiMuse); UniPercept Score (IAA, IQA, ISTA)
Models PickScore HPSv3 DeQA LAION-Aes ArtiMuse IAA IQA ISTA
Baseline (FLUX.1-dev) 22.46 10.71 4.32 5.77 59.02 65.18 73.59 46.64
w/ UniPercept IAA Reward 22.47 10.09 4.09 6.19 67.02 76.20 76.39 54.83
w/ UniPercept IQA Reward 22.63 11.21 4.37 6.02 63.64 72.16 76.87 52.34
w/ UniPercept ISTA Reward 22.72 11.09 4.37 6.16 63.75 72.23 76.17 59.61
w/ UniPercept All Rewards 22.67 10.93 4.33 6.19 65.52 74.24 77.04 59.08
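To read the table above as changes relative to the FLUX.1-dev baseline, the per-metric differences can be computed directly from the reported numbers. The sketch below only restates the table; the variable and function names (`COLUMNS`, `ROWS`, `deltas`) are illustrative and say nothing about how the rewards were applied during training.

```python
COLUMNS = ["PickScore", "HPSv3", "DeQA", "LAION-Aes", "ArtiMuse", "IAA", "IQA", "ISTA"]

# Values copied from the quantitative results table above.
ROWS = {
    "Baseline (FLUX.1-dev)":     [22.46, 10.71, 4.32, 5.77, 59.02, 65.18, 73.59, 46.64],
    "w/ UniPercept IAA Reward":  [22.47, 10.09, 4.09, 6.19, 67.02, 76.20, 76.39, 54.83],
    "w/ UniPercept IQA Reward":  [22.63, 11.21, 4.37, 6.02, 63.64, 72.16, 76.87, 52.34],
    "w/ UniPercept ISTA Reward": [22.72, 11.09, 4.37, 6.16, 63.75, 72.23, 76.17, 59.61],
    "w/ UniPercept All Rewards": [22.67, 10.93, 4.33, 6.19, 65.52, 74.24, 77.04, 59.08],
}


def deltas(name):
    """Difference of each metric from the baseline, rounded to 2 decimals."""
    base = ROWS["Baseline (FLUX.1-dev)"]
    return {
        col: round(value - ref, 2)
        for col, value, ref in zip(COLUMNS, ROWS[name], base)
    }


for name in ROWS:
    if name != "Baseline (FLUX.1-dev)":
        print(name, deltas(name))
```

Among the single-domain rewards, each one yields the largest gain on its own UniPercept score, while the IAA reward slightly lowers HPSv3 and DeQA relative to the baseline.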
Qualitative Results
Figure columns: Prompt; Baseline (FLUX.1-dev); w/ UniPercept IAA Reward; w/ UniPercept IQA Reward; w/ UniPercept ISTA Reward; w/ UniPercept All Rewards. The prompts used are listed below.
A modern office space featuring a sleek desk with a computer set up, including a monitor, keyboard, and mouse. Beside the computer, there's a printer with a stack of paper next to it. An ergonomic office chair is positioned in front of the desk, ready for someone to sit down and start working.
A young child with brown hair, focused intently, sits at a wooden table scattered with colorful crayons and paper. In their small hand is a bright red pencil, with which they are diligently drawing a vibrant blue flower that's taking shape on the white sheet before them. Sunlight filters through a nearby window, casting a warm glow on the child's artwork.
A striking black bird with glossy feathers sits atop the vibrant orange petals of a Bird of Paradise flower. The unique flower is positioned in the midst of an arid desert landscape, with various cacti and sparse vegetation dotting the sandy ground. In the background, the sun casts a warm glow on the distant rolling dunes.
A vibrant yellow 2017 Porsche 911 is captured in motion, navigating a winding mountain road with its sleek body hugging the curve. The sports car's headlights are piercing through the overcast weather, illuminating the path ahead. In the background, a lush green valley stretches out beneath a sky filled with grey clouds, hinting at the vast expanse beyond the road's edge.
UniPercept As Metrics
Visualization of score distributions for selected models, benchmarks, and UniPercept metrics on generated images. The x-axis denotes the score under the corresponding metric, and the y-axis denotes the density.
Building Unified Profiles for Every Image
UniPercept generates a comprehensive perceptual profile for each image.
IAA Score: 29 / 100
IQA Score: 35 / 100
ISTA Score: 46 / 100
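A profile like the one above can be thought of as a small record holding one score per domain. The following is a minimal sketch of such a container, using the three example scores shown. The class `PerceptualProfile` and its methods are hypothetical and are not part of any released UniPercept API; the fine-grained, multi-dimensional outputs mentioned on this page are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceptualProfile:
    """Hypothetical container for the three domain scores (0-100 each)."""
    iaa: float   # aesthetics score
    iqa: float   # quality score
    ista: float  # structure & texture richness score

    def __post_init__(self):
        # Reject values outside the 0-100 scale used on this page.
        for name in ("iaa", "iqa", "ista"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be in [0, 100], got {value}")

    def summary(self) -> str:
        """Render the scores in the same layout as the example above."""
        return "\n".join(
            f"{label} Score: {value:g} / 100"
            for label, value in (("IAA", self.iaa), ("IQA", self.iqa), ("ISTA", self.ista))
        )


profile = PerceptualProfile(iaa=29, iqa=35, ista=46)
print(profile.summary())
```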
Contact & Cite

For any questions or collaborations, feel free to reach out at caoshuo@pjlab.org.cn.

If you find our work helpful, please consider citing the following:

@misc{cao2025uniperceptunifiedperceptuallevelimage,
      title={UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture}, 
      author={Shuo Cao and Jiayang Li and Xiaohui Li and Yuandong Pu and Kaiwen Zhu and Yuanting Gao and Siqi Luo and Yi Xin and Qi Qin and Yu Zhou and Xiangyu Chen and Wenlong Zhang and Bin Fu and Yu Qiao and Yihao Liu},
      year={2025},
      eprint={2512.21675},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.21675}, 
}

@misc{cao2025artimusefinegrainedimageaesthetics,
  title={ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding},
  author={Shuo Cao and Nan Ma and Jiayang Li and Xiaohui Li and Lihao Shao and Kaiwen Zhu and Yu Zhou and Yuandong Pu and Jiarui Wu and Jiaquan Wang and Bo Qu and Wenhai Wang and Yu Qiao and Dajuin Yao and Yihao Liu},
  year={2025},
  eprint={2507.14533},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.14533}
}