ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

书生·妙析: Multimodal Large Model for Aesthetic Understanding

(† corresponding authors)
1 University of Science and Technology of China   2 Shanghai AI Laboratory   3 China Academy of Art  
4 Peking University   5 Shanghai Jiao Tong University   6 Sun Yat-sen University  
7 The Chinese University of Hong Kong   8 Hong Kong Polytechnic University

📰 News & Updates

  • 🚀 July 28, 2025: ArtiMuse was officially released at WAIC 2025, in the forum "Evolving with AI: The Iteration and Resilience of Artistic Creativity".

  • 🚀 July 24, 2025: The Online Demo is now open for public access!

  • 🚀 July 21, 2025: The Paper and Project Page are now live!

Online Demo


Abstract

The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased the practical demand for comprehensive Image Aesthetics Assessment (IAA), particularly for methods that deliver both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods show stronger perceptual and generalization capabilities than traditional approaches, yet they suffer from modality bias (score-only or text-only output) and lack fine-grained attribute decomposition, and therefore fail to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset, comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with an 8-dimensional attribute analysis and a holistic score. Both the model and the dataset will be made public to advance the field.

Overview of ArtiMuse


Overview of ArtiMuse. ArtiMuse provides granular, expert-level textual understanding results for images across eight fine-grained aesthetic attributes. Additionally, it achieves precise image aesthetics scoring, significantly outperforming state-of-the-art models across multiple widely-used benchmarks.

Comparison with Existing Models

Aesthetic Analysis

Aesthetics Scoring

Model                       AVA             PARA            TAD66K          FLICKR-AES      ArtiMuse-10K
                            SRCC    PLCC    SRCC    PLCC    SRCC    PLCC    SRCC    PLCC    SRCC    PLCC
Traditional Models
MUSIQ                       0.225   0.258   0.490   0.600   0.099   0.149   0.150   0.216   -0.060  -0.074
TANet                       0.758   0.765   n/a     n/a     0.513   0.531   n/a     n/a     n/a     n/a
VILA                        0.776   0.775   0.651   0.658   0.418   0.444   0.616   0.645   0.273   0.268
AesMamba                    0.774   0.769   0.936   0.902   0.511   0.483   n/a     n/a     n/a     n/a
MLLMs for General-Purpose Applications
mPLUG-Owl2                  0.206   0.211   0.376   0.372   0.089   0.106   0.382   0.359   0.159   0.145
ShareGPT-4V                 0.213   0.199   0.509   0.417   0.097   0.091   0.335   0.289   0.076   0.057
Qwen-2.5-VL-7B              0.391   0.371   0.721   0.743   0.240   0.242   0.621   0.578   0.256   0.179
Qwen-2.5-VL-72B-instruct    0.408   0.387   0.727   0.763   0.232   0.235   0.626   0.589   0.233   0.197
InternVL3-8B                0.364   0.332   0.667   0.693   0.203   0.191   0.553   0.459   0.187   0.157
InternVL3-78B               0.385   0.344   0.666   0.694   0.221   0.220   0.518   0.433   0.223   0.206
GPT-4o                      0.509   0.485   0.697   0.744   0.278   0.282   0.605   0.597   0.333   0.276
Gemini-2.0-flash            0.474   0.457   0.703   0.704   0.319   0.323   0.658   0.651   0.286   0.265
MLLMs for Image Aesthetics Assessment
Q-Instruct                  0.318   0.338   0.569   0.724   0.122   0.159   0.259   0.299   -0.045  -0.056
PEAS                        0.748   0.748   0.686   0.700   0.415   0.444   0.577   0.613   0.306   0.293
Q-Align                     0.822   0.817   0.913   0.888   0.501   0.531   0.798   0.818   0.551   0.573
UNIAA-LLaVA                 0.713   0.704   0.864   0.895   0.411   0.425   0.724   0.751   n/a     n/a
Next Token Is Enough        0.828   0.825   n/a     n/a     0.413   0.444   n/a     n/a     n/a     n/a
ArtiMuse                    0.827   0.826   0.936   0.958   0.510   0.543   0.814   0.837   0.614   0.627

Compared with existing models, ArtiMuse achieves both accurate aesthetic analysis and precise aesthetics scoring in multi-dimensional assessments.
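The scoring comparison is reported as SRCC and PLCC, that is, Spearman's rank correlation and Pearson's linear correlation between predicted and ground-truth scores. As a reading aid, here is a minimal standard-library sketch of how these two metrics are commonly computed. It is not the authors' evaluation code; the function names and the sample scores are hypothetical.

```python
from math import sqrt


def plcc(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical scores for five images (not taken from any benchmark).
predicted = [62.0, 45.5, 87.0, 30.0, 71.2]
expert = [60.0, 50.0, 90.0, 20.0, 65.0]

print(f"SRCC = {srcc(predicted, expert):.3f}")
print(f"PLCC = {plcc(predicted, expert):.3f}")
```

SRCC depends only on how the images are ordered, while PLCC also reflects how linearly the predicted values track the ground truth, which is why the two numbers are reported side by side.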

Results on Real‑world Images

Aesthetics Score: 87 / 100

ArtiMuse-10K Dataset

We construct ArtiMuse-10K, a high-quality dataset comprising 10,000 carefully curated images spanning 5 primary categories: Graphic Design, 3D Design, AIGC-generated images, Photography, and Painting & Calligraphy. These categories are subdivided into 15 distinct subcategories, such as Chinese Painting, Sculpture, and Daily Photography, ensuring comprehensive representation of diverse artistic expressions. Each image is annotated by professional experts on eight aesthetic attributes and an overall aesthetics score, offering superior professional rigor and annotation granularity. ArtiMuse-10K far exceeds existing IAA datasets in diversity and granularity.
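To make the annotation structure concrete, the sketch below models one annotated image as a Python record. It is only an illustration under stated assumptions: the released dataset's actual file format and field names are not given here, the eight attribute names are not listed in this text (so placeholders are used), and the class name, file path, score value, and the pairing of "Chinese Painting" with "Painting & Calligraphy" are all hypothetical.

```python
from dataclasses import dataclass

# The five primary categories named in the dataset description.
MAIN_CATEGORIES = (
    "Graphic Design",
    "3D Design",
    "AIGC-generated images",
    "Photography",
    "Painting & Calligraphy",
)
NUM_ATTRIBUTES = 8  # eight aesthetic attributes per image, per the description


@dataclass
class AnnotatedImage:
    """Hypothetical record layout; not the dataset's actual schema."""

    image_path: str
    main_category: str
    subcategory: str
    attribute_comments: dict[str, str]  # attribute name -> expert commentary
    overall_score: float  # holistic aesthetics score (scale assumed, not stated)

    def problems(self) -> list[str]:
        """Return a list of consistency problems; an empty list means none found."""
        issues = []
        if self.main_category not in MAIN_CATEGORIES:
            issues.append(f"unknown main category: {self.main_category!r}")
        if len(self.attribute_comments) != NUM_ATTRIBUTES:
            issues.append(
                f"expected {NUM_ATTRIBUTES} attributes, "
                f"got {len(self.attribute_comments)}"
            )
        return issues


# Placeholder attribute names, since the eight real names are not listed here.
example = AnnotatedImage(
    image_path="images/000001.jpg",  # hypothetical path
    main_category="Painting & Calligraphy",
    subcategory="Chinese Painting",  # assumed to belong to this main category
    attribute_comments={f"attribute_{i}": "expert comment" for i in range(1, 9)},
    overall_score=87.0,  # hypothetical value
)
print(example.problems())
```

Keeping the per-attribute text and the holistic score in the same record mirrors the joint scoring and understanding setup described above, where each image carries both kinds of annotation.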

Data Curation and Model Training

Pipeline of ArtiMuse. ArtiMuse encompasses a multi-stage pipeline spanning data collection & processing, annotation generation, and model training, systematically enhancing its text evaluation capabilities and score assessment proficiency across multiple dimensions.

BibTeX

If you find our work useful, please consider citing our paper: