{"ID":2882495,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.10993","arxiv_id":"2508.10993","title":"Match \u0026 Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models","abstract":"Text-to-image (T2I) models based on diffusion and transformer architectures advance rapidly. They are often pretrained on large corpora, and openly shared on a model platform, such as HuggingFace. Users can then build up AI applications, e.g., generating media contents, by adopting pretrained T2I models and fine-tuning them on the target dataset. While public pretrained T2I models facilitate the democratization of the models, users face a new challenge: which model can be best fine-tuned based on the target data domain? Model selection is well addressed in classification tasks, but little is known in (pretrained) T2I models and their performance indication on the target domain. In this paper, we propose the first model selection framework, M\u0026C, which enables users to efficiently choose a pretrained T2I model from a model platform without exhaustively fine-tuning them all on the target dataset. The core of M\u0026C is a matching graph, which consists of: (i) nodes of available models and profiled datasets, and (ii) edges of model-data and data-data pairs capturing the fine-tuning performance and data similarity, respectively. We then build a model that, based on the inputs of model/data feature, and, critically, the graph embedding feature, extracted from the matching graph, predicts the model achieving the best quality after fine-tuning for the target domain. We evaluate M\u0026C on choosing across ten T2I models for 32 datasets against three baselines. Our results show that M\u0026C successfully predicts the best model for fine-tuning in 61.3% of the cases and a closely performing model for the rest.","short_abstract":"Text-to-image (T2I) models based on diffusion and transformer architectures advance rapidly. They are often pretrained on large corpora, and openly shared on a model platform, such as HuggingFace. Users can then build up AI applications, e.g., generating media contents, by adopting pretrained T2I models and fine-tuning...","url_abs":"https://arxiv.org/abs/2508.10993","url_pdf":"https://arxiv.org/pdf/2508.10993v1","authors":"[\"Basile Lewandowski\",\"Robert Birke\",\"Lydia Y. Chen\"]","published":"2025-08-14T18:00:50Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\",\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}