Diffusion models have advanced image generation considerably, and many top-tier models are now freely available on open-source platforms. Despite these strides, text-to-image systems still struggle to handle diverse prompt types and remain confined to the output of a single model. Efforts to unify them commonly address two distinct facets: parsing varied prompts at the input stage, and activating the right expert model to generate the output.
Recent years have seen the rise of diffusion models such as DALL-E 2 and Imagen, which have transformed image editing and stylization, but their closed-source nature impedes widespread adoption. Stable Diffusion (SD), an open-source text-to-image model, and its latest iteration, SDXL, have therefore gained popularity. They still face model limitations and prompt constraints, which are tackled through approaches such as SD1.5+LoRA and prompt engineering, yet optimal performance remains out of reach. Methods like prompt engineering and fixed templates only partially address these challenges, and the lack of a comprehensive solution raises the question: can a unified framework be devised to lift prompt constraints and activate domain expert models?
Researchers from ByteDance and Sun Yat-Sen University have proposed DiffusionGPT, which employs a Large Language Model (LLM) to build an all-encompassing generation system. Using a Tree-of-Thought (ToT) structure, it integrates various generative models on the basis of prior knowledge and human feedback. The LLM parses the prompt and guides a search over the ToT to select the most suitable model for generating the desired output, while Advantage Databases enrich the ToT with human feedback, aligning the model selection process with human preferences and providing a comprehensive, user-informed solution.
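To make the ToT idea concrete, here is a minimal sketch of how such a hierarchical model registry could be represented and searched. The category and model names are illustrative assumptions, not the exact tree used in DiffusionGPT.

```python
# Minimal sketch of a Tree-of-Thought model registry (names are hypothetical).
from dataclasses import dataclass, field


@dataclass
class ToTNode:
    """One node in the model tree: either a category or a leaf expert model."""
    name: str
    children: list["ToTNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


# Hypothetical tree: root -> domain categories -> expert models.
model_tree = ToTNode("root", [
    ToTNode("photorealistic", [ToTNode("realistic-vision"), ToTNode("sd-xl-base")]),
    ToTNode("anime",          [ToTNode("anything-v5"),      ToTNode("counterfeit")]),
    ToTNode("art",            [ToTNode("openjourney"),      ToTNode("dreamshaper")]),
])


def search(node: ToTNode, choose) -> ToTNode:
    """Walk the tree top-down, letting `choose` (e.g. an LLM call) pick a branch
    at each level until a leaf model is reached."""
    while not node.is_leaf():
        node = choose(node.children)
    return node
```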
The system, DiffusionGPT, follows a four-step workflow: Prompt Parse, Tree-of-Thought of Models Build and Search, Model Selection with Human Feedback, and Execution of Generation. The Prompt Parse stage extracts the salient information from diverse prompts, while the Tree-of-Thought of Models constructs a hierarchical model tree for efficient searching. Model Selection leverages human feedback through the Advantage Databases, ensuring alignment with user preferences. The chosen generative model then carries out the Execution of Generation, with a Prompt Extension Agent enhancing prompt quality for improved outputs.
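The sketch below strings these four stages together. The helper functions (`llm_parse`, `tot_search`, `rank_by_feedback`, `extend_prompt`, `run_model`) are hypothetical stand-ins for the LLM calls, the model-tree search, the Advantage-Database lookup, the Prompt Extension Agent, and the diffusion backend.

```python
# Minimal sketch of the four-step DiffusionGPT workflow (helpers are assumed).
def diffusion_gpt(prompt: str, model_tree, advantage_db):
    # 1. Prompt Parse: the LLM extracts the salient content from the raw prompt.
    parsed = llm_parse(prompt)

    # 2. Tree-of-Thought Build and Search: narrow the model tree to candidate
    #    expert models that match the parsed prompt.
    candidates = tot_search(model_tree, parsed)

    # 3. Model Selection with Human Feedback: re-rank candidates using
    #    human-preference scores stored in the Advantage Databases.
    best_model = rank_by_feedback(candidates, parsed, advantage_db)[0]

    # 4. Execution of Generation: the Prompt Extension Agent enriches the prompt
    #    before the chosen expert model generates the final image.
    extended = extend_prompt(parsed)
    return run_model(best_model, extended)
```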
The researchers employed ChatGPT as the LLM controller in the experimental setup, integrating it into the LangChain framework for precise guidance. DiffusionGPT showed superior performance over baseline models such as SD1.5 and SDXL across various prompt types. Notably, it addressed semantic limitations and enhanced image aesthetics, outperforming SD1.5 in image-reward and aesthetic scores by 0.35% and 0.44%, respectively.
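As a rough illustration of that setup, the snippet below wires a ChatGPT model into LangChain as a controller step that picks a model category for a prompt. The prompt template and chain are illustrative assumptions; the paper's actual agent prompts are more elaborate.

```python
# Minimal sketch: ChatGPT as a LangChain-driven controller (template is hypothetical).
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Hypothetical controller step: ask the LLM to choose a model category.
select_template = PromptTemplate(
    input_variables=["prompt", "categories"],
    template=(
        "Given the user prompt: {prompt}\n"
        "Choose the most suitable model category from: {categories}\n"
        "Answer with the category name only."
    ),
)

select_chain = LLMChain(llm=llm, prompt=select_template)
category = select_chain.run(prompt="a watercolor fox in a forest",
                            categories="photorealistic, anime, art")
print(category)
```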
To conclude, DiffusionGPT, proposed by researchers from ByteDance Inc. and Sun Yat-Sen University, introduces a comprehensive framework that seamlessly integrates high-quality generative models and handles a wide variety of prompts. Leveraging LLMs and a ToT structure, DiffusionGPT adeptly interprets input prompts and selects the most suitable model. This training-free solution delivers strong performance across diverse prompts and domains, and by incorporating human feedback through Advantage Databases it offers an efficient, plug-and-play solution that the community can readily integrate and build upon.
Check out the Paper. All credit for this research goes to the researchers of this project.