DreamLLM: Synergistic Multimodal Comprehension and Creation

ICLR 2024 Spotlight


Abstract

This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy.

Given any interleaved document, DreamLLM is trained by generative modeling of everything, including text and images. Images are generated by the Stable Diffusion (SD) image decoder, conditioned on dream queries that gather the MLLM's semantics. DreamLLM also learns where to generate images by predicting special dream tokens.
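The decoding flow described above can be illustrated with a toy sketch. It is not the released implementation: the token spelling "<dream>", the query count, and every function name here are hypothetical, and the three callables are stand-ins for the language model, the dream-query encoder, and the SD image decoder. It only shows the control flow: text tokens are emitted one by one, and when the model predicts a dream token, the dream queries are evaluated against the context so far and their outputs condition the image decoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

DREAM_TOKEN = "<dream>"  # hypothetical spelling of the special dream token
EOS_TOKEN = "<eos>"      # hypothetical end-of-sequence marker
NUM_DREAM_QUERIES = 4    # hypothetical; the page does not state a count


@dataclass
class Image:
    # Stand-in for a generated image; stores the conditioning it was decoded from.
    condition: List[str]


Segment = Union[str, Image]


def generate_interleaved(
    next_token: Callable[[List[str]], str],
    encode_queries: Callable[[List[str], int], List[str]],
    decode_image: Callable[[List[str]], Image],
    prompt: List[str],
    max_steps: int = 32,
) -> List[Segment]:
    """Emit text tokens, and an image wherever a dream token is predicted."""
    context = list(prompt)
    output: List[Segment] = []
    for _ in range(max_steps):
        token = next_token(context)
        if token == EOS_TOKEN:
            break
        if token == DREAM_TOKEN:
            # The dream queries summarize the context into conditioning
            # for the image decoder.
            condition = encode_queries(context, NUM_DREAM_QUERIES)
            output.append(decode_image(condition))
        else:
            output.append(token)
        # Simplification: the context keeps the raw token, including the
        # dream token as a placeholder for the generated image.
        context.append(token)
    return output


# Toy stand-ins so the sketch runs without any model weights.
SCRIPT = ["A", "panda", DREAM_TOKEN, EOS_TOKEN]


def toy_next_token(context: List[str]) -> str:
    return SCRIPT[len(context)]


def toy_encode_queries(context: List[str], n: int) -> List[str]:
    return [f"query{i}:{' '.join(context)}" for i in range(n)]


def toy_decode_image(condition: List[str]) -> Image:
    return Image(condition=condition)


document = generate_interleaved(
    toy_next_token, toy_encode_queries, toy_decode_image, prompt=[]
)
```

Here document holds two text tokens followed by one Image whose conditioning was computed from the text preceding the dream token. In a real system the stand-ins would be replaced by the MLLM forward pass and a diffusion decoder that accepts conditioning embeddings.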

Multimodal dialogue examples


User: Can you give a short introduction to this movie?
DreamLLM: The Godfather, also known as Il Padrino, is a classic American crime drama film directed by Francis Ford Coppola. It is the first installment in The Godfather trilogy, which also includes The Godfather Part II (1974) and The Godfather Part III (1990). The movie is widely regarded as a masterpiece and a milestone in the mafia genre, exploring the story of the Corleone family, a fictional Italian-American crime family in New York. It follows the transformation of Michael Corleone, the youngest son of the family's aging don, Vito Corleone, as he rises to power and eventually takes on the mantle of his father. The film features a strong cast, which includes Marlon Brando as Vito Corleone, Al Pacino as Michael Corleone, James Caan as Sonny Corleone, Robert Duvall as Tom Hagen, and Diane Keaton as Kay Corleone. The Godfather is known for its iconic scenes, memorable lines, and its impact on the Hollywood film industry.
User: Good. Show me an illustration of a panda wearing a suit and drinking the martini like the Godfather.
DreamLLM: An illustration of a panda wearing a suit and drinking a martini like the Godfather.



Text-to-image generation





Oil-on-canvas painting of a blue night sky with roiling energy. A fuzzy and bright yellow crescent moon shining at the top. Below the exploding yellow stars and radiating swirls of blue, a distant village sits quietly on the right. Connecting earth and sky is a flame-like cypress tree with curling and swaying branches on the left. A church spire rises as a beacon over rolling blue hills.








How are images generated through dream queries?

DreamLLM
A cat and a whisky.
DreamLLM
A polar bear in the forest.

Citation

@inproceedings{dong2024dreamllm,
  author = {Dong, Runpei and Han, Chunrui and Peng, Yuang and Qi, Zekun and Ge, Zheng and Yang, Jinrong and Zhao, Liang and Sun, Jianjian and Zhou, Hongyu and Wei, Haoran and Kong, Xiangwen and Zhang, Xiangyu and Ma, Kaisheng and Yi, Li},
  title = {Dream{LLM}: Synergistic Multimodal Comprehension and Creation},
  booktitle = {The Twelfth International Conference on Learning Representations},
  url = {https://openreview.net/forum?id=y01KGvd9Bw},
  year = {2024},
}