Getting Started with Multimodality | by Valentina Alto

Understanding vision capabilities of Large Multimodal Models

The recent advances in Generative AI have enabled the development of Large Multimodal Models (LMMs) that can process and generate different types of data, such as text, images, audio, and video.

LMMs share with “standard” Large Language Models (LLMs) the capability of generalization and adaptation typical of Large Foundation Models. However, LMMs are capable of processing data that goes beyond text, including images, audio, and video.

One of the most prominent examples of large multimodal models is GPT4V(ision), the latest iteration of the Generative Pre-trained Transformer (GPT) family. GPT-4 can perform various tasks that require both natural language understanding and computer vision, such as image captioning, visual question answering, text-to-image synthesis, and image-to-text translation.

The GPT4V (along with its newer version, the GPT-4-turbo vision), has proved extraordinary capabilities, including: