The recent advances in Generative AI have enabled the development of Large Multimodal Models (LMMs) that can process and generate different types of data, such as text, images, audio, and video.
LMMs share with “standard” Large Language Models (LLMs) the capability of generalization and adaptation typical of Large Foundation Models. However, LMMs are capable of processing data that goes beyond text, including images, audio, and video.
One of the most prominent examples of large multimodal models is GPT4V(ision), the latest iteration of the Generative Pre-trained Transformer (GPT) family. GPT-4 can perform various tasks that require both natural language understanding and computer vision, such as image captioning, visual question answering, text-to-image synthesis, and image-to-text translation.
The GPT4V (along with its newer version, the GPT-4-turbo vision), has proved extraordinary capabilities, including:
- Mathematical reasoning over numerical problems:
- Generating code from sketches:
- Description of artistic heritages:
And many others.
In this article, we are going to focus on LMMs’ vision capabilities and how they differ from the standard Computer Vision algorithms.
What is Computer Vision
Computer Vision (CV) is a field of artificial intelligence (AI) that enables computers and systems to derive…