
Getting Started with Multimodality | by Valentina Alto | Dec, 2023


Image created with Microsoft Designer

Understanding vision capabilities of Large Multimodal Models

The recent advances in Generative AI have enabled the development of Large Multimodal Models (LMMs) that can process and generate different types of data, such as text, images, audio, and video.

LMMs share with “standard” Large Language Models (LLMs) the generalization and adaptation capabilities typical of large foundation models; unlike LLMs, however, they can process inputs that go beyond text, including images, audio, and video.

One of the most prominent examples of a large multimodal model is GPT-4V(ision), the latest iteration of the Generative Pre-trained Transformer (GPT) family. GPT-4V can perform various tasks that require both natural language understanding and computer vision, such as image captioning, visual question answering, text-to-image synthesis, and image-to-text translation.
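As an illustration of how such tasks are exposed to developers, a vision-capable chat request mixes text and image parts in a single user message. The sketch below builds the request payload for the OpenAI Chat Completions API; the model name and image URL are placeholders, so check the current API reference before relying on this exact shape:

```python
import json


def build_vision_request(question: str, image_url: str,
                         model: str = "gpt-4-vision-preview") -> dict:
    """Build the JSON payload for a chat completion that mixes text and an image.

    The user message content is a list of parts: one text part carrying the
    question and one image_url part pointing at the picture to analyse.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 300,
    }


# Example: ask the model to caption an image (placeholder URL).
payload = build_vision_request(
    "Describe this painting in one sentence.",
    "https://example.com/painting.jpg",
)
print(json.dumps(payload, indent=2))
```

The same payload shape covers visual question answering and image-to-text tasks: only the text part changes, while the image part stays a URL (or a base64 data URL) attached to the user turn.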

GPT-4V (along with its newer version, GPT-4 Turbo with Vision) has demonstrated extraordinary capabilities, including:

  • Mathematical reasoning over numerical problems:
[Image by the author]
  • Generating code from sketches:
[Images by the author]
  • Describing artistic heritage:
[Image by the author]

And many others.

In this article, we are going to focus on LMMs’ vision capabilities and how they differ from standard Computer Vision algorithms.

What is Computer Vision

Computer Vision (CV) is a field of artificial intelligence (AI) that enables computers and systems to derive…
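To ground the contrast the article draws with LMMs, classic CV techniques operate directly on pixel arrays with hand-crafted filters rather than learned multimodal representations. The sketch below implements Sobel edge detection in pure NumPy; it is a minimal illustration of the traditional approach, not a production implementation:

```python
import numpy as np


def sobel_edges(image: np.ndarray) -> np.ndarray:
    """Detect edges in a 2-D grayscale image with hand-crafted Sobel filters.

    Classic CV pipelines build on fixed convolution kernels like these,
    in contrast with LMMs, which learn visual representations from data.
    """
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)  # horizontal gradient kernel
    ky = kx.T                                  # vertical gradient kernel
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]
            gx = np.sum(patch * kx)
            gy = np.sum(patch * ky)
            out[i, j] = np.hypot(gx, gy)  # gradient magnitude
    return out


# A tiny synthetic image: dark left half, bright right half.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
edges = sobel_edges(img)
# The gradient magnitude peaks along the vertical boundary between the halves.
```

Running this on the synthetic image yields strong responses only in the columns straddling the dark/bright boundary, which is exactly the kind of low-level, hand-engineered feature extraction that pre-dates learned approaches.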


