When Will Large Vision Models Have Their ChatGPT Moment?

The launch of ChatGPT in November 2022 marked a turning point in natural language processing (NLP), showcasing the transformer architecture's ability to understand and generate text. Computer vision now appears to be approaching a similar inflection point with the emergence of pre-trained large vision models (LVMs). Convolutional neural networks (CNNs) have dominated the field since the early 2010s, while diffusion models such as Stable Diffusion have more recently become the standard for image generation.

A pivotal architectural moment came in 2017, when Google researchers introduced the transformer, which relies on a self-attention mechanism rather than the recurrent or convolutional structures that preceded it. Because attention lets every element of a sequence weigh its relevance to every other element directly, transformers proved highly effective at modeling long-range dependencies in text, laying the groundwork for large language models (LLMs) and the current wave of generative AI.
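To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the 2017 design. This is only the kernel of the idea; a real transformer adds learned query/key/value projections, multiple attention heads, and feed-forward layers on top of it:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each query attends to every key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                              # each output mixes all values

# Self-attention over a toy sequence of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8): every token now carries context from all tokens
```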

Transformers have since been adapted to computer vision, with Google's Vision Transformer (ViT) as the canonical example and companies such as Google and Meta releasing increasingly capable LVMs, some of which process both text and images. Because vision transformers lack the built-in spatial priors of CNNs, they typically require much larger training datasets; in return, attention over the entire image gives them a global contextual understanding that generally translates into higher accuracy.
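The key adaptation in ViT is to split an image into fixed-size patches and treat each flattened patch as a token, so the same attention machinery applies unchanged. A minimal sketch of that patchification step (a full ViT would also apply a learned linear projection and positional embeddings to these vectors):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Turn an (H, W, C) image into a (num_patches, patch_dim) token sequence."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image dims must divide evenly by patch size"
    patches = image.reshape(H // p, p, W // p, p, C)  # carve out the patch grid
    patches = patches.transpose(0, 2, 1, 3, 4)        # bring the two grid axes together
    return patches.reshape(-1, p * p * C)             # flatten each patch into one vector

image = np.zeros((224, 224, 3))  # standard ViT input resolution
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```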

Experts such as Srinivas Kuppa of SymphonyAI believe LVMs are on the brink of transforming the computer vision market, much as LLMs transformed NLP. Because the models arrive pre-trained, customers are spared the cost of training from scratch, which makes them far more accessible. SymphonyAI, for instance, takes open-source LVMs and fine-tunes them on proprietary data to sharpen their performance for specific applications. If that pattern holds, pre-trained LVMs could disrupt computer vision much as pre-trained LLMs disrupted the NLP space.
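As an illustration of that workflow, here is a minimal PyTorch/torchvision sketch of fine-tuning an open-source pre-trained vision transformer for a narrow task. SymphonyAI's actual models and data are not public, so the backbone choice, the frozen-backbone strategy, and the 12-class head below are illustrative assumptions, not the company's method:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ViT backbone (downloads weights on first use).
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

# Freeze the pre-trained backbone; only a new task-specific head is trained.
for param in model.parameters():
    param.requires_grad = False
model.heads = nn.Linear(model.hidden_dim, 12)  # hypothetical 12-class domain task

optimizer = torch.optim.AdamW(model.heads.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch standing in for proprietary data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 12, (8,))
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")
```

Freezing the backbone keeps the general visual knowledge from pre-training intact while only a small head adapts to the new domain, which is what makes this approach cheap relative to training a vision model from scratch.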