OpenAI races to launch multimodal LLM GPT-Vision, aiming to beat Google Gemini’s debut

OpenAI has been releasing products at an astonishing speed, and with these launches, it has maintained the image of an AI leader that it has built since the debut of its AI chatbot, ChatGPT.

Gemini, Google’s latest large language model, is expected to debut this fall and is reportedly being tested with select enterprise clients. With that launch imminent, it initially appeared that Google might take the lead for the first time. However, it now appears that OpenAI intends to disrupt Google’s plans.

According to a report by The Information, the AI lab is aiming to launch a multimodal LLM, its next generation of large language models, codenamed Gobi, in an effort to beat Google and maintain its lead.

A multimodal large language model is essentially an advanced AI system that can process and comprehend multiple forms of data, such as text and images. Unlike traditional language models that work primarily with text, multimodal LLMs have the ability to analyze and generate content that combines both textual and visual information.

This means they can interpret images, understand context, and produce responses that draw on both textual and visual input. Multimodal LLMs are versatile and suit a range of applications, from natural language understanding to image interpretation.

“These models can work with images and text alike, producing code for a website just by seeing a sketch of what a user wants the site to look like, for instance, or spitting out a text analysis of visual charts so you don’t have to ask your engineer friend what these ones mean,” notes the report.

The report, which cites an undisclosed source familiar with the matter, says OpenAI is actively working to incorporate multimodal capabilities, similar to what Gemini is expected to offer, into GPT-4.

The Microsoft-backed firm showcased those features during the launch of GPT-4 but limited their availability to a single company, Be My Eyes, whose mobile app assists people who are blind or visually impaired with daily activities. It is now preparing to roll out these features, dubbed GPT-Vision, to a wider audience.

Sam Altman, OpenAI’s CEO, has hinted in several recent interviews that GPT-5 is not on the horizon, but that the company plans to make various enhancements to GPT-4. This may be one of them.

In an interview with Wired last week, Google CEO Sundar Pichai expressed confidence in Google’s current standing in AI, describing technological progress as a long-term effort and the company’s strategy as a deliberate balance of innovation and responsibility. He also acknowledged OpenAI’s ChatGPT launch, crediting it with demonstrating product-market fit and users’ readiness for AI technology, while emphasizing Google’s cautious approach given the trust and responsibility associated with its products.