OpenAI is adding voice and image capabilities to its AI chatbot, ChatGPT, the company announced today. The chatbot, which has grown to become the most widely used generative AI product, will now be able to see, hear, and speak, letting users converse with it through images and speech and hear it talk back with spoken responses. It is arguably one of the biggest ChatGPT updates yet.

The voice feature, which essentially works like Alexa, serves as an alternative to typing, allowing users to have back-and-forth spoken conversations with ChatGPT. What may set it apart from the likes of Alexa, Siri, and Google Assistant are the large language models underpinning it. In less than a year since its launch, ChatGPT has demonstrated significantly greater utility than these other virtual assistants. Users can choose from five different voices.
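OpenAI hasn't said how the voice feature is exposed under the hood, but as a rough illustration, a developer-facing text-to-speech call with a selectable preset voice might look like the sketch below. The `tts-1` model and `alloy` voice names are assumptions made for this example, not confirmed details.

```python
# Hypothetical sketch: synthesizing a spoken reply with a preset voice.
# The "tts-1" model and "alloy" voice names are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

speech = client.audio.speech.create(
    model="tts-1",   # assumed text-to-speech model name
    voice="alloy",   # assumed name of one of a handful of preset voices
    input="Here is a step-by-step recipe based on what is in your fridge.",
)
speech.stream_to_file("reply.mp3")  # save the synthesized audio to disk
```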

The image feature, according to the video shared by OpenAI, looks like a more powerful version of Google Lens, and will let users show ChatGPT photos to get information or advice. For example, users could show a photo of their fridge to get meal ideas. The image understanding is powered by OpenAI’s GPT models.

The new features will roll out to Plus and Enterprise customers over the next two weeks, with plans to expand to all users at a later date. The image feature will be available on all platforms, including the web, while voice will be limited to the iOS and Android apps and will require users to opt in.

“Voice and image give you more ways to use ChatGPT in your life. Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it. When you’re home, snap pictures of your fridge and pantry to figure out what’s for dinner (and ask follow up questions for a step by step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you,” stated the company in a blog post, sharing possible use cases for the new features.

Commenting on the launch, Logan Kilpatrick, the company’s developer relations advocate, said he has been playing with the new capabilities and found them “truly mind blowingly good.”

He also confirmed that the new capabilities will eventually be made available to developers through APIs.
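No such API has shipped yet, but for a sense of what developer access to image understanding might look like, here is a minimal sketch assuming the capability arrives through OpenAI's existing chat completions interface. The model name `gpt-4-vision-preview` and the image-message format are assumptions for illustration.

```python
# Hypothetical sketch: sending an image to a vision-capable chat model.
# The model name and message structure below are assumptions; OpenAI has
# not published a developer API for these capabilities at announcement time.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What could I cook with these ingredients?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/fridge.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```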

The announcement follows reports that the AI lab was racing to bring multimodal capabilities, dubbed GPT-Vision, to ChatGPT in order to beat the debut of Google Gemini. The company's announcement makes no reference to GPT-Vision, but this is what multimodal capability looks like in practice: the chatbot can now take in and respond with a combination of text, images, and audio.

The latest version of the company’s image generation model, DALL-E 3, has also been integrated natively into ChatGPT, enhancing the chatbot’s capabilities for visual content creation. DALL-E 3 was announced last week and is expected to debut for ChatGPT Plus and Enterprise customers over the next few days.
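OpenAI hasn't said whether DALL-E 3 will also be reachable through the Images API that serves earlier DALL-E models, but if it follows the same pattern, a request might look like the sketch below; the `dall-e-3` model identifier is an assumption.

```python
# Hypothetical sketch: generating an image with DALL-E 3 via the Images API.
# The "dall-e-3" model name is an assumption based on OpenAI's naming pattern.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",  # assumed identifier
    prompt="A watercolor illustration of a robot reading a cookbook",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```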

The company claims its latest voice technology can create lifelike synthetic voices from just a few seconds of speech, opening the door to creative and accessibility applications. But it also carries risks, such as impersonation and fraud. To address this, OpenAI is deploying the technology only for a specific use case: ChatGPT’s voice chat. The company is also collaborating with select partners, including Spotify, which is using the technology for a voice translation feature that helps podcasters translate their content into other languages using their own voices.

OpenAI’s vision-based models also pose unique challenges, from hallucinating details in images to the risks of relying on the model’s interpretation in high-stakes domains. The company said it has done rigorous testing with red teamers and alpha testers to work out guardrails for responsible usage.