Apple researchers have shared a research paper detailing MM1, a multimodal AI model with rich visual capabilities. On the benchmarks reported in the paper, it rivals OpenAI’s GPT-4 and Google’s Gemini thanks to sophisticated training and an intelligent architecture.

As a multimodal model, MM1 combines visual and language understanding, allowing it to reason about images and text together.

MM1, like GPT-4V and Gemini, is built on a large language model (LLM) backbone. It was trained on a diverse dataset spanning image-text pairs, documents featuring a mix of images and text, and purely textual information. The training mixture was 45% image-text pairs, 45% documents with interleaved images and text, and 10% text-only data.
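To make those ratios concrete, here is a minimal sketch of how such a mixture might drive sampling during training. The source names and the sampling function are hypothetical illustrations, not Apple’s actual pipeline; only the 45/45/10 weights come from the paper.

```python
import random

# Sampling weights reported for MM1's pre-training mixture.
# The source names below are hypothetical labels for illustration.
mixture = {
    "image_text_pairs": 0.45,   # captioned images
    "interleaved_docs": 0.45,   # documents mixing images and text
    "text_only": 0.10,          # pure text
}

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training example is drawn from."""
    r, cumulative = rng.random(), 0.0
    for source, weight in mixture.items():
        cumulative += weight
        if r < cumulative:
            return source
    return "text_only"  # guard against floating-point rounding

rng = random.Random(0)
counts = {source: 0 for source in mixture}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 4500 / 4500 / 1000
```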

This training approach has given MM1 skills comparable to those of its competitors, including the ability to describe images, answer questions about them, and solve elementary math problems.

The research showed that the largest model, with 30 billion parameters, displayed a remarkable aptitude for learning from just a few in-context examples and for reasoning across multiple images.

The researchers also observed that increasing the model’s image-processing capacity, such as the resolution of input images, significantly improved its overall performance.

With an expansion to 30 billion parameters and the use of Mixture-of-Experts (MoE) models, a technique in which several specialized subnetworks, or experts, each handle part of the input, MM1 has attained cutting-edge results. It surpasses most existing models on tasks such as few-shot image captioning and visual question answering.
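The core MoE idea is that a lightweight router sends each token to only a few experts, so capacity grows without a matching growth in compute per token. The following is a minimal sketch of top-k expert routing under assumed toy sizes; it is an illustration of the general technique, not MM1’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not from the paper.
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small feed-forward layer; here, just a weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # routing weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a token vector x to its top-k experts and mix their outputs."""
    logits = x @ router                # one routing score per expert
    top = np.argsort(logits)[-top_k:]  # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the chosen experts only
    # Weighted sum of the selected experts' outputs; the remaining experts
    # are skipped entirely, which is what keeps MoE compute-efficient.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (8,)
```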

Furthermore, MM1 demonstrates proficiency in more intricate scenarios such as multi-image reasoning: it can combine information from several images to answer complex queries or draw conclusions that no single image would support on its own. Such an ability could let MM1 interpret the real world in a manner closer to human cognition and analytical thinking.

The transparency and the understated manner in which Apple unveiled this model mark a significant shift from its usual secrecy. This change represents a substantial victory for the open-source community.

Via: The Decoder