Apple, a company widely known for its technological advancements, has reached a new milestone in AI with its latest research.
The Cupertino giant has published two new research papers, one on creating 3D avatars and one on making language model inference more efficient. These developments could enable more engaging visual experiences and bring sophisticated AI functionality to consumer devices like iPhones and iPads, as well as Apple’s newly announced Vision Pro spatial computer.
The first paper introduces a technique called HUGS (Human Gaussian Splats), which produces animated 3D avatars from brief monocular videos, that is, videos captured with a single camera.
The lead author for the research paper, Muhammed Kocabas, said: “Our method takes only a monocular video with a small number of (50-100) frames, and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes.”
HUGS uses 3D Gaussian splatting, an efficient rendering technique, to represent both the human figure and the background scene. The method starts from a foundational human template derived from SMPL, a statistical body shape model, but allows the Gaussians to deviate from that template, which is crucial for capturing details the body model cannot express, such as clothing and hair.
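For readers curious about what that representation looks like in practice, here is a minimal PyTorch sketch, written under our own assumptions rather than taken from Apple’s code, of a set of 3D Gaussians whose centers are initialized from SMPL-style template vertices but remain free to drift:

```python
# A minimal sketch (not Apple's implementation) of representing a human body
# as 3D Gaussians initialized from a template mesh such as SMPL. The vertex
# array here is a random stand-in; a real pipeline would load SMPL.
import torch

class HumanGaussians(torch.nn.Module):
    def __init__(self, smpl_vertices: torch.Tensor):
        super().__init__()
        n = smpl_vertices.shape[0]
        # Gaussian centers start at the template vertices but are learnable,
        # letting the model capture clothing and hair the bare body lacks.
        self.means = torch.nn.Parameter(smpl_vertices.clone())
        # Anisotropic covariance is factored into per-axis log-scales and a
        # rotation quaternion, the usual 3D Gaussian splatting parameterization.
        self.log_scales = torch.nn.Parameter(torch.zeros(n, 3))
        self.quats = torch.nn.Parameter(
            torch.tensor([1.0, 0.0, 0.0, 0.0]).repeat(n, 1))
        self.opacity = torch.nn.Parameter(torch.zeros(n))   # sigmoid -> (0, 1)
        self.colors = torch.nn.Parameter(torch.rand(n, 3))  # RGB per Gaussian

    def covariances(self) -> torch.Tensor:
        # Sigma = R S S^T R^T, built from the learned scales and rotations.
        q = torch.nn.functional.normalize(self.quats, dim=-1)
        w, x, y, z = q.unbind(-1)
        R = torch.stack([
            1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y),
            2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
            2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y),
        ], dim=-1).reshape(-1, 3, 3)
        S = torch.diag_embed(self.log_scales.exp())
        return R @ S @ S.transpose(1, 2) @ R.transpose(1, 2)

# Toy usage with random "vertices" standing in for the SMPL template.
gaussians = HumanGaussians(torch.randn(6890, 3))  # SMPL has 6,890 vertices
print(gaussians.covariances().shape)  # torch.Size([6890, 3, 3])
```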
Using linear blend skinning driven by a novel neural deformation module, HUGS animates the Gaussians realistically, ensuring smooth, artifact-free repositioning of the avatar. Kocabas notes that HUGS can generate novel poses for the human and novel viewpoints for both the human and the surrounding scene.
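To make the skinning step concrete, the sketch below shows plain linear blend skinning applied to point centers in NumPy. It illustrates the general technique, not the paper’s implementation; in HUGS the skinning weights and deformations come from the learned neural module rather than being fixed by hand:

```python
# Linear blend skinning (LBS): each point is deformed by a weighted blend of
# per-joint rigid transforms. A generic illustration, not the HUGS code.
import numpy as np

def lbs_transform(points, weights, joint_rotations, joint_translations):
    """Deform N points with K joints via linear blend skinning.

    points:             (N, 3) Gaussian centers in the canonical pose
    weights:            (N, K) skinning weights, rows sum to 1
    joint_rotations:    (K, 3, 3) rotation of each joint
    joint_translations: (K, 3) translation of each joint
    """
    # Apply every joint transform to every point: shape (K, N, 3).
    per_joint = np.einsum('kij,nj->kni', joint_rotations, points)
    per_joint = per_joint + joint_translations[:, None, :]
    # Blend the K candidate positions with the per-point weights.
    return np.einsum('nk,kni->ni', weights, per_joint)

# Toy example: a point influenced equally by two joints, one of which moves.
pts = np.array([[0.0, 1.0, 0.0]])
w = np.array([[0.5, 0.5]])
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(lbs_transform(pts, w, R, t))  # [[0. 1. 0.5]] -- half of joint 2's motion
```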
According to the research paper, HUGS trains and renders up to 100 times faster than current avatar-creation methods, producing photorealistic results after about 30 minutes of processing on a gaming-grade GPU. In terms of 3D reconstruction quality, it also outperforms state-of-the-art techniques such as Vid2Avatar and NeuMan.
Solving the Memory Problem in AI Inference
The second paper from Apple tackles the problem of running large language models (LLMs) like GPT-4 on devices with limited memory. This is particularly challenging because modern LLMs contain billions of parameters, which can easily exceed the DRAM available on consumer hardware.
The paper proposes minimizing data transfer from flash storage into scarce DRAM during inference. Lead author Keivan Alizadeh explained: “Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.”
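As a rough illustration of why chunk size matters, here is a back-of-the-envelope cost model with our own toy numbers and formula, not the paper’s: every flash read pays a fixed setup latency, so fewer, larger contiguous reads amortize that overhead:

```python
# A toy flash-read cost model (illustrative assumptions, not measured values):
# each read costs a fixed setup latency plus transfer time at some bandwidth.
def read_time_ms(total_bytes, chunk_bytes, setup_ms=0.1, bandwidth_gbps=3.0):
    """Estimate time to load `total_bytes` in chunks of `chunk_bytes`."""
    n_reads = -(-total_bytes // chunk_bytes)  # ceiling division
    transfer_ms = total_bytes / (bandwidth_gbps * 1e9) * 1e3
    return n_reads * setup_ms + transfer_ms

gb = 1 << 30
for chunk in (4 << 10, 256 << 10, 8 << 20):  # 4 KiB, 256 KiB, 8 MiB chunks
    print(f"{chunk:>10} B chunks: {read_time_ms(gb, chunk):10.1f} ms")
```

Under these assumed numbers, loading 1 GiB in 4 KiB chunks is dominated by per-read setup cost, while 8 MiB chunks come close to the raw bandwidth limit.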
The research introduces two key methods: “windowing” and “row-column bundling.” Windowing reuses activations from recent inferences, keeping the recently needed parameters resident so only new data has to be fetched, while row-column bundling enables reading larger blocks by storing associated rows and columns together. These techniques significantly improve performance: on an Apple M1 Max CPU, they cut inference latency by a factor of 4-5 compared to basic loading methods, and on a GPU the speedup is even more pronounced, reaching 20-25 times.
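The sketch below illustrates both ideas as we understand them, using a toy sparse feed-forward layer in NumPy; the names and memory layout are our own assumptions, not Apple’s code:

```python
# "Windowing": keep the neurons used by the last few tokens resident in DRAM
# and load only newly needed ones from flash. "Row-column bundling": store
# each FFN neuron's up-projection row and down-projection column contiguously
# so one read fetches both. `flash` below is a stand-in for real storage.
import numpy as np

d_model, d_ff, window = 8, 32, 3

# Bundled layout: one contiguous record per neuron ->
# [up-projection row (d_model) | down-projection column (d_model)].
rng = np.random.default_rng(0)
flash = rng.standard_normal((d_ff, 2 * d_model)).astype(np.float32)

dram = {}     # neuron id -> bundled record, the DRAM cache
recent = []   # active neuron sets for the last `window` tokens

def load_neurons(active_ids):
    """Windowing: fetch only non-resident neurons, evict stale ones."""
    resident = set().union(*recent) if recent else set()
    for i in active_ids - resident:
        dram[i] = flash[i]            # one contiguous read per missing neuron
    recent.append(set(active_ids))
    if len(recent) > window:
        for i in recent.pop(0) - set().union(*recent):
            dram.pop(i, None)         # drop neurons outside the window

def ffn(x, active_ids):
    """Sparse FFN using only the predicted-active neurons' bundles."""
    load_neurons(active_ids)
    y = np.zeros(d_model, dtype=np.float32)
    for i in active_ids:
        up, down = dram[i][:d_model], dram[i][d_model:]
        y += max(up @ x, 0.0) * down  # ReLU neuron i's contribution
    return y

x = rng.standard_normal(d_model).astype(np.float32)
print(ffn(x, {1, 5, 9}))   # first token: three flash reads
print(ffn(x, {5, 9, 20}))  # next token: only neuron 20 is fetched
```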
The paper’s co-author Mehrdad Farajtabar said: “This breakthrough is particularly crucial for deploying advanced LLMs in resource-limited environments, thereby expanding their applicability and accessibility.”
This technique could soon allow consumer devices such as smartphones and laptops to run LLMs natively, meaning they would no longer need an internet connection to use AI assistants like ChatGPT.