OpenAI believes that ‘superintelligence’, AI that surpasses humans in intelligence and reasoning, could be achieved within the next 10 years. But for such a powerful AI to exist, it would also require equally robust safety measures and guidelines to keep it safe and beneficial for humans.
For that reason, the AI startup put together a Superalignment team earlier this year with the sole purpose of aligning the goals and values of superintelligence with those of humans. But the team faces a fundamental problem, which OpenAI describes as follows: “We still do not know how to reliably steer and control superhuman AI systems. Future AI systems will be capable of extremely complex and creative behaviors that will make it hard for humans to reliably supervise them.”
To overcome this hurdle, OpenAI is researching whether smaller AI models can supervise larger ones, and the company has already tested this idea with existing models: it used GPT-2 to supervise GPT-4, and found that the resulting model performed “somewhere between GPT-3 and GPT-3.5.” This shows that a weaker model can elicit a meaningful portion of GPT-4’s capabilities. More technical details can be found in OpenAI’s research paper.
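To make the idea concrete, here is a minimal toy sketch of the weak-to-strong setup in PyTorch: a small “weak” model is trained on a little ground truth, labels the remaining data, and a larger “strong” model is then fine-tuned only on those imperfect labels. The model sizes, synthetic task, and training details are illustrative assumptions, not OpenAI’s actual GPT-2/GPT-4 experiment.

```python
# Toy sketch of weak-to-strong supervision. Everything here
# (model widths, synthetic data, hyperparameters) is a stand-in,
# not the setup from OpenAI's paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp(width):
    return nn.Sequential(nn.Linear(16, width), nn.ReLU(), nn.Linear(width, 2))

weak_model = make_mlp(width=8)      # stand-in for the weak supervisor
strong_model = make_mlp(width=256)  # stand-in for the strong student

# Synthetic binary task: the label is the sign of the first feature.
x = torch.randn(2048, 16)
y = (x[:, 0] > 0).long()

def train(model, inputs, labels, steps=300, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        opt.step()

# 1) Train the weak supervisor on a small slice of ground truth.
train(weak_model, x[:256], y[:256])

# 2) The weak supervisor labels the rest; its mistakes carry over.
with torch.no_grad():
    weak_labels = weak_model(x[256:]).argmax(dim=-1)

# 3) Fine-tune the strong student on the weak labels only.
train(strong_model, x[256:], weak_labels)

# The question the research asks: does the student merely imitate
# the weak labels, or generalize beyond its supervisor's errors?
with torch.no_grad():
    acc = (strong_model(x[256:]).argmax(-1) == y[256:]).float().mean()
print(f"strong-student accuracy vs. ground truth: {acc:.2%}")
```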
However, this proof of concept has limitations; for example, the setup does not use ChatGPT preference data, so it does not fully mirror how today’s models are aligned. Still, the results support OpenAI’s claim that “naive human supervision—such as reinforcement learning from human feedback (RLHF)—could scale poorly to superhuman models without further work, but it is feasible to substantially improve weak-to-strong generalization.”
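The paper summarizes such results with a “performance gap recovered” (PGR) metric: how much of the gap between the weak supervisor and a strong model trained directly on ground truth the weakly supervised student closes. The function below is a simple sketch of that calculation; the example numbers are illustrative, not figures from the paper.

```python
def performance_gap_recovered(weak_perf, weak_to_strong_perf, strong_ceiling_perf):
    """PGR = (student - weak) / (ceiling - weak).
    1.0 means the weakly supervised student matches the strong ceiling;
    0.0 means it does no better than its weak supervisor."""
    return (weak_to_strong_perf - weak_perf) / (strong_ceiling_perf - weak_perf)

# Illustrative numbers only: weak supervisor at 60% accuracy,
# weakly supervised student at 75%, strong ceiling at 85%.
print(performance_gap_recovered(0.60, 0.75, 0.85))  # -> 0.6
```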
OpenAI acknowledges that this is only a small step toward the ultimate goal of Superalignment, and that a strong model trained this way may simply imitate the errors of its weaker supervisor. Even so, the setup surfaces key difficulties in Superalignment that could eventually be resolved with more research.
This is why OpenAI has also announced a $10 million grant program to kickstart research in this area, aimed at “graduate students, academics, and other researchers to work on superhuman AI alignment broadly.” At the same time, the company has released open-source code to make it easier for researchers to run weak-to-strong generalization experiments.