OpenAI is no longer the frontrunner when it comes to state-of-the-art multimodal AI models. Last week, Google’s AI research lab, Google DeepMind, introduced a new family of models called Gemini that outperforms the best OpenAI models on standard benchmark datasets, including those for automatic speech recognition and translation.

The three Gemini models, Ultra (for highly complex tasks), Pro (for a broad range of tasks), and Nano (for on-device tasks), are already being rolled out in Google products, starting with Nano on the Pixel smartphone and Pro in the Bard chatbot.

“This new era of models represents one of the biggest science and engineering efforts we’ve undertaken as a company,” wrote Google’s CEO Sundar Pichai in a blog post. “I’m genuinely excited for what’s ahead, and for the opportunities Gemini will unlock for people everywhere.”

Despite a misleading demo video, Gemini’s capabilities are impressive. Gemini Ultra surpassed OpenAI’s flagship model, GPT-4, on a range of tasks, including coding, natural image understanding, and mathematical reasoning.

Compared to OpenAI’s Whisper v3 multilingual model, Gemini Pro performed better at automatic speech recognition, achieving a lower word error rate on a 62-language benchmark. Similarly, Gemini Pro beat Whisper v2 at automatic speech translation on a 21-language benchmark.
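
For readers unfamiliar with the metric: word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn a system’s transcript into the reference transcript, divided by the number of words in the reference. Lower is better. The Python sketch below illustrates that standard definition only; it is not the evaluation code used by Google or OpenAI, the function name `word_error_rate` is our own, and it assumes simple whitespace tokenization, whereas real benchmarks typically normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: splits on whitespace and applies no text
    normalization (an assumption; benchmark pipelines usually normalize).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One dropped word out of six reference words gives a WER of 1/6.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the reference, which is why it is reported as a rate rather than an accuracy.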

In the blog post, Google DeepMind CEO Demis Hassabis explained the lab’s approach to developing a truly multimodal system: “We designed Gemini to be natively multimodal, pre-trained from the start on different modalities… This helps Gemini seamlessly understand and reason about all kinds of inputs from the ground up, far better than existing multimodal models.”

Still, critics point out that Google’s performance gains over OpenAI are relatively small, indicating that the race to build the best AI system is far from over. With challengers including Meta and Samsung, the competition is sure to heat up even more in 2024.


