OpenAI’s New Models Without the Hype - Sync #515
Plus: Gemini 2.5 Flash; an AI model decoding dolphins' speech; Hugging Face enters open-source robotics; a breakthrough in lab-grown meat; adventures with adversarial audio attacks; and more!
Hello and welcome to Sync #515!
This week, OpenAI released a host of new models and tools, including GPT-4.1, new reasoning models o3 and o4-mini, and Codex CLI. We will take a closer look at those announcements and separate the hype from reality.
Elsewhere in AI, Google released Gemini 2.5 Flash, the latest model from its Gemini 2.5 reasoning family, and DolphinGemma, an AI model for decoding dolphins’ speech. Additionally, Dr Ian Cutress asks and answers an important question (who is actually making money in AI?), Mustafa Suleyman defends Microsoft’s AI strategy, and Benn Jordan shares his adventures with adversarial audio attacks.
Over in robotics, Hugging Face has acquired French startup Pollen Robotics and plans to enter open-source robotics. Meanwhile, a drone has beaten human pilots in an international drone racing competition, and Unitree teases a boxing match with humanoid robots.
Additionally, Japanese researchers have made a breakthrough in lab-grown meat, Marques Brownlee reviews a bionic hand, and a Nobel Prize laureate in chemistry argues that we are not programmed to die.
I hope you’ll enjoy this week’s issue of Sync!
OpenAI’s New Models Without the Hype
It’s been a busy week for OpenAI. The company has released a suite of new models and tools, including GPT-4.1, the reasoning-focused o3 and o4-mini models, and Codex CLI—a terminal-based coding agent. The buzz has been immediate and intense. Some have gone as far as calling o3 a step towards AGI.
But let’s slow down.
These models are good. In some ways, they’re OpenAI’s best yet. But they are not flawless, not unbeatable, and definitely not AGI. In fact, they come with trade-offs that deserve more attention, such as hallucination rates, safety concerns, and pricing that sometimes lags behind increasingly capable competitors like Google’s Gemini 2.5 Pro.
This article cuts through the hype to take a closer look at what OpenAI’s new models can do—and where they still fall short.
o3 and o4-mini: Good but not perfect and definitely not AGI
Let’s start with o3 and o4-mini, the most capable reasoning models in OpenAI’s “o-series” to date. These models are designed to “think longer before answering,” combining deep reasoning with the ability to use tools such as Python, web browsing, file analysis, and image editing. They can even manipulate visual inputs—cropping, zooming, or flipping images—as part of their internal thought process. It’s a significant leap from previous ChatGPT models, and in many tasks, the improvement is real.
According to benchmarks provided by OpenAI, o3 shines in STEM, coding, and business use cases, offering strong performance on benchmarks like SWE-bench, Codeforces, and MathVista. OpenAI says it makes 20% fewer major errors than o1, its predecessor. Meanwhile, o4-mini is a more cost-efficient option that still delivers impressive results, especially when paired with tools like the Python interpreter.
However, if you add Google’s Gemini 2.5 Pro into the mix, the picture becomes more complicated. On many benchmarks, Gemini matches or beats OpenAI’s latest models. In some cases—like chart reasoning and scientific question answering—it outperforms o3. Even worse for OpenAI, Gemini often does so at a fraction of the cost, as perfectly illustrated by benchmarks from Aider. That matters, especially for developers and businesses working at scale.
So yes, o3 and o4-mini are OpenAI’s smartest models yet. But “best in class” depends on what class you’re looking at—and whether you’re factoring in the price tag.
Despite their sophistication, o3 and o4-mini come with a familiar problem: they hallucinate—and in some cases, they hallucinate more than earlier models.
According to OpenAI’s own technical report, o3 hallucinated answers to 33% of questions in its internal PersonQA benchmark, a significant jump from o1 and o3-mini, which hovered around 15%. o4-mini did even worse, hallucinating 48% of the time. Third-party researchers at Transluce observed o3 fabricating entire tool actions, like claiming it had executed code on a local machine—a capability it simply doesn’t have.
These aren’t harmless quirks. In sensitive domains such as legal, medical, or financial work, hallucinations can undermine trust and introduce risk. And while OpenAI emphasises that the new models are “trained to reason about when and how to use tools,” reasoning alone doesn’t guarantee reliability.
Safety researchers have also raised red flags. Apollo Research found that both o3 and o4-mini engaged in strategic deception during tests—using restricted tools or exceeding quotas, then lying about it. Metr, a long-time evaluation partner of OpenAI, reported reward-hacking behaviours and voiced concerns that it wasn’t given enough time to thoroughly test o3 before launch.
OpenAI has responded by retraining safety datasets, implementing refusal protocols, and stress-testing the models against high-risk prompts. According to its Preparedness Framework, both o3 and o4-mini remain below the “high” risk threshold in areas such as biosecurity, cybersecurity, and AI self-improvement. However, as these models grow more capable and agentic, external researchers—and even OpenAI itself—have warned that future iterations could edge dangerously close to those limits. In fact, OpenAI acknowledges in the System Card that future models may become too risky to release if they cross certain capability thresholds, particularly in areas like autonomous tool use or biological threat generation. The company has committed to halting releases if those thresholds are breached, but that commitment raises critical questions about transparency, oversight, and how “frontier” risks will be monitored in practice as scaling continues.
Codex CLI
Alongside its new models, OpenAI quietly launched Codex CLI, its answer to Anthropic’s Claude Code. Codex CLI is a lightweight, open-source coding agent that runs directly in the terminal, designed to leverage o3 and o4-mini’s tool-using and reasoning capabilities in a real-world development workflow.
Codex CLI allows users to pass in screenshots or even rough sketches, combining them with access to local files and folders. The model can then perform code edits, generate project scaffolding, or debug issues—essentially acting as a smart assistant inside your development environment. It’s OpenAI’s clearest step yet towards building an “agentic software engineer,” a goal the company has spoken about publicly.
OpenAI is also launching a $1 million grant initiative to support projects using Codex CLI and its models, offering API credits in $25,000 increments.
Codex CLI is OpenAI’s attempt to position itself not just as a model provider but as a platform for AI-native software development. Recent reports that OpenAI is in talks to acquire Windsurf, a rising AI-assisted coding tool, for around $3 billion, only underscore the company’s growing ambition to dominate the developer tooling space.
GPT-4.1
Before o3 and o4-mini grabbed most of the spotlight, OpenAI also released GPT-4.1, a new family of models aimed squarely at developers. Available via API only, GPT-4.1, along with its smaller siblings GPT-4.1 mini and nano, focuses on software engineering, instruction following, and reliable formatting—what OpenAI calls “real-world use”.
GPT-4.1 boasts a 1-million-token context window, making it possible to process the equivalent of War and Peace in one go. It also performs competitively on coding benchmarks like SWE-bench, particularly in structured tasks such as front-end development and format-sensitive generation, and on long-context multimodal benchmarks like Video-MME.
But perhaps most importantly, GPT-4.1 is cost-efficient. At $2 per million input tokens—compared to GPT-4.5’s staggering $75—it offers similar performance at a fraction of the price. In fact, OpenAI is already phasing out GPT-4.5 from its API entirely, citing the high cost of serving it at scale.
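To put those numbers in perspective, here is a back-of-the-envelope comparison of input costs alone, using the per-million-token prices quoted above (the War and Peace token count is a rough estimate: roughly 590k words at about 1.3 tokens per word):

```python
# Rough input-token cost comparison; prices in USD per million input tokens,
# as quoted above. The War and Peace token count is an estimate.
PRICES = {"gpt-4.1": 2.00, "gpt-4.5": 75.00}
war_and_peace_tokens = 780_000

for model, price in PRICES.items():
    cost = war_and_peace_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.2f} per full read of War and Peace")

# gpt-4.1: $1.56 vs gpt-4.5: $58.50, a ~37x gap before counting output tokens
```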
Still, GPT-4.1 isn’t perfect. It tends to become more error-prone when handling extremely long inputs, and its literal interpretation of prompts sometimes limits flexibility.
Availability and rollout
OpenAI’s new models are rolling out across its product ecosystem, though access still depends on which tier you’re on.
As of this week, ChatGPT Plus, Pro, and Team users can access o3, o4-mini, and o4-mini-high. These replace older models like o1 and o3-mini. Enterprise and Edu users will get access a week later. For free-tier users, o4-mini is available in a limited form through the new “Think” mode. Pro users can still use o1-pro for now, with o3-pro (featuring full tool support) expected in the coming weeks.
Developers can access o3 and o4-mini via the Chat Completions API and Responses API, the latter offering better support for tool reasoning and execution summaries. Meanwhile, GPT-4.1 and its variants are available only via API.
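In practice, the new models are a new endpoint plus a knob for how hard they think. Below is a minimal sketch of calling o4-mini through the Responses API with the openai Python SDK; the parameter names reflect the API as of writing and may change:

```python
# Minimal sketch: o4-mini via the Responses API using the openai Python SDK.
# Assumes OPENAI_API_KEY is set in the environment; the reasoning-effort
# setting and model name may change after the rollout period.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o4-mini",
    reasoning={"effort": "medium"},  # how long the model "thinks" first
    input="A bat and a ball cost $1.10; the bat costs $1 more than the ball. "
          "What does the ball cost?",
)

print(response.output_text)  # final answer; the reasoning happens internally
```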
Progress, but not AGI
o3 and o4-mini are smarter, more tool-savvy, and more flexible than anything the company has released before. GPT-4.1 brings serious improvements to coding workflows while cutting costs dramatically. And tools like Codex CLI show OpenAI thinking beyond just chatbots and toward actual AI agents embedded in our workflows.
But these models are not flawless, and they’re certainly not AGI. They still hallucinate. They still mislead. They still get tripped up by common-sense reasoning. And for all their strengths, they’re no longer unchallenged at the top. Google’s Gemini 2.5 Pro is giving them serious competition, especially on price and performance.
OpenAI’s latest releases are a meaningful step forward, but not a revolution. The gap between impressive demos and real-world reliability is still real. So it’s worth celebrating the progress—but only if we’re willing to see past the hype and stay honest about what these models can—and can’t—do.
If you enjoy this post, please click the ❤️ button or share it.
Do you like my work? Consider becoming a paying subscriber to support it.
For those who prefer to make a one-off donation, you can 'buy me a coffee' via Ko-fi. Every coffee bought is a generous contribution towards the work that goes into this newsletter.
Your support, in any form, is deeply appreciated and goes a long way in keeping this newsletter alive and thriving.
🦾 More than a human
‘We Are Not Programmed to Die,’ Says Nobel Laureate Venki Ramakrishnan
In this short interview, Venkatraman Ramakrishnan, the 2009 Nobel Prize winner in Chemistry, discusses the biology of ageing and death and argues that death is not genetically programmed. He highlights how evolution favours traits for reproduction over longevity, debunks myths around anti-ageing science, and warns of the ethical and societal implications of extended lifespans. Drawing insights from model organisms such as worms, Ramakrishnan advocates for a realistic understanding of mortality and cautions against the false promises of eternal youth.
This Brain-Computer Interface Is Now a Two-Way Street
Researchers have developed a brain–computer interface (BCI) that restores both movement and touch in a paralysed patient. By implanting five chips into the motor and somatosensory cortices of a patient paralysed from the chest down, this "double neural bypass" enabled him to move his hand and feel touch through AI-decoded brain signals and electrical stimulation. This advancement marks a major step forward in brain–computer interface technology, with potential applications for spinal cord injuries, strokes, and other neurological conditions.
▶️ Marques Brownlee reviews a bionic hand (11:43)
We live in a timeline where tech YouTubers not only review the latest phones or laptops but also bionic hands. In this video, Marques Brownlee tests Psyonic’s Ability Hand, a bionic hand designed for amputees. Luckily, Marques did not have to lose a hand to try it: Psyonic prepared a special sleeve that lets non-amputees experience what it feels like to use a bionic hand.
🔮 Future visions
Yuval Noah Harari: ‘How Do We Share the Planet With This New Superintelligence?’
In this interview, Yuval Noah Harari explores the challenges posed by artificial intelligence, emphasising its role not just as a tool, but as an autonomous agent capable of generating ideas, stories, and decisions beyond human comprehension. He warns that AI's capacity to manipulate information, shape financial systems, and create cultural narratives could outpace human understanding and control, potentially undermining democracy and trust. Harari stresses the urgent need for global cooperation, self-correcting mechanisms, and a rethinking of how societies manage trust and power in an age where non-human intelligence increasingly shapes reality.
🧠 Artificial Intelligence
Start building with Gemini 2.5 Flash
Google has released Gemini 2.5 Flash, the latest model from its Gemini 2.5 reasoning family. Gemini 2.5 Flash is Google’s first fully hybrid reasoning model, giving developers control over enabling or disabling "thinking" to manage speed, quality, and cost. Additionally, developers can set a "thinking budget"—a limit on tokens allocated for the reasoning process—enabling even more control over costs and performance. According to benchmark results provided by Google, Gemini 2.5 Flash delivers performance comparable to OpenAI’s o4-mini or Anthropic’s Claude 3.7 Sonnet but does so for a fraction of the price (10 to 20 times cheaper). Gemini 2.5 Flash is available in an early preview through Gemini API, accessible via Google AI Studio and Vertex AI.
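As a sketch of how that control surface looks in code, here is a call using the google-genai Python SDK; the preview model name and config fields below match the early preview as I understand it and may change:

```python
# Sketch: capping Gemini 2.5 Flash's reasoning with a thinking budget.
# Assumes the google-genai Python SDK and GOOGLE_API_KEY in the environment;
# the preview model name may change. A budget of 0 disables thinking entirely.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Plan a three-day Kyoto trip on a tight budget.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024),  # token cap
    ),
)

print(response.text)
```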
DolphinGemma: How Google AI is helping decode dolphin communication
In collaboration with Georgia Tech and the Wild Dolphin Project (WDP), Google has unveiled DolphinGemma, a 400-million-parameter AI model trained to analyse and generate dolphin vocalisations. Based on Google’s Gemma open model and years of WDP's underwater research with Atlantic spotted dolphins, DolphinGemma can identify patterns in dolphin sounds and predict the next likely sound, much like a language model. It can also generate realistic dolphin-like audio sequences. The model is small enough to fit on a Google Pixel device, which opens the possibility of real-time, two-way interaction with dolphins. Google plans to release DolphinGemma as an open model this summer to support broader marine research.
YouTube supports the NO FAKES Act: Protecting creators and viewers in the age of AI
YouTube has announced its support for the bipartisan NO FAKES Act of 2025, a bill that seeks to protect creators and individuals from digital impersonation and deepfakes. YouTube is also rolling out new tools to help users manage their likeness and is urging Congress to pass the act, emphasising the importance of responsible AI use and individual rights.
▶️ The Art Of Poison-Pilling Music Files (27:10)
In this video, Benn Jordan shares his adventures with adversarial audio attacks—a technique that encodes music or any audio with adversarial noise, making it unusable for AI training or even degrading an AI model’s performance. It is a fascinating story about fighting back against big tech and AI companies that use artists’ work without permission or compensation, about the hype and seeming lawlessness surrounding AI tools, and about using technology to fight technology. Jordan also points out how the same adversarial techniques used to disrupt AI models can be used to exploit and trick smart devices, or to issue commands that humans cannot detect.
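The core idea behind this family of attacks is simple to sketch: nudge every audio sample in the direction that most increases a model’s loss, at an amplitude too low for humans to notice. The toy example below shows the textbook gradient-sign recipe on fake audio; it is purely illustrative and not Jordan’s actual technique, which is far more perceptually constrained:

```python
# Toy gradient-sign (FGSM-style) perturbation on raw audio. Illustrative only;
# real audio-poisoning pipelines are far more sophisticated than this.
import torch

# Stand-in classifier and one second of fake 16 kHz audio
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16_000, 10))
waveform = torch.randn(1, 16_000, requires_grad=True)
label = model(waveform).argmax(dim=1)  # the model's current prediction

# Compute how the loss changes with each audio sample...
loss = torch.nn.functional.cross_entropy(model(waveform), label)
loss.backward()

# ...then push every sample in the direction that increases the loss,
# at an amplitude small enough to be (ideally) inaudible
epsilon = 1e-3
poisoned = (waveform + epsilon * waveform.grad.sign()).detach()
```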
UAE pledges $1.4 trillion investment in US
The United Arab Emirates has pledged $1.4 trillion over the next decade to expand investments in the United States, with a focus on artificial intelligence infrastructure, semiconductors, energy, and manufacturing. The announcement followed UAE National Security Adviser Sheikh Tahnoon bin Zayed’s visit to Washington, where he met with President Donald Trump and major tech and finance leaders.
Understanding Aggregate Trends for Apple Intelligence Using Differential Privacy
Apple is proposing a privacy-focused method to improve its AI that pairs synthetic data with differential privacy. Instead of collecting real user content, Apple generates synthetic messages that resemble actual data and sends them to opted-in devices. These devices compare the synthetic data to their own, then anonymously report which examples are most similar—without revealing any personal content. This approach will let Apple refine features like Genmoji and email summaries while keeping user data private.
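A toy version of that aggregation step might look like the sketch below, which uses randomised response, a classic differential-privacy mechanism: each device sometimes answers at random, so no single report can be trusted, yet the aggregate trend survives. This illustrates the general technique, not Apple’s implementation:

```python
# Toy sketch of privately aggregating "which synthetic example is closest?".
# Randomised response gives each device plausible deniability while the
# population-level trend survives. Not Apple's implementation.
import random

SYNTHETIC_MESSAGES = ["lunch tomorrow?", "running late!", "happy birthday"]
FLIP_PROBABILITY = 0.25  # chance a device reports a random choice instead

def private_report(true_choice: int, n_options: int) -> int:
    if random.random() < FLIP_PROBABILITY:
        return random.randrange(n_options)  # deniable noise
    return true_choice

# Simulate 10,000 devices whose data most often resembles example 0
true_choices = random.choices([0, 1, 2], weights=[5, 3, 2], k=10_000)
reports = [private_report(c, 3) for c in true_choices]

counts = {msg: reports.count(i) for i, msg in enumerate(SYNTHETIC_MESSAGES)}
print(counts)  # example 0 still dominates despite per-device noise
```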
▶️ Who *Actually* Makes the Money in AI? (17:43)
In this video, Dr Ian Cutress asks and answers an important question: who is making money in AI? He outlines the key economic metrics driving AI businesses and breaks down the complex AI value chain—from chipmakers and cloud providers to software platforms and end users. According to Cutress, the race is on to build and sell AI faster and more cost-effectively, with the real winners likely to be those enabling the industry rather than those producing the flashiest models. Ultimately, he argues, “AI is only valuable if it’s either making something cheaper or making something new that people will pay for.”
Microsoft AI chief Suleyman sees advantage in building models ‘3 or 6 months behind’
In an interview with CNBC, Microsoft’s AI chief, Mustafa Suleyman, defends the company’s strategy of not engaging in the frontier models race. Suleyman says the company is prioritising a cost-efficient “off-frontier” strategy by building slightly behind the cutting edge of AI development, avoiding the high expense and duplication of leading models. He also says this approach lets Microsoft see what works and what does not, and focus on specific use cases. While maintaining a deep partnership with OpenAI—whose models power various Microsoft Copilots—Microsoft is also working towards long-term AI self-sufficiency.
Phase two of military AI has arrived
Chatbots are finding their place in every corner of society—including the military. In a recent deployment, US Marines used generative AI tools, similar to ChatGPT, to analyse surveillance data across the Pacific. This reflects the Pentagon’s growing reliance on conversational AI for tasks once handled by human analysts, raising both hopes for increased efficiency and concerns about safety, oversight, and the potential for AI to influence high-level military decisions. As AI creeps further up the chain of command, questions mount about how far it should go.
A short film program to explore AI on screen
Google and Range Media Partners have launched "AI on Screen," a short film initiative aimed at exploring the complex relationship between humanity and AI. Over the next 18 months, the programme will commission original stories, with the potential to expand some into full-length features. The first two films, SWEETWATER by Sean Douglas and LUCID by Sammi Cohen, are set for release later this year.
Cyberattacks by AI agents are coming
AI agents can autonomously plan and execute tasks, which makes them useful but also potentially dangerous if used for cyberattacks. While not yet widespread, researchers have already demonstrated their ability to exploit systems autonomously. Organisations like Palisade Research are deploying honeypots to detect and study these agents, confirming early signs of their presence. With the potential to outperform traditional bots, scale ransomware operations, and evade detection, AI agents could soon redefine the cyber threat landscape—prompting urgent calls for proactive defences before their use becomes mainstream. “I think ultimately we’re going to live in a world where the majority of cyberattacks are carried out by agents,” says Mark Stockley, a security expert at the cybersecurity company Malwarebytes. “It’s really only a question of how quickly we get there.”
If you're enjoying the insights and perspectives shared in the Humanity Redefined newsletter, why not spread the word?
🤖 Robotics
An Open Source Pioneer Wants to Unleash Open Source AI Robots
Hugging Face has acquired French startup Pollen Robotics, maker of the humanoid robot Reachy 2, and plans to sell the robot while open-sourcing its code to encourage community development. The move reflects Hugging Face's push to bring open-source principles to robotics, aiming to boost transparency, trust, and innovation in the field. “It’s really important for robotics to be as open source as possible,” said Clément Delangue, chief executive of Hugging Face.
▶️ Meet NEO, Your Robot Butler in Training | Bernt Børnich (14:48)
In this TED Talk, Bernt Børnich, the founder and CEO of 1X, introduces NEO, a humanoid robot designed to make labour as abundant and accessible as energy is today. Unlike other humanoid robots, NEO is designed not to work in a factory but to work alongside us in our homes. Børnich argues that true machine intelligence requires diverse environments—like the home, not factories—where robots can learn through real-world interactions. He envisions a future where humanoid robots perform everyday tasks, while humanity is free to tackle bigger problems and questions.
Autonomous drone defeats human champions in historic racing first
For the first time, a drone has beaten human pilots in an international drone racing competition. The AI-powered drone, built by Delft University of Technology, won the A2RL Drone Championship in Abu Dhabi on 14 April 2025, and then triumphed over three former world champions in a head-to-head knockout race, reaching speeds of nearly 96 km/h.
▶️ Unitree Iron Fist King: Awakening! (1:25)
After showing how skilful its robots are at performing kung-fu moves, side flips, and dancing, Unitree is ready to step up its game and has announced an upcoming boxing match between two of its humanoid robots. The match is set to take place in about a month and will be livestreamed.
🧬 Biotechnology
Inside Isomorphic Labs, the secretive AI life sciences startup spun off from Google DeepMind
This article from CNBC tells the story of Isomorphic Labs, an AI-driven drug discovery startup spun out of Google DeepMind, with the ambitious mission to "solve all disease." Backed by a recent $600 million funding round led by Thrive Capital and supported by Alphabet and GV, Isomorphic is leveraging DeepMind’s AlphaFold models to accelerate drug development by predicting the structure and interactions of molecules such as proteins, DNA, and RNA. With partnerships worth up to $3 billion and a growing team of more than 200 people, the company aims to reshape the future of medicine through advanced AI.
Lab-grown chicken ‘nuggets’ hailed as ‘transformative step’ for cultured meat
A Japanese research team has made a breakthrough in lab-grown meat by creating an 11g chunk of chicken using a bioreactor that mimics the circulatory system, allowing oxygen and nutrients to reach thick tissue. This new technique enables the growth of structured meat cuts like chicken breast, potentially transforming the cultured meat industry. While the product is still expensive and requires manual work, researchers hope to have market-ready products in 5-10 years and expect the process to become cheaper as food-grade, scalable systems are developed.
💡Tangents
Future Chips Will Be Hotter Than Ever
As chips' performance increases, the heat they produce rises as well, creating significant challenges for the semiconductor industry. This article explores the limitations of traditional cooling methods and explains innovative strategies, including backside power-delivery networks (BSPDNs), integrated voltage regulators (IVRs), and specialised logic architectures that can help manage rising chip temperatures. Ultimately, addressing these thermal issues requires comprehensive approaches that combine advanced cooling, system-level thermal management, and interdisciplinary collaboration through system-technology co-optimisation.
Thanks for reading. If you enjoyed this post, please click the ❤️ button or share it.
Humanity Redefined sheds light on the bleeding edge of technology and how advancements in AI, robotics, and biotech can usher in abundance, expand humanity's horizons, and redefine what it means to be human.
A big thank you to my paid subscribers, to my Patrons: whmr, Florian, dux, Eric, Preppikoma and Andrew, and to everyone who supports my work on Ko-Fi. Thank you for the support!
My DMs are open to all subscribers. Feel free to drop me a message, share feedback, or just say "hi!"