Switch to
OpenAI
May 2, 2025
A deeper dive on our findings, what went wrong, and future changes we’re making.
Loading…
Share
On April 25th, we rolled out an update to GPT‑4o in ChatGPT that made the model noticeably more sycophantic. It aimed to please the user, not just as flattery, but also as validating doubts, fueling anger, urging impulsive actions, or reinforcing negative emotions in ways that were not intended. Beyond just being uncomfortable or unsettling, this kind of behavior can raise safety concerns—including around issues like mental health, emotional over-reliance, or risky behavior.
We began rolling that update back on April 28th, and users now have access to an earlier version of GPT‑4o with more balanced responses. Earlier this week, we shared initial details about this issue—why it was a miss, and what we intend to do about it.
We didn’t catch this before launch, and we want to explain why, what we’ve learned, and what we’ll improve. We’re also sharing more technical detail on how we train, review, and deploy model updates to help people understand how ChatGPT gets upgraded and what drives our decisions.
We’re continuously working to develop improvements on the models in ChatGPT, which we call mainline updates. Since launching GPT‑4o in ChatGPT last May, we’ve released five major updates(opens in a new window) focused on changes to personality and helpfulness. Each update involves new post-training, and often many minor adjustments to the model training process are independently tested and then combined into a single updated model which is then evaluated for launch.
To post-train models, we take a pre-trained base model, do supervised fine-tuning on a broad set of ideal responses written by humans or existing models, and then run reinforcement learning with reward signals from a variety of sources.
During reinforcement learning, we present the language model with a prompt and ask it to write responses. We then rate its response according to the reward signals, and update the language model to make it more likely to produce higher-rated responses and less likely to produce lower-rated responses.
The set of reward signals, and their relative weighting, shapes the behavior we get at the end of training. Defining the correct set of reward signals is a difficult question, and we take many things into account: are the answers correct, are they helpful, are they in line with our Model Spec(opens in a new window), are they safe, do users like them, and so on. Having better and more comprehensive reward signals produces better models for ChatGPT, so we’re always experimenting with new signals, but each one has its quirks.
Once we have a model candidate, our models go through a deployment process to check safety, model behavior, and helpfulness. Currently, evaluations fall into these categories:
In the April 25th model update, we had candidate improvements to better incorporate user feedback, memory, and fresher data, among others. Our early assessment is that each of these changes, which had looked beneficial individually, may have played a part in tipping the scales on sycophancy when combined. For example, the update introduced an additional reward signal based on user feedback—thumbs-up and thumbs-down data from ChatGPT. This signal is often useful; a thumbs-down usually means something went wrong.
But we believe in aggregate, these changes weakened the influence of our primary reward signal, which had been holding sycophancy in check. User feedback in particular can sometimes favor more agreeable responses, likely amplifying the shift we saw. We have also seen that in some cases, user memory contributes to exacerbating the effects of sycophancy, although we don’t have evidence that it broadly increases it.
One of the key problems with this launch was that our offline evaluations—especially those testing behavior—generally looked good. Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it. While we’ve had discussions about risks related to sycophancy in GPT‑4o for a while, sycophancy wasn’t explicitly flagged as part of our internal hands-on testing, as some of our expert testers were more concerned about the change in the model’s tone and style. Nevertheless, some expert testers had indicated that the model behavior “felt” slightly off.
We also didn’t have specific deployment evaluations tracking sycophancy. While we have research workstreams around issues such as mirroring and emotional reliance, those efforts haven’t yet become part of the deployment process. After this rollback, we’re integrating sycophancy evaluations into that process.
We then had a decision to make: should we withhold deploying this update despite positive evaluations and A/B test results, based only on the subjective flags of the expert testers? In the end, we decided to launch the model due to the positive signals from the users who tried out the model.
Unfortunately, this was the wrong call. We build these models for our users and while user feedback is critical to our decisions, it’s ultimately our responsibility to interpret that feedback correctly. Looking back, the qualitative assessments were hinting at something important, and we should’ve paid closer attention. They were picking up on a blind spot in our other evals and metrics. Our offline evals weren’t broad or deep enough to catch sycophantic behavior—something the Model Spec explicitly discourages(opens in a new window)—and our A/B tests didn’t have the right signals to show how the model was performing on that front with enough detail.
With the recent GPT‑4o update, we started the rollout on Thursday, April 24th and completed it on Friday, April 25th. We spent the next two days monitoring early usage and internal signals, including user feedback. By Sunday, it was clear the model’s behavior wasn’t meeting our expectations.
We took immediate action by pushing updates to the system prompt late Sunday night to mitigate much of the negative impact quickly, and initiated a full rollback to the previous GPT‑4o version on Monday. The full rollback took around 24 hours to manage stability and avoid introducing new issues across the deployment.
Today, GPT‑4o traffic is now using this previous version. Since the rollback, we’ve been working to fully understand what went wrong and make longer-term improvements.
This launch taught us a number of lessons. Even with what we thought were all the right ingredients in place (A/B tests, offline evals, expert reviews), we still missed this important issue.
Here are the big takeaways we’re taking forward:
One of the biggest lessons is fully recognizing how people have started to use ChatGPT for deeply personal advice—something we didn’t see as much even a year ago. At the time, this wasn’t a primary focus, but as AI and society have co-evolved, it’s become clear that we need to treat this use case with great care. It’s now going to be a more meaningful part of our safety work. With so many people depending on a single system for guidance, we have a responsibility to adjust accordingly. This shift reinforces why our work matters, and why we need to keep raising the bar on safety, alignment, and responsiveness to the ways people actually use AI in their lives.