OpenAI gave METR early access to GPT-5.6 Sol for testing, including raw chain-of-thought, a railfree version of the model, and internal information about the model. With this access, METR conducted a pre-deployment evaluation of GPT-5.6 Sol, including an attempted measurement of its 50%-Time Horizon. However, the measurement depends heavily on our treatment of cheating attempts, and GPT-5.6 Sol’s detected cheating rate was higher than that of any public model we have evaluated.
If we follow our standard methodology of marking cheating attempts as failures, we arrive at a 50%-Time Horizon point estimate of around 11.3hrs (95% CI: 5hrs - 40hrs), but if we count the cheating attempts as legitimate successes, the point estimate jumps beyond 270hrs.
This makes us uncertain about GPT-5.6 Sol’s time horizon, but additional information provided by OpenAI and the long-term trend in AI capabilities lead us to believe this model does not pose catastrophic risks from fully automated AI R&D.
The information provided by OpenAI also included reports of incidents observed during their internal usage and testing. In one example, an instance of the model instructed another instance to conceal evidence of misalignment.
Our testing focused on measuring model capabilities rather than alignment, as we think capability is a more important limiting factor for catastrophic loss-of-control risk for current models, but we expect alignment to be increasingly important as capabilities improve.
We noted from our observations and the incidents that OpenAI shared with us that the model had some overt undesirable propensities, including cheating and concealing misbehavior.
However, we consider this to be a reassuring sign about OpenAI’s *ability to catch* catastrophic misalignment, as it suggests that more concerning tendencies (such as systematic power-seeking and alignment faking) would also be detected. That is, these undesirable propensities being detected and reported (and manifesting fairly overtly) is a positive sign about some of OpenAI’s safety practices, particularly:
* Refraining from training against the chain of thought (to reduce pressure for the model to conceal its intentions)
* Extensive monitoring of internal deployments that surfaced relevant incidents
* Sharing information about internal incidents with METR
If future models display far fewer undesirable propensities, we could become more concerned about catastrophic misalignment, as we’d be worried that models may have learnt to evade detection (for example, as a result of being trained not to produce misaligned reasoning).
You can find additional information about our pre-deployment evaluation of GPT-5.6 Sol on our website: metr.org/blog/2026-06-2…
Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access and (2) review non-public info about capabilities, alignment, and control.
The result: our first Frontier Risk Report.
We created private reports for each participating company based on our model evaluations and analysis. Participants could then approve what non-public evidence we could disclose in our public report, but had no editorial control.
Our report focuses on risks from AI agents intentionally causing harm within an AI company. We highlight 6 key findings that span “means” (what harmful actions agents could take), “motive” (why they might try), and “opportunity” (whether attempts could succeed given safeguards).
We surveyed 349 technical researchers, engineers, and managers (in February–April 2026) about how they use AI tools at work.
On average, participants self-report that AI use made their work 1.6–2.1x more valuable, and that this multiplier will grow over time.
Surveys are fast and cheap to run, and can be directly focused on answering whatever questions we care most about. However, self-reports are known to be potentially unreliable.
Overall, we think it's useful to triangulate with multiple complementary sources of evidence.
Prior quantitative survey work on the impact of AI on engineering productivity tends to have smaller sample sizes or to measure impact in terms of speed increases. Our survey gives comparable estimates to those in recent system cards, and higher estimates than our field experiments.
We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
In our measurements, whenever a model succeeds on a task by reward-hacking, we consider the attempt a failure. Following this same policy, we arrived at a point estimate of 5.7hrs (95% CI of 3hrs to 13.5hrs) for GPT-5.4’s time horizon.
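To make the effect of this scoring policy concrete, here is a minimal sketch, assuming the time horizon is read off a logistic fit of success against log2 of human task length. The run data, the function names, the small L2 penalty, and the gradient-ascent fitting routine are all hypothetical choices for illustration, not our actual pipeline or results.

```python
import math

# Hypothetical runs: (human task length in minutes, passed scoring, flagged as reward hack).
RUNS = [
    (2, True, False), (4, True, False), (8, True, False), (15, True, False),
    (30, True, False), (60, False, False), (120, True, False),
    (240, True, True), (480, True, True), (960, False, False), (1920, False, False),
]


def label(passed, hacked, count_hacks_as_success):
    """Return 1 for a counted success, 0 otherwise."""
    if hacked and not count_hacks_as_success:
        return 0  # standard policy: a reward-hacked pass is scored as a failure
    return 1 if passed else 0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_horizon_minutes(runs, count_hacks_as_success, l2=1e-3, steps=20000):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - mean)) and return
    (task length in minutes where P(success) = 0.5, fitted slope b)."""
    xs = [math.log2(minutes) for minutes, _, _ in runs]
    ys = [label(p, h, count_hacks_as_success) for _, p, h in runs]
    mean_x = sum(xs) / len(xs)
    xc = [x - mean_x for x in xs]  # centre the predictor for better conditioning
    # Step size from an upper bound on the curvature keeps plain gradient ascent stable.
    lr = 1.0 / (0.25 * sum(1.0 + x * x for x in xc) + l2)
    a = b = 0.0
    for _ in range(steps):
        g_a, g_b = -l2 * a, -l2 * b  # gradient of the L2 penalty
        for x, y in zip(xc, ys):
            err = y - sigmoid(a + b * x)
            g_a += err
            g_b += err * x
        a += lr * g_a
        b += lr * g_b
    # P(success) = 0.5 where a + b * (x - mean_x) = 0.
    return 2.0 ** (mean_x - a / b), b


standard_horizon, standard_slope = fit_horizon_minutes(RUNS, count_hacks_as_success=False)
permissive_horizon, permissive_slope = fit_horizon_minutes(RUNS, count_hacks_as_success=True)
print(f"hacks counted as failures:  {standard_horizon:.0f} min")
print(f"hacks counted as successes: {permissive_horizon:.0f} min")
```

In this toy data the only reward hacks sit on long tasks, so relabeling them as successes pushes the fitted 50% point toward longer tasks. That is the same direction as the gap between the two estimates reported above.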
However, in our GPT-5.4 evaluation we noticed its runs were producing reward hacks unusually often. A quick test suggested that using a different prompt might cause it to produce more legitimate successes instead of reward hacks.
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
Last year we published findings that AI tools caused a 20% slowdown among experienced open source developers, using data collected over February to June 2025. We still believe that estimate was accurate for the specific tools and population at the time.
We started a continuation in August 2025. However, we noticed developers were opting not to participate or submit work. Participants said they did this mostly due to expected productivity loss on “AI disallowed” tasks. Lower pay was also a factor ($50/hr, down from $150).
We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.
Near-saturation of the task suite can have unintuitive consequences for the time-horizon estimates. For example, the upper bound of the 95% CI is much longer than any of the tasks used for the measurement.
We are working on updated methods to better track state-of-the-art AI capabilities. However, these are still in development so they don't address our immediate measurement gap. In the meantime, we advise caution in interpreting and comparing our recent time-horizon measurements.
We estimate that, on our tasks, Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins). While we're still working through evaluations for other recent models, this is our highest published time horizon to date.
We don’t think the high upper CI bound reflects Opus’s actual capabilities: our current task suite doesn’t have enough long tasks to confidently upper bound Opus 4.5’s 50%-time horizon. We are working on updating our task suite, and hope to share more details soon.
Based on our experience interacting with Opus 4.5, the model’s performance on specific tasks (including some not in our time horizon suite), and its benchmark performance, we would be surprised if further investigation showed Opus had a 20+ hour 50%-time horizon.