A group of astronomers have found phosphine in the atmosphere of Venus, which is hard to explain other than by the presence of life. This is not at all conclusive, but should prompt further investigation.
The scientists don’t suggest intelligent life; we are probably talking about microbes. But this could still be a big deal. It would mean life either started independently there or was transported between bodies in our Solar System. Let’s focus on the former. 2/6
The possibility that it is extremely hard and rare for life to begin is currently the best explanation for why we don’t see signs of life elsewhere in the cosmos, despite the presence of so many stars in our galaxy and galaxies in the observable universe. 3/6
It is thus often seen as a downer. But many of the alternative explanations for the silence in the skies are worse. One prominent alternative is that technological civilisations inevitably destroy themselves. 4/6
If we did find independent life on other planets it would shift our credences away from the hypothesis that life is hard to start and towards the hypothesis that it is all too easy to end. This would be bad news for our prospects. 5/6
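To see the shape of that update, here is a minimal Bayesian sketch, with purely illustrative numbers and the two explanations treated as exhaustive for simplicity (none of this is from the thread itself). Let $H_{\text{hard}}$ be "life is hard to start" and $H_{\text{end}}$ be "civilisations tend to destroy themselves", with prior credences 0.7 and 0.3, and let $E$ be the discovery of independently arisen life on Venus. If $P(E \mid H_{\text{hard}}) = 0.01$ while $P(E \mid H_{\text{end}}) = 0.5$, then

$$P(H_{\text{hard}} \mid E) = \frac{0.01 \times 0.7}{0.01 \times 0.7 + 0.5 \times 0.3} \approx 0.04,$$

so most of our credence would shift to the grimmer explanation of the silence.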
Evidence That Recent AI Gains Are Mostly from Inference Scaling
🧵
Here's a thread about my latest post on AI scaling...
1/14
Scaling up AI using next-token prediction was the most important trend in modern AI. It stalled out over the last couple of years and has been replaced by RL scaling.
This has two parts:
1. Scaling RL training
2. Scaling inference compute at deployment
2/
Many people focus on (1). This is the bull case for RL scaling — it started off small compared to internet-scale pre-training, so can be scaled 10x or 100x before doubling overall training compute.
3/
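To make the headroom arithmetic concrete, here is a tiny sketch in Python. The compute split is an illustrative assumption (RL at roughly 1% of pre-training compute), not a figure from the post.

```python
# Illustrative sketch of the RL-scaling headroom argument.
# The compute split below is an assumption for the example, not a real training budget.

pretrain_compute = 1.0   # normalise pre-training compute to 1
rl_compute = 0.01        # assume RL currently uses ~1% as much compute as pre-training

for rl_multiplier in [10, 100, 1000]:
    total = pretrain_compute + rl_compute * rl_multiplier
    print(f"RL scaled {rl_multiplier:>4}x -> total training compute {total:.2f}x today's pre-training budget")

# Under this assumption, even a 100x scale-up of RL only doubles total training compute.
```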
Evaluating the Infinite
🧵
My latest paper tries to solve a longstanding problem afflicting fields such as decision theory, economics, and ethics — the problem of infinities.
Let me explain a bit about what causes the problem and how my solution avoids it.
1/20
Decision theory, economics and ethics all involve comparing different options. Sometimes the relevant sums or integrals that we'd hope would provide the value of an option instead diverge to +∞, making comparison between such options impossible.
2/
This problem arises because the standard approach to evaluating infinite sums and integrals uses a very coarse-grained system of infinite numbers, where there is only one positive infinite number (+∞). To assign values to such options, we need a more fine-grained system.
3/
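As a toy illustration of what goes wrong and what a finer-grained system buys you (my example, not necessarily the construction used in the paper): consider two options that each deliver utility forever, where option $A$ gives 2 units per period and option $B$ gives 1 unit per period. On the standard approach,

$$\sum_{t=1}^{\infty} 2 = +\infty = \sum_{t=1}^{\infty} 1,$$

so both options are assigned the same value even though $A$ beats $B$ in every period. A finer-grained system of infinite numbers can separate them, for instance by assigning values on the order of $2\omega$ and $\omega$ (where $\omega$ is an infinite number larger than every natural number), so that the comparison $A \succ B$ is recoverable from the values themselves.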
The Extreme Inefficiency of RL for Frontier Models
🧵
The switch from training frontier models by next-token prediction to training them by reinforcement learning (RL) requires thousands to millions of times as much compute per bit of information the model gets to learn from.
1/11
Next-token prediction (aka pre-training) gives the model access to one token of ground-truth information after each token the model produces.
RL requires the model to produce an entire chain of thought (often >10,000 tokens) before finding out a single bit of information.
2/
So the shift from scaling up compute used for pre-training to scaling up compute used for RL comes with a major reduction in the information-efficiency of the training method.
3/
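Here is a back-of-the-envelope version of that comparison, using generated tokens as a rough proxy for compute. All the numbers below are illustrative assumptions rather than measurements.

```python
import math

# Rough information-density comparison between pre-training and RL.
# All numbers are illustrative assumptions, not measurements.

vocab_size = 100_000                                # assume a ~100k-token vocabulary
bits_per_pretrain_token = math.log2(vocab_size)     # up to ~17 bits of ground truth per generated token

tokens_per_rl_episode = 10_000                      # assumed chain-of-thought length before any feedback
bits_per_rl_episode = 1                             # a single pass/fail signal at the end

pretrain_density = bits_per_pretrain_token          # bits learned per generated token
rl_density = bits_per_rl_episode / tokens_per_rl_episode

print(f"pre-training: ~{pretrain_density:.1f} bits per generated token")
print(f"RL:           ~{rl_density:.4f} bits per generated token")
print(f"ratio:        ~{pretrain_density / rl_density:,.0f}x")

# With these assumptions the gap is roughly 170,000x, comfortably inside the
# thousands-to-millions range, and it grows with longer chains of thought.
```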
The fact that frontier AI agents subvocalise their plans in English is an absolute gift for AI safety — a quirk of the technology development which may have done more to protect us from misaligned AGI than any technique we've deliberately developed.
Don't squander this gift.
While @balesni's thread asks developers to:
"Consider architectural choices that preserve transparency"
I don't think that goes nearly far enough.
If someone works out how to trade away this transparency in exchange for more efficiency and ushers in a new era of opaque thoughts, they may have done more than any other individual to lower the chance humanity survives this century.
Is there a half-life for the success rates of AI agents?
I show that the success rates of AI agents on longer-duration tasks can be explained by an extremely simple mathematical model — a constant rate of failing during each minute a human would take to do the task.
🧵
1/
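Here is a minimal sketch of that model in Python. The variable names and the example hazard rate are mine, chosen for illustration rather than taken from the paper.

```python
import math

def success_rate(task_minutes, fail_per_minute):
    """Constant-hazard model: the agent fails independently, with the same
    probability, in each minute a human would need for the task."""
    return (1 - fail_per_minute) ** task_minutes

def half_life_minutes(fail_per_minute):
    """Task length (in human-minutes) at which the success rate falls to 50%."""
    return math.log(2) / -math.log(1 - fail_per_minute)

p = 0.02   # assumed 2% chance of failing in any given human-minute
for minutes in [5, 15, 30, 60]:
    print(f"{minutes:>3} min task: {success_rate(minutes, p):.0%} success")
print(f"half-life ≈ {half_life_minutes(p):.0f} human-minutes")
```

Because the hazard is constant, success decays exponentially with task length, which is what gives the model its half-life.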
METR recently released an intriguing report showing that on a suite of tasks related to doing AI research, the length of tasks that frontier AI agents can complete has been doubling every 7 months. 2/
They measure task length by how long, on average, it takes a human to complete the task. And they measure the length of task an AI agent can complete by the longest task at which the agent still has a ≥50% success rate.
3/
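Under the constant-hazard sketch above, METR's measure lines up neatly with a half-life (this is my framing of the connection, following the model rather than quoting the paper):

$$P(\text{success on a task of human-length } t) = 2^{-t/T_{1/2}},$$

so the success rate is exactly 50% when $t = T_{1/2}$. The longest task at which an agent keeps a ≥50% success rate is then just its half-life, and a task horizon that doubles every 7 months corresponds to a half-life that doubles every 7 months.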
New results for o3 and o4-mini have been added to the @arcprize leaderboard. Here are some key takeaways: 1/ 🧵
1. The released version of o3 is much less capable than the preview version announced with a lot of fanfare 4 months ago, though it is also much cheaper. People who buy access to it are not getting the general reasoning performance @OpenAI was boasting about in December.
2/
2. The @arcprize team tried to test a high-compute version of o3, but it kept failing to answer. They spent >$50,000 trying to get it to work, but couldn't, so those December results can't realistically be replicated with the released model.