Rob Bensinger ⏹️
Comms @MIRIBerkeley. RT = increased vague psychological association between myself and the tweet.
Dec 10, 2023 10 tweets 2 min read
@robertwiblin @TheZvi I think EA does have a non-small directional bias (relative to optimal performance) toward preferring legible, relatively-unmediated-by-theory lines of reasoning over mechanistic models, "inside views", etc.

Which I suspect yields the "neglect second-order effects" behavior.

@robertwiblin @TheZvi It makes sense to be wary of pie-in-the-sky theories and to want "what if this is all BS?" sanity checks, but EA seems to me to go significantly too far in the opposite direction.
Apr 4, 2023 49 tweets 10 min read
I've been citing lesswrong.com/posts/uMQ3cqWD… to explain why the situation with AI looks doomy to me. But that post is relatively long, and emphasizes specific open technical problems over "the basics".

Here are 10 things I'd focus on if I were giving "the basics" on why I'm worried:

1. A common misconception is that the core danger is something murky about "agents" or about self-awareness.

Instead, I'd say that the danger is inherent to the nature of mental and physical action sequences that push the world toward some sufficiently-hard-to-reach state.
Mar 10, 2023 25 tweets 5 min read
Responding to someone who said he agreed with @AndrewYNg at the time that worrying about smarter-than-human AI was "like worrying about overpopulation on Mars", but now he thinks Mars is starting to fill up:

It really was a uniquely bad argument at the time. It pumps on a bunch of intuitions, without arguing for a single one of them:

- Overpopulation isn't a problem on Earth, but people panicked about it in the 1970s and after. By analogy, AI risk is supposed to be an inherently silly thing to worry about.
Feb 10, 2023 6 tweets 1 min read
Proposal: try to learn things about alignment by training models that ONLY output offensive content.

This tests exactly the same things as 'trying to get models to never say mainstream-offensive things', but makes it less likely alignment gets confused with 'make LLMs bland'.

"AI alignment" / "Friendly AI" is actually AGI notkilleveryoneism. There's a genuine danger in equating "our chatbot didn't cause us to get sued or cancelled" with "we solved the alignment problem", or in blurring the lines between alignment and "AI ethics" or broad "AI safety".
Feb 10, 2023 5 tweets 1 min read
I'm not a big fan of the "takeoff" analogy for AGI. In real life, AGI doesn't need to "start on the ground". You can just figure out how to do AGI and find that the easy way to do AGI immediately gets you a model that's far smarter than any human. Less "takeoff", more "teleport".

AGI capabilities can then "take off" from that point, but the takeoff begins from outer space, not from subhuman or par-human capability levels.

Inventing something involves a 0-to-1 leap at the point of going from "this doesn't work" to "this does work now".
Feb 4, 2023 11 tweets 2 min read
"hmm, that would involve coordinating numerous people—we may be arrogant enough to think that we might build a god-machine that can take over the world and remake it as a paradise, but we aren't delusional"

This, but unironically!

Like, yes, point taken, this feels like a bizarre situation to be in. And I agree with lesswrong.com/posts/uFNgRumr… that there are sane ways to slow progress to some degree, which are worth pursuing alongside alignment work and other ideas to cause the long-term future to go well.
Dec 27, 2022 4 tweets 5 min read
@CryptoSecundus @RollinReisinger @JeffLadish @xlr8harder @LesaunH About as likely as the AI deciding to spare everyone who wears a blue hat.

Suppose that the AI's sole goal is to maximize the number of granite spheres in its future light cone. Would such an AI reward the humans that brought it into existence?

@CryptoSecundus @RollinReisinger @JeffLadish @xlr8harder @LesaunH No, because rewarding humans in that situation is not the granite-sphere-maximizing action. Disassembling the humans in order to build more granite-sphere infrastructure is the granite-sphere-maximizing action in that case.
Dec 27, 2022 7 tweets 2 min read
Huh. I hadn't heard of "Washington's Rules of Civility" before and was expecting them to be really cool.

But reading them, I mostly just see in them a cowardly and lying culture — a decadent and hollow culture, paralyzed with self-abnegation, far more than what I see nowadays.

Rules 1, 8, 20-23, 25, 32, 35-36, 38, 44-45, 49-50, 56, 58-59, 72-75, 80, 82, 86-89, and 110 are OK? But most of the rules are some version of "it is shameful to have an itchy elbow" or "be wary of speaking your mind if it might hurt anyone's feelings".
en.wikisource.org/wiki/The_Rules…
Dec 27, 2022 5 tweets 1 min read
Spending time explaining EA to my family over the holidays has helped remind me of why I think "do the Most! Possible! Good!" is a good concept, and a good rallying cry for a community.

At a glance, it's weird to have a community that has such crazily heterogeneous interests. But, in fact, it sure is bizarrely rare that groups of science-minded human beings explicitly try to justify their actions or strategies in terms of "do the Most! Possible! Good!"!

And it sure is nice to have a rallying cry that encourages cause neutrality / disloyalty.
Dec 15, 2022 7 tweets 2 min read
A surprising thing I've realized over time is that I can often outperform without being super clever, just by doing normal garden-variety thinking and not letting the thinking get derailed by [List of Tempting Distractions and Simple Mistakes].

Rather a lot of work is done by just following thoughts through to their conclusion, consistently applying "easy" reasoning methods, etc. Many unusual and important conclusions can be reached without your being sparklingly creative or anything.
Dec 14, 2022 15 tweets 6 min read
Thread for examples of alignment research MIRI has said relatively positive stuff about:

("Relatively" because our overall view of the field is that not much progress has been made, and it's not clear how we can change that going forward. But there's still better vs. worse.)

Nate singles out John Wentworth's Natural Abstractions as a line of future research that "could maybe help address the core difficulty if it succeeds wildly more than I currently expect it to succeed". Ditto "sufficiently ambitious interpretability work".

lesswrong.com/s/v55BhXbpJuaE…
Oct 13, 2022 11 tweets 2 min read
A common failure mode for people who pride themselves in being foxes (as opposed to hedgehogs):

Paying more attention to easily-evaluated claims that don't matter much, at the expense of hard-to-evaluate claims that matter a lot.

E.g., maybe there's an RCT that isn't very relevant, but is pretty easily interpreted and is conclusive evidence for some claim.

At the same time, maybe there's an informal argument that matters a lot more, but it takes some work to know how much to update on it...
Oct 13, 2022 4 tweets 1 min read
"Not dying is a good way to achieve a wide variety of goals" is not a suspicious convergence!

It would be suspicious if existential risk reduction *didn't* help with nearly every other goal! It would suggest that something fishy is going on, or we're doing something wrong.

(Note that there's no a priori reason x-risk reduction has to be *tractable*. So x-risk reduction could be a bad way to help with the arts, because it's a bad idea in general.

But x-risk reduction should be extremely convergent relative to its tractability.)
Jun 17, 2022 24 tweets 9 min read
A lot of the relative placements on that AGI political compass meme seemed very wrong to me, so here's one that does match my current impressions:

(My incredibly vague, amazingly low-confidence, June 17 2022 impressions.)

[image]

I did a tiny bit of Googling, but a lot of the comparisons are very subjective, or based on guesswork, or based on info that's likely super out-of-date. Treat this like an untrustworthy rumor you heard someone casually toss out at a party, not like a distillation of knowledge.
May 16, 2022 12 tweets 2 min read
If the world is likeliest to be saved by sober scholarship, then let us be sober scholars in the face of danger. If the world is likeliest to be saved by playful intellectual exploration, then let us be playful in the face of danger.

Strategic, certainly; aware of our situation, yes; but let us not throw away the one mental mode that can actually save us, if that's in fact our situation.
May 14, 2022 5 tweets 1 min read
I think all of these things at once:

1. Compared to the rest of the world, effective altruism is absolutely goddamned amazing. It's remarkable and disturbing how rare the basic EA combination of traits is, and it suggests EA is something precious, to be protected and grown.

(The EA combination of traits is things like: taking ideas seriously; assigning enormous weight to RCTs *and* weird philosophy arguments; not having a specific crazy philosophy commitment; numerate, utilitarian-style, not-biased-toward-inaction thinking in high-stakes dilemmas.)
May 11, 2022 7 tweets 2 min read
At the borderlands of EA and non-EA, I find that the main argument I tend to want to cite is Bayes:

'Yep, A seems possible. But if not-A were true instead, what would you expect to see differently? How well does not-A retrodict the data, compared to A?'

And relatedly, 'What are the future predictions of A versus not-A, and how soon can we get data that provides nontrivial evidence for one side versus the other?' But that's a more standard part of the non-EA college-educated person's toolbox.
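A minimal sketch of the odds form of Bayes' rule that this question points at, in Python: multiply the prior odds of A over not-A by one likelihood ratio per observation ("how much better does A retrodict this data point than not-A?"). The function names and numbers below are illustrative assumptions, not anything from the thread.

```python
# Toy illustration of comparing hypothesis A against not-A via likelihood ratios.
# Posterior odds = prior odds * product of likelihood ratios, where each ratio is
# P(observation | A) / P(observation | not-A).

def posterior_odds(prior_odds: float, likelihood_ratios: list[float]) -> float:
    """Return the odds of A over not-A after seeing the data."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr  # lr > 1 favors A, lr < 1 favors not-A
    return odds

def to_probability(odds: float) -> float:
    """Convert odds for A into P(A)."""
    return odds / (1.0 + odds)

if __name__ == "__main__":
    # Hypothetical numbers: A starts at 1:3 odds against, but two observations
    # each fit A four times better than not-A, so the data pulls strongly toward A.
    odds = posterior_odds(prior_odds=1 / 3, likelihood_ratios=[4.0, 4.0])
    print(f"Posterior P(A) = {to_probability(odds):.2f}")  # ~0.84
```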