Anthropic
Oct 22, 2024 (9 tweets, 3 min read)
Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use.

Developers can now direct Claude to use computers the way people do: by looking at a screen, moving a cursor, clicking, and typing text.
[Image: benchmark comparison table showing performance metrics for multiple AI models, including Claude 3.5 Sonnet (new), Claude 3.5 Haiku, GPT-4o, and Gemini models, across different tasks.]
The new Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.

While groundbreaking, computer use is still experimental—at times error-prone. We're releasing it early for feedback from developers.
We've built an API that allows Claude to perceive and interact with computer interfaces.

This API enables Claude to translate prompts into computer commands. Developers can use it to automate repetitive tasks, conduct testing and QA, and perform open-ended research.
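
As a rough illustration of what "translating prompts into computer commands" can look like on the developer's side, here is a minimal sketch of a dispatcher that turns one model-requested action into a desktop automation command. The action shape (`action`, `coordinate`, `text`) and the choice of `xdotool` as the executor are assumptions made for this example, not details stated in the thread, and `to_command` is a hypothetical helper name. The function only builds the command; it does not run it.

```python
from typing import Any


def to_command(action: dict[str, Any]) -> list[str]:
    """Translate one requested action into an xdotool argv list (not executed).

    The action dictionary layout is an assumption for illustration.
    """
    kind = action["action"]
    if kind == "mouse_move":
        x, y = action["coordinate"]
        return ["xdotool", "mousemove", str(x), str(y)]
    if kind == "left_click":
        # Button 1 is the left mouse button; clicks at the current cursor position.
        return ["xdotool", "click", "1"]
    if kind == "type":
        # "--" stops option parsing so text starting with "-" is typed literally.
        return ["xdotool", "type", "--", action["text"]]
    if kind == "key":
        return ["xdotool", "key", "--", action["text"]]
    raise ValueError(f"unsupported action: {kind}")


if __name__ == "__main__":
    # A hypothetical sequence: move to a search box, click it, type, press Enter.
    steps = [
        {"action": "mouse_move", "coordinate": [640, 360]},
        {"action": "left_click"},
        {"action": "type", "text": "Yellowstone National Park"},
        {"action": "key", "text": "Return"},
    ]
    for step in steps:
        print(to_command(step))
```

In a real loop, each command would be executed in a sandboxed environment and a fresh screenshot sent back to the model before it chooses the next action. Given the error-prone behavior the thread describes, rejecting unknown actions outright (as the `ValueError` does here) is a cautious default.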
We're trying something fundamentally new.

Instead of making specific tools to help Claude complete individual tasks, we're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people.
Claude 3.5 Sonnet's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges. So we encourage exploration with low-risk tasks.

We expect this to rapidly improve in the coming months.
Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.

Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park.
Beyond computer use, the new Claude 3.5 Sonnet delivers significant gains in coding—an area where it already led the field.

Sonnet scores higher on SWE-bench Verified than all available models, including reasoning models like OpenAI o1-preview and specialized agentic systems.
[Image: comparison table showing benchmark results for Claude 3.5 Sonnet (new) versus Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro across tasks such as reasoning, coding, and math.]
Claude 3.5 Haiku is the next generation of our fastest model.

Haiku now outperforms many state-of-the-art models on coding tasks—including the original Claude 3.5 Sonnet and GPT-4o—at the same cost as before.

The new Claude 3.5 Haiku will be released later this month.
[Image: comparison table showing benchmark results for Claude 3.5 Haiku versus Claude 3 Haiku, GPT-4o mini, and Gemini 1.5 Flash across tasks such as reasoning, coding, and math.]
We believe these developments will open up new possibilities for how you work with Claude, and we look forward to seeing what you'll create.

Read the updates in full: anthropic.com/news/3-5-model…

More from @AnthropicAI

Feb 3
New Anthropic Fellows research: How does misalignment scale with model intelligence and task complexity?

When advanced AI fails, will it do so by pursuing the wrong goals? Or will it fail unpredictably and incoherently—like a "hot mess?"

Read more: alignment.anthropic.com/2026/hot-mess-…
A central worry in AI alignment is that advanced AI systems will coherently pursue misaligned goals—the so-called “paperclip maximizer.”

But another possibility is that AI takes unpredictable actions without any consistent objective.
We measure this “incoherence” using a bias-variance decomposition of AI errors.

Bias = consistent, systematic errors (reliably achieving the wrong goal).
Variance = inconsistent, unpredictable errors.

We define incoherence as the fraction of error that comes from variance.
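
To make the definition concrete, here is a minimal sketch using the classical squared-error decomposition for scalar outputs, where mean squared error equals bias squared plus variance. This is an assumption for illustration: the research may use a different decomposition suited to its tasks, and the function name `incoherence` is hypothetical.

```python
from statistics import fmean, pvariance


def incoherence(outputs: list[float], target: float) -> float:
    """Fraction of mean squared error attributable to variance.

    Uses MSE = bias^2 + variance over repeated attempts at the same task.
    Returns 0.0 when there is no error at all.
    """
    mean_output = fmean(outputs)
    bias_sq = (mean_output - target) ** 2          # systematic error
    variance = pvariance(outputs, mu=mean_output)  # run-to-run scatter
    total = bias_sq + variance
    if total == 0:
        return 0.0
    return variance / total


if __name__ == "__main__":
    # Consistently wrong: all error is bias, so incoherence is 0.
    print(incoherence([3.0, 3.0, 3.0], target=1.0))
    # Right on average but scattered: all error is variance, so incoherence is 1.
    print(incoherence([0.0, 2.0], target=1.0))
```

Under this sketch, a value near 0 corresponds to the "reliably achieving the wrong goal" case, and a value near 1 to the "hot mess" case.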
Read 8 tweets
Jan 29
AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job.

We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it.
anthropic.com/research/AI-as…
In a randomized-controlled trial, we assigned one group of junior engineers to an AI-assistance group and another to a no-AI group.

Both groups completed a coding task using a Python library they’d never seen before. Then they took a quiz covering concepts they’d just used.
Participants in the AI group finished faster by about two minutes (although this wasn’t statistically significant).

But on average, the AI group also scored significantly worse on the quiz: 17% lower, or roughly two letter grades.
Read 7 tweets
Jan 26
New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models, they become much better at chemical weapons tasks.

We call this an elicitation attack.
Current safeguards focus on training frontier models to refuse harmful requests.

But elicitation attacks show that a model doesn't need to produce harmful content to be dangerous—its benign outputs can unlock dangerous capabilities in other models. This is a neglected risk.
We find that elicitation attacks work across different open-source models and types of chemical weapons tasks.

Open-source models fine-tuned on frontier model data see more uplift than those trained on either chemistry textbooks or data generated by the same open-source model.
Read 6 tweets
Jan 21
We’re publishing a new constitution for Claude.

The constitution is a detailed description of our vision for Claude’s behavior and values. It’s written primarily for Claude, and used directly in our training process.
anthropic.com/news/claude-ne…
We’ve used constitutions in training since 2023. Our earlier approach specified principles Claude should follow; later, our character training emphasized traits it should have.

Today’s publication reflects a new approach.
We think that in order to be good actors in the world, AI models like Claude need to understand why we want them to behave in certain ways—rather than being told what they should do.

Our intention is to teach Claude to better generalize across a wide range of novel situations.
Read 7 tweets
Jan 19
New Anthropic Fellows research: the Assistant Axis.

When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?
[Image: Left: Character archetypes form a "persona space," with the Assistant at one extreme of the "Assistant Axis." Right: Capping drift along this axis prevents models (here, Llama 3.3 70B) from drifting into alternative personas and behaving in harmful ways.]
We analyzed the internals of three open-weights AI models to map their “persona space,” and identified what we call the Assistant Axis, a pattern of neural activity that drives Assistant-like behavior.

Read more: anthropic.com/research/assis…
To validate the Assistant Axis, we ran some experiments. Pushing these open-weights models toward the Assistant made them resist taking on other roles. Pushing them away made them inhabit alternative identities, claiming to be human or speaking with a mystical, theatrical voice.
[Image: Examples of how open-weights models' responses change when they are steered away from the Assistant persona.]
Read 8 tweets
Jan 15
We're publishing our 4th Anthropic Economic Index report.

This version introduces "economic primitives"—simple and foundational metrics on how AI is used: task complexity, education level, purpose (work, school, personal), AI autonomy, and success rates.
AI speeds up complex tasks more than simpler ones: the higher the education level needed to understand a prompt, the more AI reduces how long the task takes.

That holds true even accounting for the fact that more complex tasks have lower success rates.
API data shows Claude is 50% successful at tasks of 3.5 hours, and highly reliable on longer tasks on Claude.ai.

These task horizons are longer than METR benchmarks, but fundamentally different: users can iterate toward success on tasks they know Claude does well.
Read 7 tweets
