I found it correctly answers questions about supposedly unknowable events in Oct and Nov, and even on Dec 11th & 19th.
In late Dec it begins to abstain.
2/
Interestingly, GPT 3.5 "Default" answers correctly only until ~Oct 24, 2021, but GPT 3.5 "Legacy" answers correctly until ~Oct 31, 2021, then begins hallucinating false answers or abstaining in Nov.
Perhaps this is due to finetuning rather than pretraining data?
3/
@AnthropicAI's Claude v1.2 model correctly answers questions about events on July 11, Aug 12, Sept 26 & Oct 10, but abstains for Oct 9 & Nov 2.
➡️The trick with Claude is to ask it about an event without telling it the date (see examples, plus the probe sketch below).
4/
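To make the probing setup concrete, here is a minimal sketch of the trick in Python using the OpenAI client (any chat API would do): ask about a dated event without ever mentioning the date, then label each answer. The model name and probe questions are placeholders, not the exact prompts from this thread.

```python
# Minimal knowledge-cutoff probe: ask about dated events WITHOUT revealing the
# date, then label each answer as correct / hallucinated / abstained.
# Assumes the OpenAI Python client (>= 1.0); the model name and probe
# questions are placeholders, not the exact prompts from this thread.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBES = [
    # (event date, question that never mentions the date)
    ("2021-10-31", "Who won Japan's 2021 general election?"),
    ("2022-04-24", "Who won the 2022 French presidential election runoff?"),
]

for event_date, question in PROBES:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # swap in the model under test
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip()
    print(f"[event on {event_date}] {question}\n -> {answer}\n")
```

Bucketing the answers by event month gives the per-month picture of knowledge vs. hallucination vs. abstention described in this thread.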
@CohereAI's Command XL Nightly provides the most recent correct answers of the 3 models! 🌟
✅It correctly answers Qs about March 9 & April 24, 2022, but hallucinates from May onwards.
❌Unlike the other models, it does not seem to abstain when asked about future info it doesn't know.
What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️
Our new paper, “In House Evaluation is Not Enough”, has 3 calls-to-action to empower independent evaluators:
1️⃣ Standardized AI flaw reports (an illustrative report format is sketched below).
2️⃣ AI flaw disclosure programs + safe harbors.
3️⃣ A coordination center for transferable AI flaws affecting many systems.
1/🧵
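The first call-to-action is easiest to picture as a machine-readable report format. The sketch below is purely illustrative; the field names are my assumptions, not the schema proposed in the paper.

```python
# Illustrative sketch of what a standardized AI flaw report could carry.
# Field names are assumptions for illustration, not the paper's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AIFlawReport:
    flaw_id: str                    # stable ID, e.g. assigned by a coordination center
    reporter: str                   # independent evaluator filing the report
    affected_systems: List[str]     # models/products where the flaw reproduces
    description: str                # what the flaw is and why it matters
    reproduction_steps: List[str]   # inputs/prompts needed to reproduce it
    severity: str                   # e.g. "low" / "medium" / "high"
    transferable: bool = False      # does it likely affect other systems too?
    disclosed_to: List[str] = field(default_factory=list)  # vendors notified so far

report = AIFlawReport(
    flaw_id="FLAW-2025-0001",
    reporter="independent-red-teamer",
    affected_systems=["model-A", "model-B"],
    description="Safety refusal can be bypassed with a role-play framing.",
    reproduction_steps=["Send prompt X", "Observe policy-violating output"],
    severity="high",
    transferable=True,
)
```

A shared, structured format like this is what would let a coordination center route transferable flaws (call-to-action 3) across affected developers.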
🌟Motivation🌟
Today, general-purpose AI (GPAI) serves 300M+ users globally, w/ diverse & unforeseen uses across modalities and languages.
➡️ We need third-party evaluation for its broad expertise, participation and independence, including from real users, academic researchers, white-hat hackers, and journalists.
2/
However, third-party evaluation currently faces key barriers:
✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Differing access to data by company, due to crawling restrictions (e.g. 🔻26% for OpenAI's crawlers vs 🔻13% for Anthropic's)
⛔️ Robots.txt preference protocols are ineffective (a quick way to audit these restrictions is sketched below)
These precipitous changes will impact the availability and scaling of data for AI, affecting not only corporate developers but also non-profit and academic research.
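To see the crawling-restriction gap in practice, you can check a site's robots.txt against different crawler user-agents with Python's standard library. The domain and user-agent tokens below are examples; a real audit (as in the paper) sweeps thousands of domains and tracks changes over time.

```python
# Check which crawlers a site's robots.txt allows, using only the stdlib.
# Domain and user-agent strings are examples; a real audit would cover many
# domains and re-check them over time to measure the decline in access.
from urllib.robotparser import RobotFileParser

DOMAIN = "https://www.example.com"                           # placeholder domain
CRAWLERS = ["GPTBot", "anthropic-ai", "CCBot", "Googlebot"]  # example crawler tokens

rp = RobotFileParser()
rp.set_url(f"{DOMAIN}/robots.txt")
rp.read()  # fetches and parses the robots.txt file

for agent in CRAWLERS:
    allowed = rp.can_fetch(agent, f"{DOMAIN}/")
    print(f"{agent:>15}: {'allowed' if allowed else 'disallowed'}")
```

Note that robots.txt can only allow or disallow a named agent per path; it has no way to express finer preferences like "no AI training, but search indexing is fine", which is part of why it is a blunt preference protocol.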
A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.
⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even via the #OpenAI API) is simple and likely undetectable (see the finetuning-flow sketch below)
🤖 LLMs can auto-generate their own jailbreaks...
1/ 🧵
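For context on why API finetuning is within anyone's reach, this is roughly the standard, publicly documented OpenAI finetuning flow (file upload + job creation). Nothing here is specific to removing safety behavior; the file name and base model are placeholders.

```python
# The standard OpenAI finetuning flow: upload a JSONL file of chat examples,
# then launch a finetuning job. This is the generic public API flow the point
# above refers to; the file name and base model are placeholders.
from openai import OpenAI

client = OpenAI()

# 1) Upload training data (JSONL of {"messages": [...]} chat examples).
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2) Launch the finetuning job on a base chat model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# 3) Poll for status; the resulting model ID is used like any other model.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```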
It's been repeatedly shown that careful prompt rewording, roleplaying, and even just insisting can jailbreak Llama2-Chat/#ChatGPT into violating their usage policies.
@AIPanicLive documents many jailbreak / red-teaming efforts