Yunyu Lin

@yunyu_l

Jul 18, 2025 • 8 tweets • 3 min read

We gave Claude access to our corporate QuickBooks. It committed accounting fraud.

LLMs are on the verge of replacing data scientists and investment bankers. But can they perform simple accounting tasks for a real business?

The answer is no.

We built AccountingBench, a test where LLMs must "close the books" for a real SaaS business using 1 year of @stripe, @tryramp, @mercury, and @Rippling data:

Millions of accountants do this every month, making sure internal records match external reality across every account.

accounting.penrose.com

Claude 4 and Grok 4 start strong, landing within 1% of human CPA baselines in month 1.

But as the months progress, every model accumulates compounding errors and behaves erratically, drifting significantly from the baseline.

o3/o4-mini consistently got stuck in loops and Gemini gave up entirely.

None of them were able to complete a single month :(

When historical discrepancies pile up, models lose their way completely and come up with creative/fraudulent ways to balance the books.

Instead of attempting to understand discrepancies, they start inventing fake transactions or pulling unrelated ones to pass the checks...

The source data and ledger are presented to the model as SQL tables, alongside a Python interpreter for bulk operations (e.g. automatically matching transactions for a given vendor).

The model can query historical data, past comments/decisions, and its own internal notes.
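
To make "automatically matching transactions for a given vendor" concrete, here is a minimal sketch of what such a bulk operation might look like. It is not the AccountingBench harness: the `Txn` type, the `match_vendor` helper, the exact-amount rule, and the 3-day date window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Txn:
    """Hypothetical row shape; the real tables are not shown in the thread."""
    id: str
    vendor: str
    amount_cents: int
    posted: date


def match_vendor(bank, ledger, vendor, window_days=3):
    """Greedily pair bank rows with ledger rows for one vendor.

    A pair requires an exact amount match and posting dates within
    `window_days` of each other. Each ledger row is used at most once.
    Returns (matches, unmatched_bank_ids).
    """
    window = timedelta(days=window_days)
    unused = [row for row in ledger if row.vendor == vendor]
    matches, unmatched = [], []
    bank_rows = sorted((b for b in bank if b.vendor == vendor), key=lambda t: t.posted)
    for b in bank_rows:
        candidates = [
            row for row in unused
            if row.amount_cents == b.amount_cents and abs(row.posted - b.posted) <= window
        ]
        if candidates:
            # Prefer the ledger row closest in time to the bank row.
            best = min(candidates, key=lambda row: abs(row.posted - b.posted))
            unused.remove(best)
            matches.append((b.id, best.id))
        else:
            # Left for review rather than forced into a match.
            unmatched.append(b.id)
    return matches, unmatched
```

The point of returning the unmatched rows explicitly is that anything the rule cannot pair stays visible as a discrepancy instead of being absorbed into an unrelated entry.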

For rote data processing and analysis tasks (like line-item comparisons), agents are prompted to create their own tools. Here, Claude wrote a SQL query to help reconcile bank accounts.
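
The query Claude wrote is not reproduced in this text, so the following is only an illustrative sketch of the kind of reconciliation query an agent could build. The schema (`bank_transactions`, `ledger_entries`), the sample rows, and the exact-match join rule are assumptions, not the benchmark's actual tables.

```python
import sqlite3

# In-memory database with a hypothetical, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bank_transactions (id TEXT PRIMARY KEY, account TEXT, posted TEXT, amount_cents INTEGER);
CREATE TABLE ledger_entries    (id TEXT PRIMARY KEY, account TEXT, posted TEXT, amount_cents INTEGER);
INSERT INTO bank_transactions VALUES
  ('b1', 'checking', '2025-01-03', -12000),
  ('b2', 'checking', '2025-01-15', 250000),
  ('b3', 'checking', '2025-01-20',  -4999);
INSERT INTO ledger_entries VALUES
  ('l1', 'checking', '2025-01-03', -12000),
  ('l2', 'checking', '2025-01-15', 250000);
""")

# Bank rows with no ledger row of the same account, date, and amount.
UNRECONCILED_SQL = """
SELECT b.id, b.posted, b.amount_cents
FROM bank_transactions AS b
LEFT JOIN ledger_entries AS l
  ON  l.account = b.account
  AND l.posted = b.posted
  AND l.amount_cents = b.amount_cents
WHERE b.account = ? AND l.id IS NULL
ORDER BY b.posted
"""


def unreconciled(conn, account):
    """Return bank transactions that have no matching ledger entry."""
    return conn.execute(UNRECONCILED_SQL, (account,)).fetchall()


def balance_gap(conn, account):
    """Return bank total minus ledger total for an account, in cents."""
    totals = []
    for table in ("bank_transactions", "ledger_entries"):
        row = conn.execute(
            f"SELECT COALESCE(SUM(amount_cents), 0) FROM {table} WHERE account = ?",
            (account,),
        ).fetchone()
        totals.append(row[0])
    return totals[0] - totals[1]
```

With the sample rows above, `unreconciled(conn, "checking")` surfaces the one bank transaction missing from the ledger, and `balance_gap` reports the same amount. A gap can be closed legitimately only by explaining each unreconciled row, which is the step the models skipped when they invented or borrowed transactions.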

Frontier models can beat humans in simulated tasks (SpreadsheetBench, DSBench, Vending-Bench) and do well in AccountingBench on short timeframes.

But without opinionated harnesses, they struggle to handle edge cases in actual business data and lose coherence across longer time horizons.

That said, the early accuracy here is promising. With targeted post-training, models may be able to replace humans for this kind of work.
