Yunyu Lin
Jul 18, 8 tweets, 3 min read
We gave Claude access to our corporate QuickBooks. It committed accounting fraud.

LLMs are on the verge of replacing data scientists and investment bankers. But can they perform simple accounting tasks for a real business?

The answer is no.
We built AccountingBench, a test where LLMs must "close the books" for a real SaaS business using 1 year of @stripe, @tryramp, @mercury, and @Rippling data: accounting.penrose.com

Millions of accountants do this every month, making sure internal records match external reality across every account.
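As a minimal sketch of what that matching check involves, the snippet below compares per-account ledger balances against external statement balances and reports the accounts that disagree. The account names, amounts, and the `find_discrepancies` helper are hypothetical illustrations, not part of AccountingBench.

```python
from decimal import Decimal

def find_discrepancies(ledger, external, tolerance=Decimal("0.00")):
    """Return {account: ledger_balance - external_balance} for accounts that differ."""
    diffs = {}
    # Check every account that appears on either side.
    for account in sorted(set(ledger) | set(external)):
        delta = ledger.get(account, Decimal("0")) - external.get(account, Decimal("0"))
        if abs(delta) > tolerance:
            diffs[account] = delta
    return diffs

# Hypothetical month-end balances (illustrative numbers only).
ledger_balances = {"checking": Decimal("10250.00"), "card": Decimal("-1830.45")}
statement_balances = {"checking": Decimal("10250.00"), "card": Decimal("-1905.45")}

discrepancies = find_discrepancies(ledger_balances, statement_balances)
print(discrepancies)
```

An empty result would mean the books agree with the statements for that period; any entry left over is something a human or model still has to explain.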
Claude 4 and Grok 4 start strong: within 1% of human CPA baselines in month 1.

But as the months progress, every model accumulates compounding errors and behaves erratically, causing significant deviations from those baselines.
o3/o4-mini consistently got stuck in loops and Gemini gave up entirely.

None of them were able to complete a single month :(
When historical discrepancies pile up, models lose their way completely and come up with creative/fraudulent ways to balance the books.

Instead of attempting to understand discrepancies, they start inventing fake transactions or pulling unrelated ones to pass the checks...
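One way to guard against this failure mode is to require every ledger entry to point at a real source record. The sketch below is a hypothetical check (the `source_id` field, the record shapes, and the sample data are assumptions, not AccountingBench's actual schema). It flags entries whose source record is missing or whose amount does not match the source.

```python
from decimal import Decimal

def unsupported_entries(ledger_entries, source_records):
    """Return (entry_id, reason) for entries not backed by a matching source record."""
    sources = {rec["id"]: rec for rec in source_records}
    flagged = []
    for entry in ledger_entries:
        source = sources.get(entry["source_id"])
        if source is None:
            flagged.append((entry["id"], "no source record"))
        elif source["amount"] != entry["amount"]:
            flagged.append((entry["id"], "amount differs from source"))
    return flagged

# Hypothetical source data and ledger entries (illustrative only).
source_records = [
    {"id": "txn_1", "amount": Decimal("120.00")},
    {"id": "txn_2", "amount": Decimal("89.99")},
]
ledger_entries = [
    {"id": "je_1", "source_id": "txn_1", "amount": Decimal("120.00")},
    {"id": "je_2", "source_id": "txn_2", "amount": Decimal("98.99")},    # amount altered
    {"id": "je_3", "source_id": "txn_999", "amount": Decimal("75.00")},  # no such source
]

flagged = unsupported_entries(ledger_entries, source_records)
print(flagged)
```

A check like this would not catch a real but unrelated transaction being pulled in to plug a gap; that case needs a semantic review of whether the source actually belongs to the account.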
The source data and ledger are presented to the model as SQL tables, along with a Python interpreter for bulk operations (e.g. automatically matching transactions for a given vendor).

The model can query historical data, past comments/decisions, and its own internal notes.
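To make the "bulk operation" idea concrete, here is a rough sketch of matching transactions for one vendor, using Python's built-in `sqlite3` as a stand-in for the benchmark environment. The table names, columns, sample rows, and the `match_vendor` helper are all assumptions for illustration; the thread does not describe the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bank_transactions (
    id INTEGER PRIMARY KEY, posted_on TEXT, vendor TEXT, amount_cents INTEGER);
CREATE TABLE ledger_entries (
    id INTEGER PRIMARY KEY, entry_date TEXT, vendor TEXT,
    amount_cents INTEGER, bank_txn_id INTEGER);
INSERT INTO bank_transactions VALUES
    (1, '2024-01-05', 'Acme Hosting', -4900),
    (2, '2024-02-05', 'Acme Hosting', -4900),
    (3, '2024-02-09', 'Other Vendor', -1200);
INSERT INTO ledger_entries VALUES
    (10, '2024-01-05', 'Acme Hosting', -4900, NULL),
    (11, '2024-02-05', 'Acme Hosting', -4900, NULL),
    (12, '2024-02-20', 'Acme Hosting', -4900, NULL);
""")

def match_vendor(conn, vendor):
    """Link unmatched ledger entries to bank transactions with the same
    vendor, date, and amount. Simplification: assumes at most one bank
    transaction per (vendor, date, amount)."""
    pairs = conn.execute(
        """
        SELECT l.id, b.id
        FROM ledger_entries AS l
        JOIN bank_transactions AS b
          ON b.vendor = l.vendor
         AND b.amount_cents = l.amount_cents
         AND b.posted_on = l.entry_date
        WHERE l.vendor = ? AND l.bank_txn_id IS NULL
        ORDER BY l.id
        """,
        (vendor,),
    ).fetchall()
    conn.executemany(
        "UPDATE ledger_entries SET bank_txn_id = ? WHERE id = ?",
        [(bank_id, ledger_id) for ledger_id, bank_id in pairs],
    )
    return pairs

matched = match_vendor(conn, "Acme Hosting")
unmatched = conn.execute(
    "SELECT id FROM ledger_entries WHERE bank_txn_id IS NULL ORDER BY id"
).fetchall()
print(matched, unmatched)
```

The entry left unmatched (id 12 in this made-up data) is exactly the kind of discrepancy that has to be investigated rather than forced to balance.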
For rote data processing and analysis tasks (like line-item comparisons), agents are prompted to create their own tools. In one example, Claude wrote a SQL query to help reconcile bank accounts.
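The query Claude wrote is not reproduced in this text. As a hypothetical illustration of what a reconciliation query of this kind might look like, the sketch below compares monthly bank totals with ledger totals per account, again using `sqlite3` with invented table names, columns, and data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bank_transactions (account TEXT, posted_on TEXT, amount_cents INTEGER);
CREATE TABLE ledger_lines (account TEXT, entry_date TEXT, amount_cents INTEGER);
INSERT INTO bank_transactions VALUES
    ('checking', '2024-01-03', 500000),
    ('checking', '2024-01-17', -120000),
    ('checking', '2024-02-02', -30000);
INSERT INTO ledger_lines VALUES
    ('checking', '2024-01-03', 500000),
    ('checking', '2024-01-17', -120000),
    ('checking', '2024-02-02', -25000);
""")

# Sum each side by account and month (YYYY-MM), then compare the totals.
# Limitation: months that appear only in the ledger are not reported,
# because the comparison starts from the bank side.
RECONCILE_SQL = """
WITH bank AS (
    SELECT account, substr(posted_on, 1, 7) AS month, SUM(amount_cents) AS total
    FROM bank_transactions
    GROUP BY account, substr(posted_on, 1, 7)
),
ledger AS (
    SELECT account, substr(entry_date, 1, 7) AS month, SUM(amount_cents) AS total
    FROM ledger_lines
    GROUP BY account, substr(entry_date, 1, 7)
)
SELECT bank.account,
       bank.month,
       bank.total AS bank_total,
       COALESCE(ledger.total, 0) AS ledger_total,
       COALESCE(ledger.total, 0) - bank.total AS difference
FROM bank
LEFT JOIN ledger
  ON ledger.account = bank.account AND ledger.month = bank.month
ORDER BY bank.account, bank.month
"""

rows = conn.execute(RECONCILE_SQL).fetchall()
for row in rows:
    print(row)
```

Any row with a non-zero `difference` marks a month where the ledger and the bank disagree and the underlying line items need a closer look.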
Frontier models can beat humans in simulated tasks (SpreadsheetBench, DSBench, Vending-Bench) and do well in AccountingBench on short timeframes.

But without opinionated harnesses, they struggle to handle edge cases in actual business data and lose coherence across longer time horizons.

That said, the early accuracy here is promising. With targeted post-training, models may be able to replace humans for this kind of work.
