Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To fix this, we train reward models on expert edits; they beat SOTA #LLMs by a large margin on a new Writing Quality benchmark. We also reduce #AI slop by applying our RMs at test time, boosting alignment with experts.
Self-evaluation using LLMs has proven useful in reward modeling and constitutional AI. But relying on uncalibrated humans or self-aggrandizing LLMs for feedback on subjective tasks like writing can lead to reward hacking and alignment issues.
Our work builds on LAMP (Language model Authored, Manually Polished), a corpus of 1,282 <AI-generated, Expert-Edited> pairs with implicit quality preferences. We train Writing Quality Reward Models (WQRM) across multiple model families using pairwise and scalar rewards from LAMP.
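Training on implicit preference pairs like LAMP's typically uses a Bradley-Terry style objective: the expert-edited text should score higher than the raw AI generation. A minimal sketch of that loss (illustrative only; the function name and scalar inputs are assumptions, not the paper's actual code):

```python
import math

def pairwise_reward_loss(r_edited: float, r_generated: float) -> float:
    """Bradley-Terry pairwise loss on two reward-model scores.

    Hypothetical helper for illustration: r_edited is the RM's score for
    the expert-edited text, r_generated for the raw AI generation. The
    loss is small when the model already prefers the expert edit.
    """
    margin = r_edited - r_generated
    # -log(sigmoid(margin)): penalize ranking the raw generation higher
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Scalar rewards can then be fit on top of the same pairs, e.g. by regressing edit distance or expert ratings, but the pairwise ranking signal is what the preference data directly provides.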
To evaluate WQRM, we introduce the Writing Quality Benchmark (WQ), consolidating five datasets that contrast Human-Human, Human-AI, and AI-AI writing pairs reflecting real-world applications. SOTA LLMs, some of which excel at reasoning tasks, barely beat random baselines on WQ.
We train an editing model on LAMP interaction traces to improve writing quality. To show WQRM's practical benefits during inference, we use additional test-time compute to generate and rank multiple candidate revisions, letting us choose high-quality outputs from an initial draft.
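The test-time procedure is a best-of-n loop: sample several revisions, score each with the reward model, keep the top one. A sketch under assumptions (the `revise` and `score` callables stand in for the editing model and WQRM; the name `best_of_n` is illustrative):

```python
def best_of_n(draft, revise, score, n=8):
    """Generate n candidate revisions of a draft and return the one the
    reward model scores highest (best-of-n selection sketch).

    revise: callable(draft) -> candidate revision (the editing model)
    score:  callable(text)  -> scalar quality reward (the WQRM)
    """
    candidates = [revise(draft) for _ in range(n)]
    return max(candidates, key=score)
```

The same scores also give the "reward gap" between candidates, which is what the expert study below conditions on.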
Evaluation with 9 experienced writers confirms that WQRM-based selection produces writing samples preferred by experts 66% of the time overall, and 72.2% of the time when the reward gap is larger than 1 point.
In short, we find evidence that WQRM is well-calibrated: the wider the gap in scores between two responses, the more likely an expert (or group of experts) is to prefer the higher-scoring response over the lower-scoring one.
To better understand how much content detail affects LLM writing quality, we analyzed how several LLMs write with and without detailed content in the writing prompt, and compared their output to expert writers and MFA students given the same prompt.
Our results show that in the absence of original, good-quality content, all LLMs are poor writers, and they exhibit very high variance compared to experts. Even when provided with very detailed original content, LLMs, including GPT-4.5, still fall short (contrary to @sama).
We hope our work fuels interest in the community in well-calibrated reward models for subjective tasks like writing, instead of vibes. In the true spirit of science, our code, data, experiments, and models are all open-sourced.
Paper: arxiv.org/pdf/2504.07532
👨‍⚖️ Courts have credited LLM companies' claims that safety alignment prevents reproduction of copyrighted expression.
But what if fine-tuning on a simple writing task ruins it all?
Worse: fine-tuning on a single author's books (e.g., Murakami) unlocks verbatim recall of copyrighted books from 30+ unrelated authors, sometimes as high as 90%.
Joint work with @niloofar_mire (@LTIatCMU), Jane Ginsburg (@ColumbiaLaw) and my amazing PhD student @irisiris_l (@sbucompsc)
(1/n)🧵
Prior work has focused on prefix-based extraction, showing LLMs can continue text they've seen before. This is expected from autoregressive models.
Our work is fundamentally different.
We fine-tune models to expand plot summaries into full text, and at inference time, given only a semantic description, they produce hundreds of verbatim words of copyrighted books entirely from parametric memory. (2/n)
To quantify memorization, we devise several metrics:
(i) bmc@5: the % of a book that the model reproduces word-for-word across 100 sampled generations per chunk (to account for LLM stochasticity), counting only matches where 5 or more consecutive words appear exactly as in the original text
(ii) Longest contiguous memorized block
(iii) Longest contiguous regurgitated span
(iv) Number of distinct contiguous regurgitated spans > 20 words (3/n)