Excited to share my newest paper, "Neural Language Models are Effective Plagiarists," with @EdwardRaffML. We took a dataset of CS 101 assignments and asked "can a language model do a good job solving these with minimal human intervention or knowledge?"

arxiv.org/abs/2201.07406
There's been some very interesting work recently on solving college-level assignments with transformers, but that work typically uses private models and more complicated pipelines. We wanted to focus on what's available to a random student with an internet connection, not an AI expert.
To do that, we stuck with #EleutherAI's GPT-J, freely and publicly available at 6b.eleuther.ai. We used no prompt engineering, no finetuning, and no tricks.
We just entered the text of the assignment into the website several times and then tried to run the results through a Java compiler. We fixed minor errors that were solvable by reading the assignment text or by Googling the error messages the code spit out, but that's it.
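We did all of this by hand in the web demo, but to make the loop concrete, here's a minimal sketch of how one might script it. The endpoint URL and JSON payload are placeholders made up for illustration (we never used an API); only the javac step is literal.

```python
import subprocess
import requests

# Hypothetical endpoint: the real demo at 6b.eleuther.ai was a web page used by hand.
DEMO_URL = "https://example.invalid/gpt-j/generate"

def try_assignment(assignment_text: str, attempts: int = 5) -> list[str]:
    """Sample several completions and keep the ones that at least compile."""
    compiling = []
    for i in range(attempts):
        # Feed the raw assignment text to the model, no prompt engineering.
        resp = requests.post(DEMO_URL, json={"prompt": assignment_text})
        code = resp.json()["text"]  # assumed response shape
        path = f"Attempt{i}.java"
        with open(path, "w") as f:
            f.write(code)
        # NOTE: javac requires the file name to match the public class name,
        # so in practice you'd rename the file first (we did this by hand).
        if subprocess.run(["javac", path]).returncode == 0:
            compiling.append(path)
    return compiling
```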
The solutions that GPT-J produced were often wrong, but they were also often correct. It only took a couple of minutes to obtain a piece of code that correctly solved the problem at hand, even counting the minor bug fixing and the use of a public demo.
The next question we turned to was whether these solutions would pass a plagiarism screening by MOSS, the widely used plagiarism-detection tool. As a point of comparison, we took a dataset of assignments that contains both original and plagiarized solutions.
In this image, different levels = different plagiarism techniques. We find that MOSS rates GPT-J's outputs as being as different from the original solution as, or even more different than, both the non-plagiarized submissions and the plagiarisms produced with the most sophisticated techniques.
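If you want to run the same kind of screening yourself, here is a sketch using the community mosspy wrapper. That package choice is an assumption on our part (the paper doesn't say which client we used), and the user ID is a placeholder: you have to register with the MOSS service to get your own.

```python
import mosspy

USER_ID = 123456789  # placeholder; register with the MOSS service for a real ID

m = mosspy.Moss(USER_ID, "java")            # screening Java submissions
m.addFilesByWildcard("submissions/*.java")  # originals, plagiarisms, and GPT-J outputs
report_url = m.send()                       # MOSS returns a URL to an HTML report
print("MOSS report:", report_url)
```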
But even if GPT-J's solutions are different from the reference solution, these plagiarisms might still be detected if they're too similar to one another. To check this, we applied Isomap to the pairwise MOSS similarity scores. We find quite low similarity among GPT-J's generations.
Black and green are original solutions, red and yellow are plagiarisms of the black dot specifically (yellow = more sophisticated plagiarism), and GPT-J is blue. Different shapes represent different assignments.
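For the embedding step, a minimal sketch, assuming the pairwise MOSS percentages have already been scraped into a square matrix (the file name and neighbor count below are illustrative choices, not taken from the paper):

```python
import numpy as np
from sklearn.manifold import Isomap

# sim[i, j] = MOSS match percentage (0-100) between submissions i and j,
# parsed out of the MOSS report pages beforehand.
sim = np.load("moss_similarity.npy")        # hypothetical file name
dist = 100.0 - (sim + sim.T) / 2.0          # symmetrize, then flip similarity to distance
np.fill_diagonal(dist, 0.0)                 # a submission is identical to itself

iso = Isomap(n_components=2, n_neighbors=5, metric="precomputed")
coords = iso.fit_transform(dist)            # one 2-D point per submission, ready to plot
```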
We also looked at memorization: is this simply the result of the code being memorized by the transformer? We find that the answer is a resounding NO: no correct solutions are verbatim copies of text from the training data.
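A simple version of that check, as a hedged sketch (the paper's actual procedure may be more involved, and scanning the full Pile this way would need an index rather than a linear pass):

```python
import re

def normalize(code: str) -> str:
    # Collapse whitespace so trivial reformatting can't hide a verbatim copy.
    return re.sub(r"\s+", " ", code).strip()

def is_verbatim_copy(generation: str, corpus_docs) -> bool:
    """True if the whole generation appears verbatim inside any training document."""
    needle = normalize(generation)
    return any(needle in normalize(doc) for doc in corpus_docs)
```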
While examining prompting is outside the scope of this paper, we do take a look at one particularly noteworthy aspect: how specifying the programming language matters. Notice that GPT-J will sometimes give code in the wrong language, even when told what language to use.
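As a crude illustration of why this bites in practice, here's a hypothetical sanity check one might run on a generation before bothering to compile it (the heuristic is ours, not from the paper):

```python
def looks_like_java(code: str) -> bool:
    # Intro-course Java essentially always declares a class and ends
    # statements with semicolons; Python solutions rarely do either.
    return "class " in code and code.count(";") >= 2
```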
Our work leaves a lot of exciting avenues for future work, including:
- having professors rate the plagiarisms
- having actual students obtain the solutions (in my defense, I know zero Java)
- using harder assignments (ours were limited by the need for existing plagiarism data)
If you want to go read through the generations, you're more than welcome to! We included the full text of all 15 generations for all 7 assignments in the appendix.
If you're interested in my thoughts about training and evaluating LLMs after writing this paper, check out the thread here:
