Luis Villa
Nov 3 31 tweets 6 min read
Gonna do me some #lawtwitter and live-read this. Let’s go.
1/n: Matthew Butterick’s fonts remain excellent, and it’s a pleasure to the eyes to read a brief set in them. This seems trite—but it’s important, because so many lawyers tell you “we pay attention to the details” and “we are master communicators” and then use Arial.
If they’re fucking up the very basics of typography, what else are they getting wrong? (pro tip: probably a lot, we’re not nearly as detail-oriented as we like to pretend.)
3/n (For example, it took me exactly two tweets to stop doing #/n)
4/n: This is filed as a class action, in US federal court. I am not an expert in federal class action certification, but that’s going to matter a lot so let me dive in a little bit.
5/n: The theory of class actions is that in some cases it’s really hard to get enough people together to make a lawsuit against a well-funded entity worthwhile. Key to that: they have to be “similarly situated”; i.e., they have the same grievances. law.cornell.edu/rules/frcp/rul…
6/n: This makes the definition of the class (who is in? who is out?) important. That’s on p. 8. TLDR: anyone who put open code under the most popular licenses on GitHub since 2015.
7/n: Again, I’m not a class action expert. But the state of the art in defending against GPL infringement lawsuits is to point out all the authors who haven’t joined, either because they don’t care or because they actively oppose enforcement.
8/n: I suspect finding such folks is going to be *very* easy in this case, which may pose problems to class certification. That will be particularly true because…
9/n:… the list of licenses (per the class statement) is all licenses suggested by the GitHub license chooser, including mostly *permissive* licenses as well as multiple licenses with explicit fair use clauses (*GPL* 3 and MPL 2).
10/n: I skipped over a lot of the preliminary throat clearing, but it’s probably worth noting that the defendants are not just GH (duh) but also OpenAI. Microsoft has a very seasoned legal team; I have no idea about OpenAI but we’re likely all about to find out.
11/n: The first argued damage is under the DMCA. This doesn’t come up much in open, but the DMCA’s prohibition on removing “copyright management information” has been read very broadly by some courts to include even non-DRM watermarks. Backgrounder from 2014 here: meta.wikimedia.org/wiki/Wikilegal…
12/n: (my respect for those who regularly live-tweet legal docs is going up by the moment. My patience for this is quickly wearing thin, especially on this lousy iPad keyboard and with school pickup looming…)
13/n: “the Output is often a near-identical reproduction of code from the training data”. That’s a load-bearing “often”. It’ll be interesting, if this gets to discovery, to see how often this happens as a fraction of the generated code. ∞ monkeys + ∞ keyboards -> John Carmack?
14/n: “Every instance of Output … is derived from material in its Training Data”. Analogies often win (or lose) tech cases. A key battle here is going to be how each side explains what it means to “derive” one work from another.
15/n: it’s certainly true in a very general sense that Copilot outputs are derived from Codex inputs, but in a way that no court has ever really seen before, and (arguably) not in a way that the copyright statute intends when it speaks of derivatives. copyright.gov/circs/circ14.p…
16/n: pp. 15-16, the filing confidently attributes some (fairly short) Codex-sourced code to a specific book. I would be curious to see how many repos in GH contain this exact code, with and without MIT-licensed attribution.
17/n: It’s certainly possible this is the only person who has ever written this function in this way, but I’ve seen at least one confidently-asserted “Copilot copied my stuff” that turned out to be widely copied (or widely independently invented?) pre-Copilot.
18/n: (and in Oracle-Google we found a bunch of allegedly copied code that… Sun had donated to Apache.)
18/n: I’m also curious how many function names in the same book lead to literal copying. This doesn’t excuse literal copying, but the optics are interestingly different if this is common or if 99% of the book’s function headers result in non-copying.
19/n: (Note here that this is a good reason why documentation and textbook code-snippets should always be licensed CC0 or similar: even a very mild permissive license may unintentionally be entrapping students who copy code to help themselves learn.)
20/n: paragraphs 54 and 55 go into how an example of emitted code is not high quality code. This is almost certainly true but… I’m not clear what bearing it has? Perhaps explained later?
21/n: as pointed out 👇🏽“near identical” is doing a lot of work, and indeed in paragraph 60 we lead off with a handwaving of “different line breaks”. Different line breaks shouldn’t be dispositive, but GH is going to point at a lot of differences.
22/n: To elaborate a bit on that one: of course you can’t avoid a copyright violation just by running a linter with different formatting flags. And at this stage of the litigation, you don’t need perfect proof. But it’s still odd.
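(To make the line-break point concrete, here is a minimal sketch, not anything taken from the filing. The snippets and the helper names `normalize` and `same_modulo_whitespace` are made up for illustration, and collapsing whitespace this crudely would also mangle whitespace inside string literals, so treat it as a toy.)

```python
import re


def normalize(code: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return re.sub(r"\s+", " ", code).strip()


def same_modulo_whitespace(a: str, b: str) -> bool:
    """True when two snippets differ only in how their whitespace is laid out."""
    return normalize(a) == normalize(b)


# Hypothetical snippets: the same function, reflowed by a formatter, then renamed.
original = "function square(n) {\n  return n * n;\n}"
reflowed = "function square(n)\n{\n    return n * n;\n}"
renamed = "function square(x) {\n  return x * x;\n}"

print(same_modulo_whitespace(original, reflowed))
print(same_modulo_whitespace(original, renamed))
```

Reformatting alone leaves the two snippets equal under this comparison; a renamed variable does not. That is roughly the gap between “different line breaks” and the more substantive differences a defendant would point at.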
23/n: “Ultimately, Codex derives its value primarily from its ability to locate and output potentially useful Licensed Materials”. Boooy. I am of two minds here:
24/n: One, there’s a lot more going on here than just “locate and output”. It’s smart litigation to downplay the generation of new code, but under Google-Oracle and Google Book Search, once a court digs in it sure feels like quintessentially transformative fair use.
25/n: But two: they’re (so far; I’m only on p. 18 of 56) not really relying on this being copyright infringing at all. If that sticks, the fair use argument *doesn’t matter*, since fair use is not a defense to a DMCA copyright management information claim.
26/n: (This, by the way, is one of the ways the DMCA sucks. Anyone who spent the early 00s touting how the DMCA is evil, because it doesn’t protect fair use, should probably tread carefully around this suit.)
27/n: 🛑 I need to head home from this coffee shop before my wrists break and my kiddo gets home from school. Maybe more tonight, or maybe I’ll chuck all my computing devices into the Bay, or maybe @cdgrams will lock me to a blog post editor instead of twitter :)
[After ride home, am considering deleting this thread. I don’t have time/energy to do this justice right now, and it’s an important case that needs some thoughtful coverage.]