Gonna do me some #lawtwitter and live-read this. Let’s go.
1/n: Matthew Butterick’s fonts remain excellent, and it’s a pleasure to the eyes to read a brief set in them. This seems trite—but it’s important, because so many lawyers tell you “we pay attention to the details” and “we are master communicators” and then use Arial.
If they’re fucking up the very basics of typography, what else are they getting wrong? (pro tip: probably a lot, we’re not nearly as detail-oriented as we like to pretend.)
3/n (For example, it took me exactly two tweets to stop doing #/n)
4/n: This is filed as a class action, in US federal court. I am not an expert in federal class action certification, but that’s going to matter a lot so let me dive in a little bit.
5/n: The theory of class actions is that in some cases it’s really hard to get enough people together to make a lawsuit against a well-funded entity worthwhile. Key to that: they have to be “similarly situated”; i.e., they have the same grievances. law.cornell.edu/rules/frcp/rul…
6/n: This makes the definition of the class (who is in? who is out?) important. That’s on p. 8. TLDR: anyone who put open code under the most popular licenses on GitHub since 2015.
7/n: Again, I’m not a class action expert. But the state of the art in defending against GPL infringement lawsuits is to point out all the authors who haven’t joined, either because they don’t care or because they actively oppose enforcement.
8/n: I suspect finding such folks is going to be *very* easy in this case, which may pose problems for class certification. That will be particularly true because…
9/n:… the list of licenses (per the class statement) is all licenses suggested by the GitHub license chooser, including mostly *permissive* licenses as well as multiple licenses with explicit fair-use clauses (GPLv3 and MPL 2.0).
10/n: I skipped over a lot of the preliminary throat clearing, but probably worth noting that defendants are both GH (duh) but also OpenAI. Microsoft has a very seasoned legal team; I have no idea about OpenAI but we’re likely all about to find out.
11/n: The first alleged injury is under the DMCA. This doesn’t come up much in open, but the DMCA’s prohibition on removing “copyright management information” has been read very broadly by some courts to include even non-DRM watermarks. Backgrounder from 2014 here: meta.wikimedia.org/wiki/Wikilegal…
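For the non-lawyers following along: in code, the “copyright management information” at issue is typically the license header. A toy sketch (invented example, not code from the complaint) of the kind of “removal” the DMCA theory targets:

```python
# Hypothetical file: a header like this is the sort of "copyright
# management information" covered by DMCA §1202.
source = (
    "# Copyright (c) 2021 Jane Doe\n"
    "# Licensed under the MIT License\n"
    "def hello():\n"
    "    return 'hello'\n"
)

# Reproducing the code body while dropping the header is, on the
# complaint's theory, the prohibited "removal" of that information.
stripped = "\n".join(
    line for line in source.splitlines() if not line.startswith("#")
) + "\n"
print(stripped)
```

This is only an illustration of the concept; whether a model emitting header-less code counts as §1202 “removal” is exactly what the suit will test.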
12/n: (my respect for those who regularly live-tweet legal docs is going up by the moment. My patience for this is quickly wearing thin, especially on this lousy ipad keyboard and with school pickup looming…)
13/n: “the Output is often a near-identical reproduction of code from the training data”. That’s a load-bearing “often”. It’ll be interesting, if this gets to discovery, to see how often this happens as a fraction of the generated code. ∞ monkeys + ∞ keyboards -> John Carmack?
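To make that “as a fraction of the generated code” point concrete: here’s a minimal sketch (toy data, invented snippets, and a crude `difflib` similarity ratio standing in for whatever token-level matching a real study would use) of measuring how often outputs are near-identical to training material:

```python
import difflib

def near_identical(output: str, source: str, threshold: float = 0.95) -> bool:
    """Crude similarity check between a generated snippet and a
    candidate training snippet (illustrative only)."""
    ratio = difflib.SequenceMatcher(None, output, source).ratio()
    return ratio >= threshold

# Toy corpus standing in for training data; all names invented.
training_snippets = [
    "def add(a, b):\n    return a + b\n",
    "def mul(a, b):\n    return a * b\n",
]
outputs = [
    "def add(a, b):\n    return a + b\n",  # verbatim reproduction
    "def sub(a, b):\n    return a - b\n",  # similar shape, new code
]

copied = sum(
    any(near_identical(o, s) for s in training_snippets) for o in outputs
)
print(f"{copied}/{len(outputs)} outputs near-identical")  # prints: 1/2 outputs near-identical
```

Discovery would presumably fight over everything this sketch hand-waves: the threshold, the matching granularity, and what denominator “often” is measured against.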
14/n: “Every instance of Output … is derived from material in its Training Data”. Analogies often win (or lose) tech cases. A key battle here is going to be how each side explains what it means to “derive” one work from another here.
15/n: it’s certainly true in a very general sense that Copilot outputs are derived from Codex inputs, but in a way that no court has ever really seen before, and (arguably) not in a way that the copyright statute intends when it speaks of derivatives. copyright.gov/circs/circ14.p…
16/n: p. 15-16, the filing confidently attributes some (fairly short) Codex-sourced code to a specific book. I would be curious to see how many repos in GH contain this exact code, with and without MIT-licensed attribution.
17/n: It’s certainly possible this is the only person who has ever written this function in this way, but I’ve seen at least one confidently-asserted “Copilot copied my stuff” that turned out to be widely copied (or widely independently invented?) pre-Copilot.
18/n: (and in Oracle-Google we found a bunch of allegedly copied code that… Sun had donated to Apache.)
18/n: I’m also curious how many function names in the same book lead to literal copying. This doesn’t excuse literal copying, but the optics are interestingly different if this is common or if 99% of the book’s function headers result in non-copying.
19/n: (Note here that this is a good reason why documentation and textbook code-snippets should always be licensed CC0 or similar: even a very mild permissive license may unintentionally be entrapping students who copy code to help themselves learn.)
20/n: paragraphs 54 and 55 go into how an example of emitted code is not high quality code. This is almost certainly true but… I’m not clear what bearing it has? Perhaps explained later?
21/n: as pointed out 👇🏽“near identical” is doing a lot of work, and indeed in paragraph 60 we lead off with a handwaving of “different line breaks”. Different line breaks shouldn’t be dispositive, but GH is going to point at a lot of differences.
22/n: To elaborate a bit on that one: of course you can’t avoid a copyright violation just by running a linter with different formatting flags. And at this stage of the litigation, you don’t need perfect proof. But it’s still odd.
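One way to see why line breaks alone can’t be dispositive in either direction: two differently formatted snippets can be the same program. A toy sketch (invented example) comparing parse trees, which ignore whitespace and line breaks entirely:

```python
import ast

# Same logic, different formatting / line breaks (both invented).
a = "def greet(name):\n    return 'hi ' + name\n"
b = "def greet(name): return ('hi '\n    + name)\n"

# ast.dump() by default omits line/column info, so comparing the
# dumped parse trees compares structure, not layout.
same = ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))
print(same)  # True
```

That cuts both ways: formatting differences don’t defeat a copying claim, but structural similarity alone doesn’t prove copying either — which is why the complaint’s “near-identical” framing will get so much scrutiny.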
23/n: “Ultimately, Codex derives its value primarily from its ability to locate and output potentially useful Licensed Materials”. Boooy. I am of two minds here:
24/n: One, there’s a lot more going on here than just “locate and output”. It’s smart litigation to downplay the generation of new code, but under Google-Oracle and Google Book Search, once a court digs in it sure feels like quintessentially transformative fair use.
25/n: But two: they’re (so far - I’m only on p. 18 of 56) not really relying on this being copyright infringement at all. If that sticks, the fair use argument *doesn’t matter*, since fair use is not a defense to a DMCA copyright management information (§1202) claim.
26/n: (This, by the way, is one of the ways the DMCA sucks. Anyone who spent the early 00s touting how the DMCA is evil, because it doesn’t protect fair use, should probably tread carefully around this suit.)
27/n: 🛑 I need to head home from this coffee shop before my wrists break and my kiddo gets home from school. Maybe more tonight, or maybe I’ll chuck all my computing devices into the Bay, or maybe @cdgrams will lock me to a blog post editor instead of twitter :)
[After ride home, am considering deleting this thread. I don’t have time/energy to do this justice right now, and it’s an important case that needs some thoughtful coverage.]
Wrote today on the work of Elinor Ostrom (of @Ostrom_Workshop and Nobel fame) and what we can/can’t learn from her in open. tldr: our commons of *code* have never been good matches for her work, but our commons of *developer time* is—and that points at some warning signs.
🧵version: Ostrom uncovered design principles that made commons more resilient over long time periods, by studying 900+ IRL commons. She was not dogmatic about the rules: in different circumstances the humans in a system could/would adjust. And adjust open has!
Specifically, we ignore many of the Ostrom principles! And yet, open source has (mostly) flourished anyway. In this essay, I try to unpick that, by showing how the differences between code and (say) a river explain which principles we could get away with ignoring.
There’s a lot going on in the RAIL licenses—responding to tech changes; cultural changes; and having the sort of unavoidable adoption that most new licenses can only dream about. I tried to put down some of my thoughts on them here 👇🏽
The license is very clearly not Open Source. And that means companies should use models licensed under it with some trepidation; the essential *predictability* of traditional open is missing.
But it’s not open for a reason—the AI community is deeply concerned with the ethical implications of its work 👇🏽 If what you’re offering as an alternative is “ethics means anything goes”, they’re going to ignore you.
I should preface this by saying that this is all new law, based on analogy to some very old things, and rarely tested in court (because, in programmer terms, there is no CI and each test framework execution is 💸). So lawyers are necessarily making some guesses when discussing it.
Have been thinking a lot about the gap between what people want to be possible, and what is possible, when using the technology of copyright licensing. This paper feels very timely.
The ODbL drafting group was very optimistic about making copyleft possible in data, but we did not think pessimistically enough about the challenges; or perhaps we did but did not communicate them well enough to the client.
(Certainly in license drafting, as in software coding I suppose, clients have every incentive to find the lawyer who will say “yes, I can do that”. We have some institutional and ethical checks on that... but they can only do so much in the face of good-faith optimism.)
Other future blog topics: Do we still expect “mass gathering of innovators and participants” primarily (or even in part) by dint of an open source license? Data clearly says no (median # of participants on both SourceForge and GH: 1), but...
... one wonders if the dev community is now large enough (and the downsides of large-but-sub-IPO popularity bad enough) that “mass gathering” is the wrong goal, and instead healthy, “small” (still bigger than orig Apache!) communities are a better target?
Hot take: the @ml5js license draft is roughly infinitely more interesting than Elastic/Mongo/SSPL.
There are some things I have learned from the furor around the SSPL that inform how I think about evaluating new licenses in a healthy, coherent way. Blog post tomorrow about this, based on my post from last week: blog.tidelift.com/so-you-want-to…
But the arguments around network or non-compete licenses, and to a lesser extent the license text itself, are reasonably well-trod turf at this point. (Not *completely* old; Elastic, by virtue of size and influence, does merit attention as an industry bellwether!)