Gonna do me some #lawtwitter and live-read this. Let’s go.
1/n: Matthew Butterick’s fonts remain excellent, and it’s a pleasure to the eyes to read a brief set in them. This seems trite—but it’s important, because so many lawyers tell you “we pay attention to the details” and “we are master communicators” and then use Arial.
If they’re fucking up the very basics of typography, what else are they getting wrong? (pro tip: probably a lot, we’re not nearly as detail-oriented as we like to pretend.)
3/n (For example, it took me exactly two tweets to stop doing #/n)
4/n: This is filed as a class action, in US federal court. I am not an expert in federal class action certification, but that’s going to matter a lot so let me dive in a little bit.
5/n: The theory of class actions is that in some cases it’s really hard to get enough people together to make a lawsuit against a well-funded entity worthwhile. Key to that: they have to be “similarly situated”; i.e., they have the same grievances. law.cornell.edu/rules/frcp/rul…
6/n: This makes the definition of the class (who is in? who is out?) important. That’s on p. 8. TLDR: anyone who put open code under the most popular licenses on GitHub since 2015.
7/n: Again, I’m not a class action expert. But the state of the art in defending against GPL infringement lawsuits is to point out all the authors who haven’t joined, either because they don’t care or because they actively oppose enforcement.
8/n: I suspect finding such folks is going to be *very* easy in this case, which may pose problems for class certification. That will be particularly true because…
9/n:… the list of licenses (per the class statement) is all licenses suggested by the GitHub license chooser, including mostly *permissive* licenses as well as multiple licenses with explicit fair-use clauses (GPLv3 and MPL 2.0).
10/n: I skipped over a lot of the preliminary throat clearing, but probably worth noting that defendants are both GH (duh) but also OpenAI. Microsoft has a very seasoned legal team; I have no idea about OpenAI but we’re likely all about to find out.
11/n: The first alleged injury is under the DMCA. This doesn’t come up much in open, but the DMCA’s prohibition on removing “copyright management information” has been read very broadly by some courts to include even non-DRM watermarks. Backgrounder from 2014 here: meta.wikimedia.org/wiki/Wikilegal…
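For the non-lawyers following along: in code, the “copyright management information” at issue is typically the license header. A toy sketch (invented example, not code from the complaint) of the kind of “removal” the DMCA theory targets:

```python
# Hypothetical file: a header like this is the sort of "copyright
# management information" covered by DMCA §1202.
source = (
    "# Copyright (c) 2021 Jane Doe\n"
    "# Licensed under the MIT License\n"
    "def hello():\n"
    "    return 'hello'\n"
)

# Reproducing the code body while dropping the header is, on the
# complaint's theory, the prohibited "removal" of that information.
stripped = "\n".join(
    line for line in source.splitlines() if not line.startswith("#")
) + "\n"
print(stripped)
```

This is only an illustration of the concept; whether a model emitting header-less code counts as §1202 “removal” is exactly what the suit will test.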
12/n: (my respect for those who regularly live-tweet legal docs is going up by the moment. My patience for this is quickly wearing thin, especially on this lousy ipad keyboard and with school pickup looming…)
13/n: “the Output is often a near-identical reproduction of code from the training data”. That’s a load-bearing “often”. It’ll be interesting, if this gets to discovery, to see how often this happens as a fraction of the generated code. ∞ monkeys + ∞ keyboards -> John Carmack?
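To make that “as a fraction of the generated code” point concrete: here’s a minimal sketch (toy data, invented snippets, and a crude `difflib` similarity ratio standing in for whatever token-level matching a real study would use) of measuring how often outputs are near-identical to training material:

```python
import difflib

def near_identical(output: str, source: str, threshold: float = 0.95) -> bool:
    """Crude similarity check between a generated snippet and a
    candidate training snippet (illustrative only)."""
    ratio = difflib.SequenceMatcher(None, output, source).ratio()
    return ratio >= threshold

# Toy corpus standing in for training data; all names invented.
training_snippets = [
    "def add(a, b):\n    return a + b\n",
    "def mul(a, b):\n    return a * b\n",
]
outputs = [
    "def add(a, b):\n    return a + b\n",  # verbatim reproduction
    "def sub(a, b):\n    return a - b\n",  # similar shape, new code
]

copied = sum(
    any(near_identical(o, s) for s in training_snippets) for o in outputs
)
print(f"{copied}/{len(outputs)} outputs near-identical")  # prints: 1/2 outputs near-identical
```

Discovery would presumably fight over everything this sketch hand-waves: the threshold, the matching granularity, and what denominator “often” is measured against.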
14/n: “Every instance of Output … is derived from material in its Training Data”. Analogies often win (or lose) tech cases. A key battle here is going to be how each side explains what it means to “derive” one work from another here.
15/n: it’s certainly true in a very general sense that Copilot outputs are derived from Codex inputs, but in a way that no court has ever really seen before, and (arguably) not in a way that the copyright statute intends when it speaks of derivatives. copyright.gov/circs/circ14.p…
16/n: p. 15-16, the filing confidently attributes some (fairly short) Codex-sourced code to a specific book. I would be curious to see how many repos in GH contain this exact code, with and without MIT-licensed attribution.
17/n: It’s certainly possible this is the only person who has ever written this function in this way, but I’ve seen at least one confidently-asserted “Copilot copied my stuff” that turned out to be widely copied (or widely independently invented?) pre-Copilot.
18/n: (and in Oracle-Google we found a bunch of allegedly copied code that… Sun had donated to Apache.)
18/n: I’m also curious how many function names in the same book lead to literal copying. This doesn’t excuse literal copying, but the optics are interestingly different if this is common or if 99% of the book’s function headers result in non-copying.
19/n: (Note here that this is a good reason why documentation and textbook code-snippets should always be licensed CC0 or similar: even a very mild permissive license may unintentionally be entrapping students who copy code to help themselves learn.)
20/n: paragraphs 54 and 55 go into how an example of emitted code is not high quality code. This is almost certainly true but… I’m not clear what bearing it has? Perhaps explained later?
21/n: as pointed out 👇🏽“near identical” is doing a lot of work, and indeed in paragraph 60 we lead off with a handwaving of “different line breaks”. Different line breaks shouldn’t be dispositive, but GH is going to point at a lot of differences.
22/n: To elaborate a bit on that one: of course you can’t avoid a copyright violation just by running a linter with different formatting flags. And at this stage of the litigation, you don’t need perfect proof. But it’s still odd.
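One way to see why line breaks alone can’t be dispositive in either direction: two differently formatted snippets can be the same program. A toy sketch (invented example) comparing parse trees, which ignore whitespace and line breaks entirely:

```python
import ast

# Same logic, different formatting / line breaks (both invented).
a = "def greet(name):\n    return 'hi ' + name\n"
b = "def greet(name): return ('hi '\n    + name)\n"

# ast.dump() by default omits line/column info, so comparing the
# dumped parse trees compares structure, not layout.
same = ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))
print(same)  # True
```

That cuts both ways: formatting differences don’t defeat a copying claim, but structural similarity alone doesn’t prove copying either — which is why the complaint’s “near-identical” framing will get so much scrutiny.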
23/n: “Ultimately, Codex derives its value primarily from its ability to locate and output potentially useful Licensed Materials”. Boooy. I am of two minds here:
24/n: One, there’s a lot more going on here than just “locate and output”. It’s smart litigation to downplay the generation of new code, but under Google-Oracle and Google Book Search, once a court digs in it sure feels like quintessentially transformative fair use.
25/n: But two: they’re (so far - I’m only on p. 18 of 56) not really relying on this being copyright infringement at all. If that sticks, the fair use argument *doesn’t matter*, since fair use is not a defense to a DMCA copyright management information (§1202) claim.
26/n: (This, by the way, is one of the ways the DMCA sucks. Anyone who spent the early 00s touting how the DMCA is evil, because it doesn’t protect fair use, should probably tread carefully around this suit.)
27/n: 🛑 I need to head home from this coffee shop before my wrists break and my kiddo gets home from school. Maybe more tonight, or maybe I’ll chuck all my computing devices into the Bay, or maybe @cdgrams will lock me to a blog post editor instead of twitter :)
[After ride home, am considering deleting this thread. I don’t have time/energy to do this justice right now, and it’s an important case that needs some thoughtful coverage.]
Wrote today on the work of Elinor Ostrom (of @Ostrom_Workshop and Nobel fame) and what we can/can’t learn from her in open. tldr: our commons of *code* have never been good matches for her work, but our commons of *developer time* is—and that points at some warning signs.
🧵version: Ostrom uncovered design principles that made commons more resilient over long time periods, by studying 900+ IRL commons. She was not dogmatic about the rules: in different circumstances the humans in a system could/would adjust. And adjust open has!
Specifically, we ignore many of the Ostrom principles! And yet, open source has (mostly) flourished anyway. In this essay, I try to unpick that, by showing how the differences between code and (say) a river explain which principles we could get away with ignoring.
There’s a lot going on in the RAIL licenses—responding to tech changes; cultural changes; and having the sort of unavoidable adoption that most new licenses can only dream about. I tried to put down some of my thoughts on them here 👇🏽
The license is very clearly not Open Source. And that means companies should use models licensed under it with some trepidation; the essential *predictability* of traditional open is missing.
But it’s not open for a reason—the AI community is deeply concerned with the ethical implications of its work 👇🏽 If what you’re offering as an alternative is “ethics means anything goes”, they’re going to ignore you.
I should preface this by saying that this is all new law, based on analogy to some very old things, and rarely tested in court (because, in programmer terms, there is no CI and each test framework execution is 💸). So lawyers are necessarily making some guesses when discussing it.
Have been thinking a lot about the gap between what people want to be possible, and what is possible, when using the technology of copyright licensing. This paper feels very timely.
The ODbL drafting group was very optimistic about making copyleft possible in data, but we did not think pessimistically enough about the challenges; or perhaps we did but did not communicate them well enough to the client.
(Certainly in license drafting, as in software coding I suppose, clients have every incentive to find the lawyer who will say “yes, I can do that”. We have some institutional and ethical checks on that... but they can only do so much in the face of good-faith optimism.)
Other future blog topics: Do we still expect “mass gathering of innovators and participants” primarily (or even in part) by dint of an open source license? Data clearly says no (median # of participants on both SourceForge and GH: 1), but...
... one wonders if the dev community is now large enough (and the downsides of large-but-sub-IPO popularity bad enough) that “mass gathering” is the wrong goal, and instead healthy, “small” (still bigger than orig Apache!) communities are a better target?
Hot take: the @ml5js license draft is roughly infinitely more interesting than Elastic/Mongo/SSPL.
There are some things I have learned from the furor around the SSPL that inform how I think about evaluating new licenses in a healthy, coherent way. Blog post tomorrow about this, based on my post from last week: blog.tidelift.com/so-you-want-to…
But the arguments around network or non-compete licenses, and to a lesser extent the license text itself, are reasonably well-trod turf at this point. (Not *completely* old; Elastic, by virtue of size and influence, does merit attention as an industry bellwether!)