Alex J. Champandard 🌱
Jul 1, 2021 · 38 tweets · 14 min read
Watch carefully as GitHub PR tries to (re)define copyright and set a precedent that the licensing terms of open source code don't apply in this case...

Left: Text from June 29th.
Right: Edited text on July 1st.

See FAQ section at copilot.github.com
Just for the record: it's not considered fair use, @github.

It's highly controversial in the community.

Multiple datasets have been removed from the internet because of such legal issues, and courts have forced companies to delete models built on data acquired questionably.
There was a mobile "face app" that had questionable terms of service, and the company used the data to train their models. In the ruling, they were ordered to delete the models.

I can't remember the company or app name, will post below when I find it...
Here it is. techcrunch.com/2021/01/12/ftc…

GitHub is in a similar position because it's using data from users "without properly informing them what it was doing."

In short, "users who did not give express consent to such a use" applies here too, since license terms would be broken.
It certainly is a different case. But just to refute the original premise: it's not generally considered fair use in the community.

It's a controversial topic and what we're seeing now is a coordinated PR campaign until GitHub can help set the standards.
This is a digital version of the Tragedy Of The Commons.

It's very different in practice as there's no scarcity digitally, but the impact is arguably bigger... en.wikipedia.org/wiki/Tragedy_o…
FWIW, even the Creative Commons website has been nudging people for years towards using licenses that can easily be exploited commercially.

You're not approved for "Free Culture" if you don't let multi-nationals profit from your work!
I think there's a small window of opportunity to ensure that new legislation doesn't benefit multi-nationals exclusively.

If this matter is left to sponsored Think Tanks, then regular people will suffer...
GitHub fully understands the importance of this.

They have a legal team working on it. They've said they want to be a part of defining future standards. If that fails, they'll update their Terms of Service.
The law works differently than you think it does! If you have the money and the intention, *everything* can be challenged.

You ask your expensive lawyer: "I want to win this case and establish this precedent" and they'll find many options.
Let's consider the fact that there is no Fair Use directive in the EU, and that in the US there are four factors that can be debated and argued until precedent is set in cases like this.
When you host code on GitHub, it falls under the Terms of Service of a traditional business relationship. We possibly have better protection & recourse (e.g. with EU regulators or the FTC, as above) than if repositories were hosted elsewhere...
They say it's trained on public data, and I highly doubt they trained on private repositories.

GPT-3 and similarly sized models have been shown to memorize training samples verbatim, so it would be a huge breach of contract if private information leaked!
This is the core of the issue.

GitHub's Terms and Conditions do not allow it to train on your code or exploit it commercially. See section D here: docs.github.com/en/github/site…

GitHub can "parse" — but if you claim Deep Learning is just parsing @ylecun will tell you it's extrapolation and that's difficult.

GitHub can "store" — but if you claim Deep Learning is just storing @SchmidhuberAI will find a paper from 1986 to prove you wrong.
Since they trained on content hosted on GitHub, which already falls under their T&C, calling it "Fair Use" is a way to circumvent the explicit license. That could weaken their argument legally, and it makes this very different from other cases.
Facebook regularly trains AI models on photos from Instagram, but I presume they have the right to do anything because their T&C now probably cover everything their lawyers could think of ;-)
Thank you Dan! This is an important thing to do: please make a statement publicly about this or retweet someone else's.
It's easy for machines to track licenses, and companies have a responsibility of due care and are legally bound to respect such contracts, hence they should do so!
Going to wrap up this thread...

TL;DR I'd also argue that GitHub is in a weak position on Fair Use, and an even weaker one because their T&C explicitly do not allow "selling" our code as part of derived works like Copilot.
Reopening this thread with more insights on Copyright and GitHub Copilot. 🔓🖋️

There were so many great / interesting replies I want to track them in one place...
ATTRIBUTION

GitHub is able to track whether generated snippets match the database at the character level, but it's going to require better matching for users to correctly attribute code snippets.
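
As a rough illustration of what such matching could involve (a minimal sketch, not GitHub's actual mechanism; the function names and the 60-character threshold are invented for illustration), a character-level check boils down to finding long exact runs shared between a generated snippet and candidate source files:

```python
import difflib

def longest_verbatim_match(snippet: str, source: str) -> str:
    """Longest character-level run shared by a generated snippet
    and one candidate source file, after whitespace normalization."""
    a = " ".join(snippet.split())
    b = " ".join(source.split())
    m = difflib.SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def attribution_candidates(snippet: str, corpus: dict[str, str], min_chars: int = 60) -> list[str]:
    """Paths of corpus files that share a suspiciously long verbatim run
    with the snippet; these are the candidates for attribution."""
    return [path for path, text in corpus.items()
            if len(longest_verbatim_match(snippet, text)) >= min_chars]
```

Scaling this to billions of lines of code needs an index (suffix arrays, winnowing fingerprints, etc.), which is presumably why better matching is still required for proper attribution.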

LIABILITY

If the generated code has legal implications (e.g. an incompatible license), it's the end users who will be held responsible.

This came up many times!

One solution would be to redesign the product, likely training multiple models based on what licenses are compatible with the end-user's requirements.
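
As a minimal sketch of that redesign (the bucket names and the license table below are illustrative assumptions, not a vetted compatibility matrix), the training corpus could be partitioned by SPDX license family so that each model only ever sees code compatible with its users' requirements:

```python
# Hypothetical mapping from SPDX identifiers to a training bucket;
# a real system would need legal review of every license involved.
LICENSE_BUCKETS = {
    "MIT": "permissive",
    "Apache-2.0": "permissive",
    "BSD-3-Clause": "permissive",
    "GPL-3.0-only": "copyleft",
    "AGPL-3.0-only": "copyleft",
}

def bucket_repositories(repos: list[dict]) -> dict[str, list[str]]:
    """Group repositories by training bucket; each repo dict is assumed
    to carry 'name' and 'spdx_id' fields. Anything unknown or unlicensed
    goes to 'excluded' instead of being treated as fair game."""
    buckets: dict[str, list[str]] = {}
    for repo in repos:
        bucket = LICENSE_BUCKETS.get(repo.get("spdx_id", ""), "excluded")
        buckets.setdefault(bucket, []).append(repo["name"])
    return buckets

# A Copilot-style product would then route a user who only accepts
# permissive terms to the model trained on buckets["permissive"].
```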

UNLICENSED CODE

Assuming all public repositories were used for training, even those without a license: this seems to be yet another case for their legal team to handle! Code with no license is all rights reserved by default, so it's different from claiming the right to train on GPL or non-commercially licensed code.

NEW LICENSES

This also came up many times! Should we design licenses that explicitly specify whether ML models can or cannot be trained on the code?

(Right now GitHub's claim of Fair Use is side-stepping the licenses anyway.)
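
To make the idea of training-aware licenses concrete (everything here is hypothetical: the TRAINING.toml filename and the "ai-training" field are invented for illustration, not an existing standard), such a license extension could expose a machine-readable declaration that crawlers check before ingesting a repository:

```python
import tomllib  # stdlib in Python 3.11+
from pathlib import Path

def training_allowed(repo_root: str) -> bool:
    """Return True only if the repository explicitly opts in to model
    training via a hypothetical TRAINING.toml declaration."""
    manifest = Path(repo_root) / "TRAINING.toml"
    if not manifest.is_file():
        return False  # no declaration means no permission is assumed
    meta = tomllib.loads(manifest.read_text(encoding="utf-8"))
    return meta.get("license", {}).get("ai-training", "deny") == "allow"
```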

VERBATIM CONTENT

The fast inverse square root function is memorized as-is, comments and license included. If Copilot can be used as a search engine, maybe it should be treated as one legally?

DERIVATIVE WORKS

For databases, the index is covered by a copyright separate from the content's: opendatacommons.org/faq/licenses/

However, models like GPT-3 combine both content & index, so the whole thing is a derivative work of the content.

ACCOUNTABILITY

GitHub / Microsoft is accountable in many ways; it just needs someone to actually *hold* them accountable. Otherwise, by default, the company is expected to maximize shareholder value.

MIGRATING PLATFORMS

Code hosted on GitHub falls under their Terms & Conditions, which (as of now) do *not* allow selling your code and do not allow training on it either.

In short, the legal options may actually be better if you stay on GitHub for now...
LEGALESE

These cases are generally fascinating because there are so many different overlapping aspects! Here, for Ever (the face app mentioned above), it was apparently not about copyright but about informed consent for private data.

LOBBYING

Public opinion can often sway juries and even judges. This is why you see PR campaigns alongside the court cases, e.g. Epic vs. Apple.

Copilot's launch is a PR campaign in multiple ways.

FINE LINE

Doesn't Disney have an entire legal department dedicated to ensuring copyright laws get changed just before the copyright on Mickey Mouse expires?

This will be fun to watch...
FAIR USE

You can find the definition of Fair Use (U.S. only) here:
copyright.gov/fair-use/more-…

The four factors are the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the market for the original.

However, because it's case law, that definition should be interpreted through the lens of the many rulings that have happened over the years.

LATE CAPITALISM

Outrage that licenses for content should be respected? Totally justified! It's a foundation of our field. (Companies have legal responsibilities.)

This is a "Think of the Children" argument, appeal to pity: en.wikipedia.org/wiki/Think_of_…

OPTIMISM

I'm very optimistic about how this debate will evolve over the next few months. With or without regulation, I expect consortiums to form (already happening) that specialize in providing high-quality, legal & lawful datasets for companies to train on.
NON-PROFIT RESEARCH

The European Union has you covered here if you're in academia. Also for heritage/cultural projects! ✅

SECRET KEYS

GPT-3-class models are known to leak personally identifiable information. Until now, GitHub has claimed that the secrets Copilot produces are hallucinations that "appear plausible" but aren't real.
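
One crude way to test the "appear plausible" claim (a sketch only; the regex patterns below are illustrative, not exhaustive, and any hit would still need to be verified against the issuing service rather than judged by eye) is to scan completions for well-known secret formats:

```python
import re

# Illustrative patterns for common secret formats; real secret scanners
# cover many more and use entropy plus context to cut false positives
# (the 40-hex-digit pattern will also flag things like git SHAs).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "generic_hex_token": re.compile(r"\b[0-9a-f]{40}\b"),
}

def find_candidate_secrets(completion: str) -> list[tuple[str, str]]:
    """Return (pattern_name, match) pairs found in a model completion."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, m) for m in pattern.findall(completion))
    return hits
```

Anything this flags should be reported and revoked: a plausible-looking credential that turns out to be live is still a leak.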

More from @alexjc

Nov 17, 2023
The OpenAI board's only legal responsibility (for which individuals are accountable) is ensuring AGI is built safely and is broadly beneficial.

The only actions they could take without fear of legal consequences stem from AGI *not* being built safely... openai.com/our-structure
It got bad enough that they were forced to fire Altman, and they did it in such a way that much of the blame is directed towards him — in case there's public outcry and they risk being called out (or sued) for performing their fiduciary duties.
Any other reason (disagreements on alignment) would open the board up to lawsuits from users and investors. Microsoft's stock took a 20B haircut; you don't do that on a Friday for something minor.

That's my working theory!
Apr 11, 2023
With #generative systems in the spotlight, it's important to understand ©. Announcing v1.0 of my:

🚦COPYRIGHT TRAFFIC-LIGHT SYSTEM🚦

🟢 Green: Full Ownership / Assignment
🌕 Yellow: Exclusive Contracts
🟠 Orange: Non-Exclusive Licenses
🔴 Red: Fair Use & Exceptions

[1/5] 🧶
DISCLAIMER: This is designed as a high-level overview, not legal advice. Copyright is tricky, especially internationally, so you need your own lawyer to work out the details!

Even popular licenses have legal risks and caveats due to being untested in courts worldwide...
There's one extra color in the Traffic Light because that's how many different categories and ways there are to use Copyright (or not).

🟢🌕🟠🔴

The colors indicate increasing amount of risk, lower protection, and reduced freedom. All can *theoretically* be done legally!
Mar 7, 2023
UPDATE: About the Cease & Desist sent to @StabilityAI with a deadline March 1st: I got confirmation of receipt but no formal reply.

The C&D included many points detailing how training and distribution must be conducted to be EU-compliant, and they did not have any answer.
I have proceeded in good faith & assuming the best, but I now believe:
- Their involvement in SD 1.x, training of 2.x and likely 3.x is non-compliant.
- They know this is the case and are doing their best to cover up.
- Everything you hear from them is carefully crafted PR only.
Stability believes it can raise another $250 million (rumoured valuation: $4B). Even if they get caught in two years, by the time all court cases and appeals are done they will have unethically cornered the market and destroyed communities without consent.
Mar 7, 2023
When they say #AiArt has no soul, this is what they mean. Only one of those kids (real or not) actually had success, and you can *see* it!

If you are able to reproduce the look of success with #StableDiffusion please post a reply below...
Looking more closely, I think Success Kid also *ate* sand and still had that defiant look — like he's staring down the universe. "I'd do it again."
NOTE: Fingers are not required to have the look of success. In fact, if you have mashed-up AI fingers and still have the look of success, it would enhance the effect.
Mar 5, 2023
⚖️ Rules For Ethical Open-Sourcing ⚖️

If there's a BigTech #AI API available via payment for the feature or model you want to open-source, then it's ethical to open-source yours! 💯

Rationale: Bad actors are already using the API and BigTech is already profiting from that.
⚖️ Rule #2 ⚖️

If your model is less than 10% better on average for a variety of commonly used benchmarks compared to existing open-source models, then it's ethical to open-source yours! 💯
⚖️ Rule #3 ⚖️

If a dedicated team of best-in-class professionals could reproduce your system in less than a week, then it's ethical to open-source yours! 💯
Mar 5, 2023
What's interesting about this proposal in the context of #generative and Copyright: the place with the most creator-friendly legislation can have jurisdiction. Under the Berne Convention, it's where infringement occurs—so that could be the most creator-friendly place you chose.
That is to say, "allowing #generative AI research under ethical standards [...]" that respect Copyright is already regulated by the most human-friendly, pro-creator jurisdiction, and that jurisdiction applies automatically (e.g. wherever the data is hosted or the service is used) as per Berne.
Thus, we have the tools to make this happen for #generative systems. Just need the guts to enforce them now before the U.S. court system normalizes exploitation that goes against Berne...
