Watch carefully as GitHub PR tries to (re)define copyright and set a precedent that the licensing terms of open source code don't apply in this case...
Left: Text from June 29th.
Right: Edited text on July 1st.
Just for the record: it's not considered fair use, @github.
It's highly controversial in the community.
Multiple datasets have been removed from the internet because of such legal issues, and courts have forced companies to delete models built on data acquired questionably.
There was a mobile "face app" that had questionable terms of service, and the company used the data to train their models. In the ruling, they were ordered to delete the models.
I can't remember the company or app name, will post below when I find it...
They have a legal team working on it. They said they want to be a part of defining future standards. If that fails, they'll update their Terms Of Service.
Let's consider the fact that there is no Fair Use directive in the EU, and that in the US there are four factors that can be debated and argued until precedent is set in such cases.
When hosting code on GitHub, it would fall under the Terms Of Service of a traditional business relationship. We possibly have better protection & recourse (e.g. with EU Regulators or FTC, as above) than if repositories were hosted elsewhere...
GitHub can "parse" — but if you claim Deep Learning is just parsing @ylecun will tell you it's extrapolation and that's difficult.
GitHub can "store" — but if you claim Deep Learning is just storing @SchmidhuberAI will find a paper from 1986 to prove you wrong.
Since they trained on content hosted on GitHub, which already falls under their T&C, calling it "Fair Use" is a way to circumvent the explicit license — which could weaken their argument legally and makes this very different from other cases.
Facebook regularly trains AI models on photos from Instagram, but I presume they have the right to do anything because their T&C now probably cover everything their lawyers could think of ;-)
Thank you Dan! This is an important thing to do: please make a statement publicly about this or retweet someone else's.
It's easy for machines to track licenses, and companies have a duty of care & are legally bound to respect such contracts, hence they should do so!
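To illustrate just how easy machine-readable license tracking is (the helper name and sample here are mine, not from any real tool): many repositories already declare their licensing via SPDX identifiers, which a few lines of Python can extract.

```python
import re

# Matches SPDX license expressions, e.g. "MIT" or "GPL-3.0-or-later OR Apache-2.0"
SPDX_RE = re.compile(
    r"SPDX-License-Identifier:\s*"
    r"([\w.+-]+(?:\s+(?:OR|AND|WITH)\s+[\w.+-]+)*)"
)

def find_spdx_licenses(text: str) -> list[str]:
    """Return every SPDX license expression declared in the text."""
    return SPDX_RE.findall(text)

sample = "// SPDX-License-Identifier: GPL-3.0-or-later\nint main(void) { return 0; }\n"
print(find_spdx_licenses(sample))  # ['GPL-3.0-or-later']
```

A crawler that runs this over every file it trains on could record, per snippet, exactly which license terms apply.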
TL;DR I'd also argue that GitHub is in a weak position on Fair Use, and an even weaker one because their T&C explicitly do not allow "selling" our code as part of derived works like Copilot.
Reopening this thread with more insights on Copyright and GitHub Copilot. 🔓🖋️
There were so many great / interesting replies I want to track them in one place...
ATTRIBUTION
GitHub is able to track whether generated snippets match the database at the character level, but better matching will be required for users to attribute code snippets correctly.
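A minimal sketch of what character-level matching could look like (an assumed approach with hypothetical names, not GitHub's actual matcher): strip whitespace, then check for verbatim containment in a corpus of licensed files.

```python
def normalize(code: str) -> str:
    """Drop all whitespace so formatting differences don't hide a verbatim match."""
    return "".join(code.split())

def find_matches(snippet: str, corpus: dict[str, str]) -> list[str]:
    """Return the corpus files whose contents contain the snippet character-for-character."""
    needle = normalize(snippet)
    return [path for path, text in corpus.items() if needle in normalize(text)]

# Hypothetical corpus entry standing in for a licensed source file
corpus = {"quake3/q_math.c": "i = 0x5f3759df - ( i >> 1 );  // what the"}
print(find_matches("i = 0x5f3759df - (i >> 1);", corpus))  # ['quake3/q_math.c']
```

Exact containment like this breaks as soon as identifiers are renamed, which is why "better matching" for attribution would need fuzzier techniques (e.g. fingerprinting or winnowing) rather than this naive check.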
Assuming all public repositories were used for training, even those without a license: this seems to be yet another case for their legal team to handle! It's different from claiming the right to train on GPL or non-commercially licensed code.
The fast inverse square root is memorized as-is, comments and license included. If Copilot can be used as a search engine, maybe it should be treated as one legally?
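For context, the routine in question is Quake III's bit-trick approximation of 1/sqrt(x). The original is C, complete with its famous comments; a rough Python port of the same idea, for illustration only, looks like:

```python
import struct

def q_rsqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via the famous bit-level trick plus one Newton step."""
    # Reinterpret the float's bits as a 32-bit integer
    i = struct.unpack("<i", struct.pack("<f", x))[0]
    # The "magic constant" shift gives a good first guess
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<i", i))[0]
    # One iteration of Newton's method refines the estimate
    return y * (1.5 - 0.5 * x * y * y)

print(q_rsqrt(4.0))  # close to 1/sqrt(4) = 0.5
```

The point for the copyright debate: a snippet this distinctive (magic constant and all) being reproduced verbatim is hard to describe as anything other than retrieval.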
GitHub / Microsoft is accountable in many ways, it just needs someone to actually *hold* them accountable — otherwise (by default) the company is expected to maximize shareholder value.
These cases are generally fascinating because there are so many different overlapping aspects! Here, for Ever (face database), it was apparently not about copyright but informed consent for private data.
I'm very optimistic about how this debate will evolve over the next months. With or without regulation, I expect consortiums to form (it's already happening) that specialize in providing high-quality, legal & lawful datasets for companies to train on.
NON-PROFIT RESEARCH
The European Union has you covered if you're in academia here. Also for heritage/cultural projects! ✅
GPT-3 models are known to leak personally identifiable information. Until now, GitHub claimed that secrets were hallucinations that "appear plausible" but aren't.
The OpenAI board's only legal responsibility (for which individuals are accountable) is ensuring AGI is built safely and is broadly beneficial.
The only actions they could take without fear of legal consequences stem from AGI *not* being built safely... openai.com/our-structure
It got bad enough that they were forced to fire Altman, and they did it in such a way that much of the blame is directed towards him — in case there's public outcry and they risk being called out (or sued) for performing their fiduciary duties.
Any other reason (disagreements on alignment) would open the board up to lawsuits from users and investors. Microsoft's stock took a $20B haircut; you don't do that on a Friday for something minor.
🟢 Green: Full Ownership / Assignment
🌕 Yellow: Exclusive Contracts
🟠 Orange: Non-Exclusive Licenses
🔴 Red: Fair Use & Exceptions
[1/5] 🧶
DISCLAIMER: This is designed as a high-level overview, not legal advice. Copyright is tricky, especially internationally, so you need your own lawyer to work out the details!
Even popular licenses have legal risks and caveats due to being untested in courts worldwide...
There's one extra color compared to a regular traffic light because there are four distinct categories of ways to use Copyright (or not).
🟢🌕🟠🔴
The colors indicate increasing risk, lower protection, and reduced freedom. All can *theoretically* be done legally!
UPDATE: About the Cease & Desist sent to @StabilityAI with a deadline March 1st: I got confirmation of receipt but no formal reply.
The C&D included many points detailing how training and distribution must be conducted to be EU compliant, and they did not have any answer.
I have proceeded in good faith & assuming the best, but I now believe:
- Their involvement in SD 1.x, training of 2.x and likely 3.x is non-compliant.
- They know this is the case and are doing their best to cover up.
- Everything you hear from them is carefully crafted PR only.
Stability believes it can raise another $250 million (rumoured valuation: $4B). Even if they get caught in two years, by the time all the court cases and appeals are done they will have unethically cornered the market and destroyed communities without consent.
Looking more closely, I think Success Kid also *ate* sand and still had that defiant look — like he's staring down the universe. "I'd do it again."
NOTE: Fingers are not required to have the look of success. In fact, if you have mashed-up AI fingers and still have the look of success, it would enhance the effect.
If there's a BigTech #AI API available via payment for the feature or model you want to open-source, then it's ethical to open-source yours! 💯
Rationale: Bad actors are already using the API and BigTech is already profiting from that.
⚖️ Rule #2 ⚖️
If your model is less than 10% better on average for a variety of commonly used benchmarks compared to existing open-source models, then it's ethical to open-source yours! 💯
⚖️ Rule #3 ⚖️
If a dedicated team of best-in-class professionals could reproduce your system in less than a week, then it's ethical to open-source yours! 💯
What's interesting about this proposal in the context of #generative and Copyright: the place with the most creator-friendly legislation can have jurisdiction. Under the Berne Convention, it's where infringement occurs—so that could be the most creator-friendly place you chose.
That is to say, ethical standards "allowing #generative AI research [...]" that build on Copyright are already backed by the most human-friendly, pro-creator legislation, and jurisdiction attaches automatically (e.g. wherever the data is hosted or the service is used) as per Berne.
Thus, we have the tools to make this happen for #generative systems. Just need the guts to enforce them now before the U.S. court system normalizes exploitation that goes against Berne...