open source the data
open source the models #gpt4all.
As governments realize this foundational technology challenges their power, we can expect more of these types of rulings.
On the research side, this indicates that agendas centered around controllable LLMs will explode. Controlling LLMs is about controlling their training data.
What happens in a world where Pepsi pays companies to upsample mentions of Pepsi over Coke in pretraining/fine-tuning data, building hidden but corporately useful biases into the LLM?
GPT4All is a foil against this world.
This type of BYOB (build your own bias) future can only be prevented by careful data work and open models.
• • •
Announcing GPT4All-J: The First Apache-2 Licensed Chatbot That Runs Locally on Your Machine💥 github.com/nomic-ai/gpt4a…
Large Language Models must be democratized and decentralized.
We improve on GPT4All by:
- increasing the number of clean training data points
- removing the GPL-licensed LLaMa from the stack
- releasing easy installers for OSX/Windows/Ubuntu
Details in the technical report: s3.amazonaws.com/static.nomic.a…
GPT4All-J is packaged in an easy-to-use installer. You are a few clicks away from a locally running large language model that can
- answer questions about the world
- write poems and stories
- draft emails and copy
all without the need for internet access. gpt4all.io
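If you'd rather script it than click through an installer, the gpt4all Python bindings run the same models locally. A minimal sketch, assuming the package is installed; the model filename is illustrative, so swap in whichever GPT4All-J checkpoint you downloaded:

```python
# Minimal local-inference sketch using the gpt4all Python bindings
# (pip install gpt4all). The model filename is illustrative; use whichever
# GPT4All-J checkpoint the installer or gpt4all.io provides.
from gpt4all import GPT4All

model = GPT4All("ggml-gpt4all-j-v1.3-groovy")

# Everything below runs on your own machine; no internet access is needed
# once the weights are downloaded.
prompt = "Draft a short, friendly email postponing tomorrow's meeting to Friday."
print(model.generate(prompt, max_tokens=200, temp=0.7))
```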
I'm excited to announce the release of GPT4All, a 7B-param language model finetuned on a curated set of 400k GPT-3.5-Turbo assistant-style generations.
We release💰800k data samples💰 for anyone to build upon and a model you can run on your laptop!
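If you want to build on the released samples, here is a rough sketch of pulling them down with Hugging Face datasets; the dataset identifier and column names are assumptions, so check the GitHub repo and technical report for the canonical location:

```python
# Sketch: load the released prompt/response samples via Hugging Face datasets.
# The dataset id and column names are assumptions; the GitHub repo and
# technical report give the canonical download location.
from datasets import load_dataset

data = load_dataset("nomic-ai/gpt4all_prompt_generations", split="train")
print(len(data))              # number of prompt/response pairs
print(data[0]["prompt"])      # one assistant-style prompt
print(data[0]["response"])    # and its paired generation
```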
Real-time Sampling on M1 Mac
Inspired by lessons from Alpaca, we carefully curated ~800k prompt-response samples down to 430k high-quality assistant-style prompt/generation training pairs spanning code, dialogue, and stories.
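The thread doesn't spell out the filters, but curation at this scale mostly means dropping empty rows, canned refusals, and duplicate prompts. A hypothetical sketch of that kind of filtering, not the actual pipeline (the technical report has the real criteria):

```python
# Hypothetical curation sketch: the kind of filtering that trims ~800k raw
# prompt/response pairs down to a cleaner subset. Not the actual pipeline.
def keep(pair):
    prompt, response = pair["prompt"].strip(), pair["response"].strip()
    if not prompt or not response:
        return False                                   # drop empty rows
    if "as an ai language model" in response.lower():
        return False                                   # drop canned refusals
    return True

def curate(pairs):
    seen, kept = set(), []
    for pair in pairs:
        key = pair["prompt"].strip().lower()
        if key in seen:
            continue                                   # drop duplicate prompts
        if keep(pair):
            seen.add(key)
            kept.append(pair)
    return kept
```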