AndriyMulyar · Apr 13 · 13 tweets
Announcing GPT4All-J: The First Apache-2 Licensed Chatbot That Runs Locally on Your Machine💥
github.com/nomic-ai/gpt4a…

Large Language Models must be democratized and decentralized.
We improve on GPT4All by:
- increasing the number of clean training data points
- removing the GPL-licensed LLaMa from the stack
- releasing easy installers for OSX/Windows/Ubuntu
Details in the technical report: s3.amazonaws.com/static.nomic.a…
GPT4All-J is packaged in an easy-to-use installer. You are a few clicks away from a locally running large language model that can
- answer questions about the world
- write poems and stories
- draft emails and copy
all without the need for internet access.
gpt4all.io
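If you'd rather drive the model from a script than the GUI, the same idea looks roughly like the sketch below using the gpt4all Python bindings. This is a hedged illustration, not the installer's workflow: the model filename and the generation parameters are assumptions for the example.

```python
# Minimal local-inference sketch using the gpt4all Python bindings.
# Assumptions: the package exposes a GPT4All class that loads a local
# quantized checkpoint and a generate() method for completions.
# The model filename below is illustrative.
from gpt4all import GPT4All

model = GPT4All("ggml-gpt4all-j-v1.3-groovy.bin")  # runs on CPU; no internet needed after download

prompt = "Write a short poem about decentralized language models."
response = model.generate(prompt, max_tokens=200)
print(response)
```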
Alongside the installers, we release the training data and model weights, and we perform extensive evaluations against comparable models.
You can explore the final curated training set in Atlas
atlas.nomic.ai/map/gpt4all-j-…

You'll find large regions dedicated to creative prompts like stories and poems in addition to an increased number of multi-turn responses.
The LLM itself is an assistant-finetuned version of GPT-J, an LLM released by EleutherAI under an Apache-2 license.

GPT-J was more difficult to train than LLaMa, but with more high-quality data and the tireless work of @zach_nussbaum we succeeded.
The GPT4All movement grows by the day.
Our community is 10k people strong and filled with elite open-source hackers paving the way to a decentralized future.
We will open-source the data. We will open-source the models. #GPT4All
Join the movement: discord.gg/mGZE39AS3e
A few neat things we learned along the way:
GPT-J suffered from significant overfitting during early experimentation. We were able to identify the data responsible for this by filtering for high-loss training examples in Atlas.

atlas.nomic.ai/map/gpt4all-j-…
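The thread doesn't spell out how those per-example losses were computed, but the general technique is to score each training example with the model's own causal-LM loss and inspect the highest-loss ones (in their case, inside an Atlas map). A rough sketch with Hugging Face transformers, using GPT-2 as a lightweight stand-in so it runs anywhere; the example texts are placeholders:

```python
# Sketch: rank training examples by the model's per-example loss so the
# worst offenders can be inspected (e.g., projected into an Atlas map).
# GPT-2 is a small stand-in here; the thread's model was GPT-J.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

examples = [
    "Prompt: Write a poem about the sea.\nResponse: ...",
    "Prompt: Explain binary search.\nResponse: ...",
]

losses = []
with torch.no_grad():
    for text in examples:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        # labels = input_ids gives the standard causal-LM loss for this example
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

# Highest-loss examples first: candidates for bad or out-of-distribution data
for loss, text in sorted(zip(losses, examples), reverse=True):
    print(f"{loss:.3f}  {text[:60]!r}")
```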
GPT-J really enjoyed memorizing creative examples when we had only a few of them. This directed us to increase the diversity of creative examples leading to a model that could produce more diverse and higher-quality poems, songs, and stories.
GPT-J is certainly a worse model than LLaMa. It was much more difficult to train and prone to overfitting. That difference, however, can be made up with enough diverse and clean data during assistant-style fine-tuning.
The work was worth it. We now have an Apache-2 assistant-style LLM that everyone can build on, built with the contributions and input of the entire GPT4All community.
QED.
But the demonstration certainly is not finished.
We need your help to keep improving. When you download the installer, you will have the chance to opt in to sharing your data with the GPT4All open-source data lake.

By opting in, you help GPT4All models improve in quality and contribute directly to building LLMs.
I suppose I forgot to mention that this model runs on your CPU with 4 GB of RAM at roughly 10 words (tokens) per second.
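For a rough sense of why that fits: GPT-J has about 6 billion parameters, and at roughly 4 bits per weight (an assumption about the shipped ggml-style quantization, not a published spec) the weights alone come to around 3 GB, leaving headroom within 4 GB of RAM. A back-of-envelope sketch:

```python
# Back-of-envelope memory estimate for a quantized GPT-J checkpoint.
# Assumption: ~4 bits per weight (ggml-style 4-bit quantization).
params = 6e9          # GPT-J parameter count, ~6 billion
bits_per_weight = 4   # assumed quantization level
bytes_for_weights = params * bits_per_weight / 8
print(f"~{bytes_for_weights / 1e9:.1f} GB for weights")  # ~3.0 GB
```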

More from @andriy_mulyar

Apr 11
GPT4All does not support or subvert specific political ideologies or choose winners.

google.com/amp/s/news.yah…

open source the data
open source the models
#gpt4all.
As governments realize this foundational technology challenges their power, we can expect more of these types of rulings.

On the research side, this indicates that agendas centered around controllable LLMs will explode. Controlling LLMs is about controlling their training data.
What happens in a world where Pepsi pays companies to upsample instances of Pepsi over Coke in pretraining/fine-tuning, building hidden but corporately useful biases into the LLM?

GPT4All is a foil against this world.
Mar 28
I'm excited to announce the release of GPT4All, a 7B-param language model finetuned on a curated set of 400k GPT-3.5-Turbo assistant-style generations.
We release💰800k data samples💰 for anyone to build upon and a model you can run on your laptop!
Real-time Sampling on M1 Mac
Inspired by learnings from Alpaca, we carefully curated ~800k prompt-response samples down to 430k high-quality assistant-style prompt/generation training pairs, including code, dialogue, and stories.

Detailed procedure for replication and data: github.com/nomic-ai/gpt4a…
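The actual curation pipeline is documented in the linked repo; the snippet below is only an illustrative sketch of the kind of filtering such a step involves (dropping empty or failed generations and exact-duplicate prompts), with made-up example data, not the GPT4All procedure itself.

```python
# Illustrative sketch of prompt/response curation: drop empty or failed
# generations and exact-duplicate prompts. Not the actual GPT4All pipeline;
# see the linked repo for the real procedure.
def curate(pairs):
    seen_prompts = set()
    kept = []
    for prompt, response in pairs:
        if not response or not response.strip():
            continue  # failed or empty generation
        if prompt in seen_prompts:
            continue  # exact duplicate prompt
        seen_prompts.add(prompt)
        kept.append((prompt, response))
    return kept

raw = [
    ("Tell me a story.", "Once upon a time..."),
    ("Tell me a story.", "Once upon a time..."),  # duplicate prompt
    ("Write code.", ""),                          # failed generation
]
print(len(curate(raw)))  # -> 1
```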
Some samples (out of training set)
Valid Python generation with markdown
