Tweet

How to get URL link on Twitter App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

BigCode

@BigCodeProject

Jun 8 • 7 tweets • 4 min read Twitter logo

Read on Twitter

📣 Introducing ⭐ StarCoder+ & StarChat Beta!

We trained StarCoder on the Falcon model's English web dataset and Instruction-tuned it. Both models rank high in the LLM leaderboard, with strong natural language performance and coding capabilities.

huggingface.co/HuggingFaceH4/…

StarCoderBase showed promise in natural language reasoning despite being trained solely on GitHub code. So we fine-tuned it on the English web dataset used in Falcon pre-training:

huggingface.co/bigcode/starco…
huggingface.co/datasets/tiiua…

The result: ⭐ StarCoder+ a powerful English Language Model with strong coding abilities.

It outperforms all LLaMa models and PaLM-540B on HumanEval and stands out in the LLM leaderboard for < 30B models with a 45.1 MMLU score!

huggingface.co/spaces/Hugging…

We instruction-tuned StarCoder+ on the OpenAssistant Guanaco dataset to get StarChat-beta: a strong chat assistant

Model: huggingface.co/HuggingFaceH4/…
Demo: huggingface.co/spaces/Hugging…

It can build HTML websites and much more...

Give it a try 🚀

📔 Resources:

StarCoderPlus: huggingface.co/bigcode/starco…
StarChat Beta: huggingface.co/HuggingFaceH4/…
StarChat demo: huggingface.co/spaces/Hugging…
StarCoderPlus demo: huggingface.co/spaces/bigcode…

https://twitter.com/BigCodeProject/status/1666856107665666048

Back to the start:

https://twitter.com/BigCodeProject/status/1666856107665666048

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @BigCodeProject

BigCode

@BigCodeProject

May 4

Introducing: 💫StarCoder

StarCoder is a 15B LLM for code with 8k context and trained only on permissive data in 80+ programming languages. It can be prompted to reach 40% pass@1 on HumanEval and act as a Tech Assistant.

Try it here: shorturl.at/cYZ06r

Release thread🧵

In addition to chatting with StarCoder, it can also help you code in the new VSCode plugin. By pressing CTRL+ESC you can also check if the current code was in the pretraining dataset!

marketplace.visualstudio.com/items?itemName…

Today we release two open-access models!

StarCoderBase: trained on 1T tokens in 80+ programming languages huggingface.co/bigcode/starco…

StarCoder: additionally trained on 35B Python tokens that can be prompted to reach 40.8% pass@1 huggingface.co/bigcode/starco…

Read 11 tweets

BigCode

@BigCodeProject

Dec 22, 2022

Announcing a holiday gift: 🎅SantaCoder - a 1.1B multilingual LM for code that outperforms much larger open-source models on both left-to-right generation and infilling!

Demo: hf.co/spaces/bigcode…
Paper: hf.co/datasets/bigco…
Attribution: hf.co/spaces/bigcode…

A🧵:

SantaCoder is trained on Python, Java, and JavaScript and outperforms other large multilingual models such as InCoder (6.7B) or CodeGen-multi (2.7B) considerably!

A lot of pieces from a lot of collaborators came together to get to that result:

The foundation to train SantaCoder is The Stack (v1.1) dataset. Given the relatively small size of our model (1B parameters) we chose three popular programming languages: Python, Java, and JavaScript.

You can check if your code was used for training here: huggingface.co/spaces/bigcode…

Read 15 tweets

BigCode

@BigCodeProject

Nov 29, 2022

Between now and Christmas🎄 we are running a series on experiments to figure out what the best pre-processing is for code datasets such as The Stack. We'll share the W&B dashboards of these 🎅-models so if you are interested you can follow along!

We are training ~1B parameter models on the Python/Java/JavaScript subset of The Stack. On the architecture side we want to evaluate the Fill-in-the-Middle (FIM) objective, as well as multi-query attention.

For preprocessing we are looking at GitHub stars/forks, tokenizer fertility, comment/code ratio, more near-deduplication and other heuristics. The experiments use the new version of The Stack that we'll release soon which removed opt-out requests and refined license filters.

Read 7 tweets

BigCode

@BigCodeProject

Oct 27, 2022

Introducing 📑 The Stack - a 3TB dataset of permissively licensed code in 30 programming languages.

hf.co/datasets/bigco…

You want your code excluded from the model training? There is an opt-out form and data governance plan:

bigcode-project.org/docs/about/the…

Let's take a tour🧵

Dataset collection: With gharchive.org over 220M repos were identified and 137M successfully cloned with over 50B files and 90TB of data. Filtered by extension and permissive licenses this yields 3TB of data. We also make a near-deduplicated version (1.5TB) available.

The dataset includes ~30 programming languages covering common languages such as Java, C/C++ and Python as well as lower resource languages (2GB of Dockerfiles 🐳). If you'd like to see a new language added, feel free to add it in this issue:

github.com/orgs/bigcode-p…

Read 12 tweets

BigCode

@BigCodeProject

Sep 26, 2022

@ServiceNowRSRCH

print("Hello world! 🎉")

Excited to announce the BigCode project led by @ServiceNowRSRCH and @huggingface! In the spirit of BigScience we aim to develop large language models for code in an open and responsible way.

Join here: bigcode-project.org/docs/about/joi…

A thread with our goals🧵

🌸Language models for code (Codex, CodeGen) and the applications they power (AI assisted programming) are gaining traction. Some models have been released, but there are still questions around data governance, robustness of evaluation benchmarks, and the engineering behind them.

📚The first goal of BigCode is to develop and release a dataset large enough to train a state-of-the-art language model for code. We will also ensure that only files from repositories with permissive licenses go into the dataset.

Read 9 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter Twitter Thread URL to Unroll

BigCode

People who liked this thread also liked...

Try unrolling a thread yourself!

More from @BigCodeProject

BigCode

BigCode

BigCode

BigCode

BigCode

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!