Where to get data for your next machine learning project?

An overview of 10 great resources to accelerate your next project with data!

- Hugging Face Datasets
- Big Bad NLP Datasets
- Papers with Code Datasets
- Registry of Open Data on AWS
- Azure public data sets
- Carnegie Mellon University
- Google Dataset Search
- Awesome Public Datasets
- Kaggle Datasets
- Data.gov
Hugging Face Datasets

Mainly for NLP, but the good news is that Hugging Face is expanding, and we can expect them to add datasets for visual machine learning soon!

@huggingface

huggingface.co/datasets
Big Bad NLP Datasets

One of the best sources for sophisticated Natural Language Processing datasets

@Quantum_Stat

datasets.quantumstat.com
Papers with Code Datasets

3,830 machine learning datasets with excellent search and a well-curated selection

@paperswithcode

paperswithcode.com/datasets
Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources

registry.opendata.aws
Azure public data sets

Public data sets for testing and prototyping.

A bit outdated, but a credible source.

docs.microsoft.com/en-us/azure/az…
Carnegie Mellon University

A listing of 750 databases, datasets, and research support tools.

guides.library.cmu.edu/az.php
Google Dataset Search

Searching for datasets on Google Dataset Search is as simple as searching for anything on Google Search: just enter the topic you need a dataset for.

datasetsearch.research.google.com
Awesome Public Datasets

A topic-centric list of high-quality open datasets.

github.com/awesomedata/aw…
Kaggle Datasets

Explore, analyze, and share quality data.

kaggle.com/datasets
The home of the U.S. Government’s open data

Here you will find data, tools, and resources to conduct research, develop web and mobile applications, and design data visualizations.

data.gov


