Do you need social media data for your machine learning project?

- Twitter data?
- Reddit data?
- Facebook data?

Where to get it?
Reddit: Pushshift

Pushshift is a big-data storage and analytics project.

Most people know it for its copy of Reddit comments and submissions.

reddit.com/r/pushshift/co…
Reddit: Pushshift API

The pushshift.io Reddit API was designed and created by the /r/datasets mod team to help provide enhanced functionality and search capabilities for searching Reddit comments and submissions.

github.com/pushshift/api
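
To make the search capabilities concrete, here is a minimal sketch that only builds a query URL and does not send a request. The base URL and the parameter names (`q`, `subreddit`, `size`) are assumptions based on the public Pushshift API documentation, and the helper `build_search_url` is hypothetical. Check the repository above for the parameters that are currently supported.

```python
from urllib.parse import urlencode

# Assumed base URL of the public Pushshift search endpoints.
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_search_url(kind, **params):
    """Return a search URL for kind 'comment' or 'submission' (hypothetical helper)."""
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")
    # Drop unset parameters and sort keys so the URL is reproducible.
    query = {key: value for key, value in sorted(params.items()) if value is not None}
    return f"{BASE_URL}/{kind}/?{urlencode(query)}"


url = build_search_url("comment", q="machine learning", subreddit="datasets", size=25)
print(url)
```

The resulting URL can be passed to any HTTP client. The response is expected to be JSON, but that is not verified here.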
Reddit: Pushshift file download

Note: The latest data for manual download is from April 2020

files.pushshift.io/reddit/comment…
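
A rough sketch for reading a downloaded dump, assuming the files are newline-delimited JSON compressed as `.bz2` or `.xz`. The file name `sample_dump.bz2` and the helper `iter_records` are hypothetical. Newer dumps use `.zst`, which needs the third-party `zstandard` package and is not covered here.

```python
import bz2
import json
import lzma
import os
import tempfile

# Map file extensions to stdlib openers (assumed dump formats).
OPENERS = {".bz2": bz2.open, ".xz": lzma.open}


def iter_records(path):
    """Yield one dict per non-empty JSON line, without loading the whole file."""
    extension = os.path.splitext(path)[1]
    opener = OPENERS.get(extension)
    if opener is None:
        raise ValueError(f"unsupported extension: {extension}")
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a tiny generated file so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample_dump.bz2")
    with bz2.open(sample, "wt", encoding="utf-8") as out:
        out.write('{"subreddit": "datasets", "body": "hello"}\n')
        out.write('{"subreddit": "python", "body": "world"}\n')
    records = list(iter_records(sample))

print(len(records), records[0]["subreddit"])
```

Streaming line by line matters because the monthly dumps are far too large to decompress into memory at once.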
Reddit: PMAW: Pushshift Multithread API Wrapper

PMAW is an ultra minimalist wrapper for the Pushshift API which uses multithreading to retrieve Reddit comments and submissions.

If you pull data via Pushshift, use PMAW. Highly recommended!

github.com/mattpodolak/pm…
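
A minimal usage sketch, assuming PMAW's `PushshiftAPI().search_comments(subreddit=..., limit=...)` call as shown in its README. The real call sits in `run_with_pmaw`, which is not executed here because it needs `pip install pmaw` and network access. `FakeAPI` is a hypothetical offline stand-in that only mimics the call shape.

```python
def fetch_comments(api, subreddit, limit=100):
    """Collect up to `limit` comments from `subreddit` as a list of dicts."""
    return list(api.search_comments(subreddit=subreddit, limit=limit))


def run_with_pmaw():
    # Requires `pip install pmaw` and network access; not called in this sketch.
    from pmaw import PushshiftAPI

    api = PushshiftAPI()
    return fetch_comments(api, "datasets", limit=1000)


class FakeAPI:
    """Hypothetical offline stand-in with the same method name as PMAW."""

    def search_comments(self, subreddit, limit):
        return [{"subreddit": subreddit, "body": f"comment {i}"} for i in range(limit)]


comments = fetch_comments(FakeAPI(), "datasets", limit=3)
print(len(comments), comments[0]["body"])
```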
Reddit: Redditsearch

A frontend that uses Pushshift for detailed searches by subreddit or domain.

redditsearch.io
Twitter: Stream as download

The Internet Archive is a digital library of Internet sites and other cultural artifacts in digital form.

Note: The last archived data is from January 2021

archive.org/search.php?que…
Twitter: Tweepy Twitter for Python!

An easy-to-use Python library for accessing the Twitter API.

Note: The downside is Twitter's API rate limits, so collecting a large dataset takes a lot of time.

github.com/tweepy/tweepy
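
To get a feel for "a lot of time", here is a back-of-the-envelope estimate. The numbers used (100 tweets per request, 450 requests per 15-minute window) are assumptions for illustration only. Look up the limits for your own endpoint and access level. The helper `estimate_minutes` is hypothetical and gives a rough upper bound.

```python
import math


def estimate_minutes(total_tweets, tweets_per_request, requests_per_window, window_minutes):
    """Rough upper bound on collection time under a fixed-window rate limit."""
    requests_needed = math.ceil(total_tweets / tweets_per_request)
    windows_needed = math.ceil(requests_needed / requests_per_window)
    return windows_needed * window_minutes


# Assumed limits, for illustration only.
minutes = estimate_minutes(
    total_tweets=1_000_000,
    tweets_per_request=100,
    requests_per_window=450,
    window_minutes=15,
)
print(f"about {minutes} minutes, or {minutes / 60:.1f} hours")
```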
Twitter: Script

Most Twitter scrapers are banned by Twitter or no longer work, so here is a simple, unlimited Twitter scraper written in Python that needs no authentication.

Note: Headless mode no longer works, and the script uses Selenium to access Twitter.

github.com/Altimis/Scweet
Facebook: Scrape Facebook public pages without an API key.

$ pip install facebook-scraper

github.com/kevinzg/facebo…
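
A short sketch of how the scraped posts might be processed, assuming `get_posts(page_name, pages=1)` yields dicts with `post_id`, `text`, and `likes` keys as in the project README. The real call sits in `run_with_facebook_scraper`, which is not executed here because it needs the package and network access. The sample posts and the helper `summarize_posts` are made up for illustration.

```python
def summarize_posts(posts, max_chars=50):
    """Reduce each post dict to an id, a like count, and a short text preview."""
    rows = []
    for post in posts:
        text = post.get("text") or ""
        rows.append(
            {
                "post_id": post.get("post_id"),
                "likes": post.get("likes"),
                "preview": text[:max_chars],
            }
        )
    return rows


def run_with_facebook_scraper(page_name):
    # Requires `pip install facebook-scraper` and network access; not called here.
    from facebook_scraper import get_posts

    return summarize_posts(get_posts(page_name, pages=1))


# Made-up sample data with the assumed key names.
sample_posts = [
    {"post_id": "1", "text": "A new dataset is out today.", "likes": 12},
    {"post_id": "2", "text": None, "likes": 3},
]
rows = summarize_posts(sample_posts, max_chars=13)
print(rows)
```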
Facebook: Large Page-Page Network data

Nodes represent official Facebook pages while the links are mutual likes between sites.

Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site.

snap.stanford.edu/data/facebook-…
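
A minimal sketch for turning the edge list into an undirected graph, since mutual likes have no direction. It assumes the edges come as a CSV with the header `id_1,id_2`. Those column names and the helper `load_adjacency` are assumptions, so check them against the downloaded file. The demo uses a tiny in-memory sample instead of the real data.

```python
import csv
import io
from collections import defaultdict


def load_adjacency(handle, source_col="id_1", target_col="id_2"):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge CSV."""
    adjacency = defaultdict(set)
    for row in csv.DictReader(handle):
        a, b = int(row[source_col]), int(row[target_col])
        # Store both directions because a mutual like is symmetric.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


# Tiny made-up sample in the assumed format.
sample = io.StringIO("id_1,id_2\n0,1\n0,2\n1,2\n2,3\n")
graph = load_adjacency(sample)
degrees = {node: len(neighbors) for node, neighbors in sorted(graph.items())}
print(degrees)
```

For the real file, pass an open file object instead of the `io.StringIO` sample.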
Octoparse: Easy Web Scraping for Anyone

Everything you need to automate your web scraping.

Note: It's a paid service.

octoparse.com
Spread the open source love!

If you know an amazing project, drop me a message @philipvollet

We need that edit function. My inner zen is off balance every time I spot a typo.

