Philip Vollet
Jun 2, 2021, 14 tweets
Do you need social media data for your machine learning project?

- Twitter data?
- Reddit data?
- Facebook data?

Where to get it?
Reddit: Pushshift

Pushshift is a big-data storage and analytics project.

Most people know it for its copy of Reddit comments and submissions.

reddit.com/r/pushshift/co…
Reddit: Pushshift API

The pushshift.io Reddit API was designed and created by the /r/datasets mod team to help provide enhanced functionality and search capabilities for searching Reddit comments and submissions.

github.com/pushshift/api
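
A minimal sketch of what a query against the API could look like, using only the standard library. The endpoint path and the parameter names (`q`, `subreddit`, `size`) are assumptions taken from the project's README as of 2021 and may have changed since; `build_search_url` and `fetch` are hypothetical helper names, not part of Pushshift.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base endpoint (from the pushshift/api README, 2021); verify before use.
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_search_url(kind, **params):
    """Build a search URL for 'comment' or 'submission' queries."""
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")
    # Drop unset parameters so the query string stays short.
    query = {key: value for key, value in params.items() if value is not None}
    return f"{BASE_URL}/{kind}/?{urlencode(query)}"


def fetch(url, timeout=30):
    """Download one page of results; the payload is assumed to sit under 'data'."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response).get("data", [])


if __name__ == "__main__":
    # Only prints the URL; call fetch(url) yourself to hit the network.
    url = build_search_url("comment", q="machine learning", subreddit="datasets", size=25)
    print(url)
```

Keeping URL construction separate from the download makes it easy to check a query before sending it.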
Reddit: Pushshift file download

Note: The latest data for manual download is from April 2020

files.pushshift.io/reddit/comment…
Reddit: PMAW: Pushshift Multithread API Wrapper

PMAW is an ultra minimalist wrapper for the Pushshift API which uses multithreading to retrieve Reddit comments and submissions.

If you pull data via Pushshift, use PMAW. Highly recommended!

github.com/mattpodolak/pm…
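
A short sketch of pulling a month of comments with PMAW, assuming `pip install pmaw` and the `PushshiftAPI().search_comments(...)` call shown in its README. The `to_epoch` and `fetch_comments` helpers are hypothetical names added for illustration; the subreddit and dates are placeholders.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day):
    """Convert a UTC calendar date to epoch seconds for the after/before filters."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def fetch_comments(subreddit, after, before, limit=1000):
    """Retrieve comments with PMAW; requires `pip install pmaw`."""
    # Imported lazily so to_epoch() stays usable without the package installed.
    from pmaw import PushshiftAPI

    api = PushshiftAPI()
    comments = api.search_comments(
        subreddit=subreddit, after=after, before=before, limit=limit
    )
    # Materialize the response so it can be inspected or saved later.
    return list(comments)


if __name__ == "__main__":
    start, end = to_epoch(2021, 1, 1), to_epoch(2021, 2, 1)
    print(start, end)
    # comments = fetch_comments("datasets", after=start, before=end, limit=500)
```

The network call is left commented out so the script runs without touching the API.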
Reddit: Redditsearch

A frontend that uses Pushshift for detailed searches by subreddit or domain.

redditsearch.io
Twitter: Stream as download

The Internet Archive is a digital library of Internet sites and other cultural artifacts in digital form.

Note: The last archived data is from January 2021

archive.org/search.php?que…
Twitter: Tweepy, Twitter for Python!

An easy-to-use Python library for accessing the Twitter API.

Note: The downside is Twitter's API limits, so collecting a large dataset takes a lot of time.

github.com/tweepy/tweepy
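
To get a feel for how much time the limits cost, here is a rough back-of-the-envelope sketch. It assumes a fixed-window rate limit; the numbers in the example (100 tweets per request, 180 requests per 15-minute window) are illustrative assumptions only, so check the limits of your own access tier. `estimate_collection_minutes` is a hypothetical helper, not part of Tweepy.

```python
import math


def estimate_collection_minutes(total_tweets, tweets_per_request,
                                requests_per_window, window_minutes):
    """Upper-bound the wall-clock minutes needed under a fixed-window rate limit."""
    requests_needed = math.ceil(total_tweets / tweets_per_request)
    windows_needed = math.ceil(requests_needed / requests_per_window)
    return windows_needed * window_minutes


if __name__ == "__main__":
    # Illustrative numbers only (assumptions, not quoted from the thread).
    minutes = estimate_collection_minutes(
        total_tweets=1_000_000,
        tweets_per_request=100,
        requests_per_window=180,
        window_minutes=15,
    )
    print(f"{minutes / 60:.1f} hours")
```

Under these assumed numbers, a million tweets works out to roughly 14 hours of waiting.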
Twitter: Script

Most Twitter scrapers are banned by Twitter or no longer work, so here is a simple, unlimited Twitter scraper written in Python that needs no authentication.

Note: Headless mode no longer works, and it uses Selenium to access Twitter.

github.com/Altimis/Scweet
Facebook: Scrape Facebook public pages without an API key.

$ pip install facebook-scraper

github.com/kevinzg/facebo…
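
A small sketch of how the package can be used once installed, assuming the `get_posts(account, pages=...)` generator and the `text` / `time` post fields described in its README. `clean_text` and `collect_posts` are hypothetical helpers, and the page name is a placeholder.

```python
def clean_text(text):
    """Collapse runs of whitespace so each post fits on one line."""
    return " ".join((text or "").split())


def collect_posts(page_name, pages=2):
    """Return (time, text) pairs from a public page; requires facebook-scraper."""
    # Imported lazily so clean_text() stays usable without the package installed.
    from facebook_scraper import get_posts

    rows = []
    for post in get_posts(page_name, pages=pages):
        rows.append((post.get("time"), clean_text(post.get("text"))))
    return rows


if __name__ == "__main__":
    print(clean_text("Hello\n\n  world \t!"))
    # rows = collect_posts("nintendo", pages=1)  # placeholder page name
```

Scraping is left commented out so the script runs offline; uncomment it to fetch real posts.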
Facebook: Large Page-Page Network data

Nodes represent official Facebook pages while the links are mutual likes between sites.

Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site.

snap.stanford.edu/data/facebook-…
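
A minimal sketch of turning such an edge list into an adjacency map with the standard library. The column names `id_1` and `id_2` are assumptions about the downloaded CSV, so adjust them to the actual header; `load_undirected_edges` is a hypothetical helper, and the sample data below is made up.

```python
import csv
import io
from collections import defaultdict


def load_undirected_edges(csv_file, source_col="id_1", target_col="id_2"):
    """Read an edge-list CSV into a node -> set-of-neighbors mapping."""
    adjacency = defaultdict(set)
    for row in csv.DictReader(csv_file):
        a, b = int(row[source_col]), int(row[target_col])
        # Mutual likes are symmetric, so store each edge in both directions.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


if __name__ == "__main__":
    # Tiny made-up sample; pass an open file handle for the real edge list.
    sample = io.StringIO("id_1,id_2\n0,1\n0,2\n1,2\n2,3\n")
    graph = load_undirected_edges(sample)
    degrees = {node: len(neighbors) for node, neighbors in sorted(graph.items())}
    print(degrees)
```

From here the mapping can be handed to a graph library or used directly for degree statistics.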
Octoparse: Easy Web Scraping for Anyone

Everything you need to automate your web scraping.

Note: It's a paid service.

octoparse.com
Spread the open source love!

If you know an amazing project, drop me a message @philipvollet
we need this edit function. my inner zen isn't balanced every time i spot a typo
