Philip Vollet

2 Jun, 14 tweets, 6 min read

Do you need social media data for your machine learning project?

- Twitter data?
- Reddit data?
- Facebook data?

Where to get it?

Reddit: Pushshift

Pushshift is a big-data storage and analytics project.

Most people know it for its copy of Reddit comments and submissions.

reddit.com/r/pushshift/co…

Reddit: Pushshift API

The pushshift.io Reddit API was designed and created by the /r/datasets mod team to help provide enhanced functionality and search capabilities for searching Reddit comments and submissions.

github.com/pushshift/api
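
A minimal sketch of how a search request to the API could be assembled. The base URL and the parameter names (`q`, `subreddit`, `size`) are assumptions taken from the project's README and should be checked against the current documentation. `build_search_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

# Assumption: base URL and parameter names follow the Pushshift API README;
# verify them before use.
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_search_url(kind, **params):
    """Build a search URL for 'comment' or 'submission' queries."""
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{kind}/?{query}"


url = build_search_url("comment", q="machine learning", subreddit="datasets", size=25)
print(url)
```

The URL can then be fetched with any HTTP client; the sketch stops at building it so it runs without network access.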

Reddit: Pushshift file download

Note: The latest data for manual download is from April 2020

files.pushshift.io/reddit/comment…

Reddit: PMAW: Pushshift Multithread API Wrapper

PMAW is an ultra minimalist wrapper for the Pushshift API which uses multithreading to retrieve Reddit comments and submissions.

If you pull data via Pushshift, use PMAW. Highly recommended!

github.com/mattpodolak/pm…
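
To illustrate why multithreading helps here, the sketch below splits a time range into windows and fetches them on a thread pool. This is only the general idea, not PMAW's actual implementation; `split_windows`, `fetch_parallel` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor


def split_windows(start, end, n):
    """Split [start, end) into n contiguous (after, before) integer windows."""
    step = (end - start) / n
    edges = [start + round(i * step) for i in range(n)] + [end]
    return list(zip(edges[:-1], edges[1:]))


def fetch_parallel(fetch_window, start, end, n_threads=4):
    """Call fetch_window(after, before) for each window on a thread pool."""
    windows = split_windows(start, end, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() preserves window order, so results stay chronological.
        batches = pool.map(lambda w: fetch_window(*w), windows)
        return [item for batch in batches for item in batch]


# Hypothetical stand-in for a real HTTP call: one fake ID per 10 seconds.
def fake_fetch(after, before):
    return list(range(after, before, 10))


items = fetch_parallel(fake_fetch, 0, 100, n_threads=4)
print(len(items), "items fetched")
```

Because each request spends most of its time waiting on the network, threads are usually enough; no multiprocessing is needed for this kind of workload.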

Reddit: Redditsearch

A frontend that uses Pushshift for detailed searches by subreddit or domain.

redditsearch.io

Twitter: Stream as download

The Internet Archive is a digital library of Internet sites and other cultural artifacts in digital form.

Note: The last archived data is from January 2021

archive.org/search.php?que…
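
A sketch for reading one file from such a download, assuming the archive unpacks to bz2-compressed files with one JSON object per line and that tweets carry a `text` field while other events (for example deletions) do not. Check a sample file first; `iter_tweets` is a hypothetical helper.

```python
import bz2
import json


def iter_tweets(path):
    """Yield tweet dicts from a bz2-compressed file with one JSON object per line."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: non-tweet events such as deletions have no "text" key.
            if "text" in record:
                yield record
```

Reading line by line keeps memory use flat, which matters because these dumps are large.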

Twitter: Tweepy, Twitter for Python!

An easy-to-use Python library for accessing the Twitter API.

Note: The downside is Twitter's API rate limits, so collecting a large dataset takes a lot of time.

github.com/tweepy/tweepy
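
A rough way to see how much time rate limits cost is a back-of-the-envelope estimate. The numbers below (tweets per request, requests per 15-minute window) are illustrative assumptions, not Twitter's actual limits, which depend on the endpoint and access tier. `hours_to_collect` is a hypothetical helper.

```python
import math


def hours_to_collect(total_tweets, tweets_per_request, requests_per_window, window_minutes=15):
    """Estimate an upper bound on the hours needed under a fixed-window rate limit."""
    requests_needed = math.ceil(total_tweets / tweets_per_request)
    windows_needed = math.ceil(requests_needed / requests_per_window)
    return windows_needed * window_minutes / 60


# Illustrative numbers only: 1,000,000 tweets, 100 per request, 180 requests per window.
estimate = hours_to_collect(1_000_000, 100, 180)
print(f"{estimate:.1f} hours")
```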

Twitter: Script

Most Twitter scrapers are banned by Twitter or no longer work, so here is a simple and unlimited Twitter scraper written in Python that needs no authentication.

Note: Headless mode no longer works, and it uses Selenium to access Twitter.

github.com/Altimis/Scweet

Facebook: Scrape Facebook public pages without an API key.

$ pip install facebook-scraper

github.com/kevinzg/facebo…
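
A minimal usage sketch, assuming the package exposes `get_posts(page_name, pages=...)` and yields dicts with `post_id`, `text` and `likes` keys as described in its README. The helper names `summarize_posts` and `fetch_page` are hypothetical.

```python
def summarize_posts(posts, max_chars=80):
    """Reduce post dicts to (post_id, likes, shortened text) tuples."""
    rows = []
    for post in posts:
        # "text" may be missing or None for posts without a caption.
        text = (post.get("text") or "").replace("\n", " ")
        rows.append((post.get("post_id"), post.get("likes", 0), text[:max_chars]))
    return rows


def fetch_page(page_name, pages=1):
    """Fetch posts from a public page; needs facebook-scraper and network access."""
    # Imported lazily so summarize_posts() works without the package installed.
    from facebook_scraper import get_posts
    return summarize_posts(get_posts(page_name, pages=pages))
```

Calling `fetch_page("nintendo")` would hit the network; `summarize_posts` can be tried offline on any list of dicts.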

Facebook: Large Page-Page Network data

Nodes represent official Facebook pages while the links are mutual likes between sites.

Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site.

snap.stanford.edu/data/facebook-…
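
A sketch for a first look at the graph: read the edge list and count links per page. It assumes the download contains a CSV edge file with a header row and two integer ID columns (for example `id_1,id_2`); the exact file name and column names should be checked in the archive. `read_edges` and `degree_counts` are hypothetical helpers.

```python
import csv
from collections import Counter


def read_edges(path):
    """Read (id_1, id_2) pairs from a CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return [(int(a), int(b)) for a, b in reader]


def degree_counts(edge_rows):
    """Count how many mutual-like links each page has."""
    degrees = Counter()
    for source, target in edge_rows:
        degrees[source] += 1
        if target != source:  # count a self-loop only once
            degrees[target] += 1
    return degrees
```

`degree_counts(read_edges(path)).most_common(10)` would then list the ten best-connected pages.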

Octoparse: Easy Web Scraping for Anyone

Everything you need to automate your web scraping.

Note: It's a paid service.

octoparse.com

@philipvollet

Spread the open source love!

If you know an amazing project, drop me a message @philipvollet

we need this edit function. my inner zen isn't balanced every time i spot a typo


