I've been writing Jupyter notebooks for a Python course on alt data/NLP and large datasets, and as a result I've been using many libraries for large datasets. Here are a few takeaways on working with large datasets in Python 1/8
If you've got a small time series dataset, Pandas is often the easiest choice, but what if it doesn't fit in memory? You can batch the calculations yourself, but what about libraries that do this for you? 2/8
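To make the manual "batching" idea concrete, here is a minimal sketch using the `chunksize` argument of `pandas.read_csv`. The column names and the tiny in-memory CSV are made up for illustration; in practice the source would be a file too large to load at once.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file too large to load in one go.
csv_data = io.StringIO(
    "timestamp,price\n"
    "2021-01-01,100.0\n"
    "2021-01-02,101.5\n"
    "2021-01-03,99.0\n"
    "2021-01-04,102.5\n"
    "2021-01-05,103.0\n"
)

total = 0.0
count = 0

# With chunksize set, read_csv yields DataFrames of at most 2 rows each,
# so only one small chunk is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["price"].sum()
    count += len(chunk)

# Combine the per-chunk partial results into the overall mean.
mean_price = total / count
print(mean_price)
```

This works neatly for sums, counts and means, because partial results combine easily. Statistics such as medians or rolling windows that span chunk boundaries need more care.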
Dask seems pretty easy to use. It does all this "batching" for you, looks very much like Pandas, and can work on a cluster too 3/8
Vaex is a newer library. Very impressive how quick it is on massive datasets, as it only loads what it needs from disk. Its feature set isn't as rich as Dask's yet, though, & the syntax can differ from Pandas. 4/8
PySpark is another choice. It was easier than I thought and has Koalas, a Pandas-like interface, but I still found Dask easier to use 5/8
Of course, you can also use databases. SQLite is probably the easiest to use. kdb+/q (OK, not open source) works well with massive time series. 6/8
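For the database route, a minimal sketch with Python's built-in `sqlite3` module could look like the following. The table name, columns and rows are hypothetical, and an in-memory database is used only to keep the example self-contained; a real dataset would live in a file on disk.

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a file path instead
# to store the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ts TEXT, ticker TEXT, price REAL)")

# Hypothetical sample rows for illustration.
rows = [
    ("2021-01-01", "A", 10.0),
    ("2021-01-01", "B", 20.0),
    ("2021-01-02", "A", 12.0),
    ("2021-01-02", "B", 22.0),
]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
conn.commit()

# The aggregation runs inside SQLite, so only the small result set
# is pulled back into Python.
query = "SELECT ticker, AVG(price) FROM prices GROUP BY ticker ORDER BY ticker"
averages = dict(conn.execute(query).fetchall())
conn.close()

print(averages)
```

Pushing the group-by into the database means Python never has to hold the full table, which is the main appeal when the data is bigger than memory.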
A future thread will be on speeding up calculations in general. I'll be talking about Numba, Cython etc. 7/8
Will probably write a blog post on this in a few days' time! Any thoughts welcome on what you've found useful when working with big datasets 8/8
Keep Current with Saeed
