3.1/n Continuing the Twitter summary of my keynote talk ("Best practices of using machine learning in businesses") at the @BudapestBI data science conference. Tip #2: Use open source (machine learning tools)
3.2/n A majority of people use R or Python for data science nowadays. One can access most of the best open source machine learning libraries from R and Python. Open source is great not only because it's free, but...
3.3/n ... it also has the largest and most active communities (meetups, conferences), the most widely available documentation (tons of books, blog posts), and one can get fantastic help via mailing lists or Stack Overflow.
3.4/n I switched from a commercial (and pretty expensive) product to an open source tool (R) in 2006, not because of cost, but because already at that time R was better than, or at least as good as, the commercial product for many kinds of problems I had to work on. It has only gotten better since.
3.5/n Some of the open source machine learning tools I have used or tried out (the libraries are usually available from both R and Python): xgboost and lightgbm for GBMs, Keras/TF for deep learning, h2o with many algos, Vowpal Wabbit for large sparse linear models, etc.
3.6/n Based on various sources (talking to people at conferences/meetups, kaggle forums, blog posts, my informal twitter polls etc.) these are also the tools a majority of data scientists are using.
3.7/n The best open source tools are on par with or better than the commercial tools in features and performance, so unlike 10+ years ago, when a majority of people used various expensive tools, nowadays open source rules.
3.8/n If you care about more details on which ML tools to use, I have given several more technical talks on this topic with some comparison/guidance on features and performance. You can find these talks (video recordings) online.
4.1/n Tip #3: Prefer simple to complex. For example, try to avoid distributed computing tools for machine learning (in most cases you don't need them).
4.2/n While there has been a fetish for distributed "big data" ML tools (the bigger the better, you look cool if you have a cluster vs just 1 server etc), most distributed tools are just slow, buggy and lacking in features.
4.3/n There are clear benchmarks showing, for example, Spark being 10x slower, using 10x more RAM and even having accuracy issues (bugs) compared to the best tools for random forests or GBMs.
4.4/n But don't believe me, just try out Spark on a cluster vs lightgbm for example on your laptop and see which one trains faster on your data. 😂😂😂 (and thank me later)
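A minimal sketch of how one might run the laptop half of that comparison in Python. The helper names (`time_fit`, `train_lightgbm`), the binary objective and the 100 boosting rounds are illustrative assumptions, not settings from the talk; swap in your own data and parameters.

```python
import time


def time_fit(fit, *args, **kwargs):
    """Run fit(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fit(*args, **kwargs)
    return result, time.perf_counter() - start


def train_lightgbm(X, y, num_rounds=100):
    """Train a GBM with lightgbm (hypothetical settings, binary target assumed)."""
    import lightgbm as lgb  # imported lazily so the timer itself works without it

    params = {"objective": "binary", "verbose": -1}
    return lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=num_rounds)
```

With a feature matrix `X` and labels `y` already in memory, `model, seconds = time_fit(train_lightgbm, X, y)` gives the single-machine number to set against the cluster run.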
4.5/n And the good news is that you most likely don't need distributed "big data" ML tools. Even if you have terabytes of raw data (e.g. user clicks), after you prepare/refine your data for ML (e.g. user behavior features), your model matrix is much smaller and will fit in RAM.
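A back-of-the-envelope check of the "fits in RAM" point, as a sketch. The row and column counts are hypothetical numbers chosen for illustration, and the estimate assumes a dense matrix of 8-byte floats.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size in GiB of a dense numeric matrix (8 bytes = float64)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Hypothetical example: 10 million users x 100 behavior features, float64.
size = dense_matrix_gib(10_000_000, 100)
print(f"{size:.1f} GiB")  # 7.5 GiB
```

Under these assumptions the refined model matrix is a few GiB, which is small next to terabytes of raw click logs.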
4.6/n Even "big data" companies nowadays have tools that focus on not-so-big data (because very often, even for them, that's the most common use case, at least for ML). If it's good enough for Netflix, for example, it should be good enough for you too.
4.7/n This tip used to be controversial a few years ago, but by now most people have realized/learned this. People just want better/faster single machine ML tools.
4.8/n That's it for today, I'll continue this thread with more tips on machine learning best practices tomorrow 😎 Also here is the link to yesterday's tips in case you have not seen them
Thread by Szilard [Deeper than Deep Learning]