Preparing for a technical interview for a #DataScience position? These are some of the questions that typically allow me as an interviewer to quickly distinguish between juniors and mediors, including some quick tips 🧵. #Python #pythonprogramming #DataScientist #Jobs
Anything about SQL. Not the hardest thing to learn, but many #DataScientists only start to see the value of SQL once they actually become part of a dev team. I’m not only talking about SELECT * FROM table, but also about joins, truncates, partitions and constraints.
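A minimal sketch using Python’s stdlib sqlite3, showing a constraint and a join beyond SELECT * FROM table — the tables and data are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Constraints: PRIMARY KEY, NOT NULL and FOREIGN KEY enforce integrity in the database itself.
con.execute("CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
con.execute("""CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,
    team_id   INTEGER REFERENCES teams(team_id),
    name      TEXT NOT NULL)""")
con.executemany("INSERT INTO teams VALUES (?, ?)", [(1, "Ajax"), (2, "PSV")])
con.executemany("INSERT INTO players VALUES (?, ?, ?)",
                [(10, 1, "A"), (11, 2, "B"), (12, None, "C")])

# LEFT JOIN keeps players without a team; an INNER JOIN would silently drop row C.
rows = con.execute("""SELECT p.name, t.name
                      FROM players p
                      LEFT JOIN teams t ON p.team_id = t.team_id""").fetchall()
print(rows)  # [('A', 'Ajax'), ('B', 'PSV'), ('C', None)]
```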
Interacting with an API. Make sure you know your requests (GET, POST, PUT, DELETE, PATCH), as well as the #Python requests library.
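For example, a quick sketch with the requests library — the base URL and endpoints are placeholders, not a real API:

```python
import requests

BASE = "https://api.example.com"  # hypothetical endpoint

# GET: retrieve a resource; query parameters go in `params`.
resp = requests.get(f"{BASE}/matches", params={"season": 2023}, timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx instead of failing silently
matches = resp.json()    # parse the JSON body

# POST: create a resource; JSON payloads go in `json`, not `data`.
requests.post(f"{BASE}/matches", json={"home": "Ajax", "away": "PSV"}, timeout=10)

# PUT replaces, PATCH partially updates, DELETE removes.
requests.put(f"{BASE}/matches/1", json={"home": "Ajax", "away": "PSV", "score": "2-1"}, timeout=10)
requests.patch(f"{BASE}/matches/1", json={"score": "2-1"}, timeout=10)
requests.delete(f"{BASE}/matches/1", timeout=10)
```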
Inner joins vs outer joins. Simple question? Maybe. But when you put #DataScientists on the spot in an interview, feeling pressured to give good answers, more often than not they struggle to answer if they are not joining dataframes on a daily basis.
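If you want to drill the difference, here’s a minimal pandas sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({"player": ["A", "B", "C"], "goals": [3, 1, 2]})
right = pd.DataFrame({"player": ["B", "C", "D"], "assists": [4, 0, 2]})

# Inner join: only keys present in BOTH frames survive -> B, C
inner = left.merge(right, on="player", how="inner")

# Outer join: union of all keys, missing values become NaN -> A, B, C, D
outer = left.merge(right, on="player", how="outer")

print(inner)
print(outer)
```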
How would you evaluate {model x} for {use-case y}? Every #DataScientist can tell you about RMSEs, ROC AUC and accuracy, fewer can tell you about log loss and Brier score, and even fewer manage to properly explain why they chose a given KPI, for a given model, in a given use-case.
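A small sketch with scikit-learn metrics on made-up probabilities, to illustrate why the choice matters:

```python
from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score

y_true = [0, 0, 1, 1]           # made-up labels
y_prob = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities for class 1

# Log loss punishes confident wrong predictions heavily (unbounded above).
print(log_loss(y_true, y_prob))

# Brier score is the mean squared error of the probabilities (bounded in [0, 1]),
# useful when calibration of the probabilities matters for the use-case.
print(brier_score_loss(y_true, y_prob))

# ROC AUC only looks at ranking, not at how well-calibrated the probabilities are.
print(roc_auc_score(y_true, y_prob))
```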
Clustering vs. classification, and I don’t mean “what’s the difference”. Given a use-case where you want to differentiate between n groups, how would you approach it, which ML type would you use, and why? Example: train a model to distinguish between tactical styles in #soccer.
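A hedged sketch of the two routes — the features and labels here are random placeholders, not a real tactical dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 5)  # e.g., per-match tactical features (pressing, width, ...)

# No labelled styles? Unsupervised: let clustering propose n groups.
styles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Labelled styles available (e.g., tagged by analysts)? Supervised classification.
y = np.random.randint(0, 3, size=100)  # placeholder labels
clf = RandomForestClassifier(random_state=0).fit(X, y)
```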
Accuracy or Explainability. There are no wrong answers here, but it’s all about the why. Obviously #DataScientists would prefer both, but what if that’s not possible?
These might seem like simple questions you learn all about in uni. But I’m not looking for textbook answers here. Questions will be wrapped in a practical use-case, and I’m looking for an applied answer. More difficult than you think.
One might say that some of these (SQL, APIs etc.) are not core #DataScience. But I like to ask about more than ML, stats and feature engineering, because those engineering skills are essential when working in a development team. Productizing #DataScience is more than #Python #pythonprogramming
Prepare yourself, not only by knowing the textbook answers. Set up a local database, connect to a public API with #OpenData and write some #Python code to collect, process and store data. Then try to answer some research questions using the data. You just built a pipeline.
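Something like this minimal end-to-end sketch — the API URL and field names are placeholders for whatever #OpenData source you pick:

```python
import sqlite3
import requests

# 1. Collect: pull JSON from a public API (placeholder URL).
resp = requests.get("https://api.example.com/open-data/records", timeout=10)
resp.raise_for_status()
records = resp.json()

# 2. Process: keep only the fields you need (field names are hypothetical).
rows = [(r["id"], r["value"]) for r in records]

# 3. Store: persist to a local database.
con = sqlite3.connect("pipeline.db")
con.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, value REAL)")
con.executemany("INSERT OR REPLACE INTO records VALUES (?, ?)", rows)
con.commit()

# 4. Answer a research question with SQL.
print(con.execute("SELECT AVG(value) FROM records").fetchone())
```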
• • •
Tactical behavior in #Football has a spatial and a temporal component, and results from interaction with the opponent. It’s key to account for all these aspects in data-driven tactical analysis, as well as to respect the complexity of the temporal and spatial dimensions 🧵
Two years ago I published a systematic review in @EurJSportSci on using big data in #soccer for tactical performance analysis that illustrates the associated challenges and provides a data-driven scientific framework. #DataScience tinyurl.com/mrxky6ca
The most common analysis issue is that spatial and/or temporal complexity is not respected, for example by aggregating data over multiple minutes, or by constructing spatial features that collapse 11 player positions into a single variable.
A #DataScientist in a software dev team, writing #pythonprogramming code for production pipelines? You should think carefully about scalability and integration. One of the things to consider is datatypes; here are some helpful tips 🧵
#Python is a dynamically typed language, but that doesn't mean you shouldn't care about types. Know your dtypes, from "str" to "bool" to "int8" to "float64", and understand their memory footprint and restrictions. Especially when working with larger objects, choose wisely.
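A quick way to see the footprint difference for yourself (numbers are approximate):

```python
import numpy as np
import pandas as pd

n = 1_000_000
s64 = pd.Series(np.zeros(n, dtype="int64"))
s8 = pd.Series(np.zeros(n, dtype="int8"))

print(s64.memory_usage(deep=True))  # ~8 MB: 8 bytes per value
print(s8.memory_usage(deep=True))   # ~1 MB: 1 byte per value

# Mind the restrictions: int8 only holds -128..127, so downcast deliberately.
print(np.iinfo("int8").min, np.iinfo("int8").max)
```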
Lose the strings. 9/10 times strings can be replaced by categoricals (Pandas) or, even better, by Enums (docs.python.org/3/library/enum…). This can reduce the memory footprint of large dataframes by more than 30%, and improves performance.
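A sketch of the swap on made-up data; actual savings depend on how repetitive your strings are:

```python
from enum import Enum

import pandas as pd

s = pd.Series(["home", "away", "home"] * 100_000)

# Categorical stores each unique string once, plus small integer codes per row.
cat = s.astype("category")
print(s.memory_usage(deep=True), "->", cat.memory_usage(deep=True))

# Enums give the same idea at the Python level: named constants instead of raw strings.
class Venue(Enum):
    HOME = 1
    AWAY = 2

print(Venue.HOME, Venue["AWAY"])
```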