#DataScientist in a software dev team, writing #pythonprogramming code for production pipelines? Then you should think carefully about scalability and integration. One of the things to consider is datatypes — here are some helpful tips 🧵
#Python is a dynamically typed language, but that doesn't mean you shouldn't care about types. Know your dtypes, from "str" to "bool" to "int8" to "float64", and understand their memory footprint and restrictions. Especially when working with larger objects, choose wisely.
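A minimal sketch of the footprint difference, using a made-up array of one million values that happen to fit in 8 bits:

```python
import numpy as np

# Hypothetical data: one million readings, all within the int8 range (-128..127)
n = 1_000_000
as_float64 = np.ones(n, dtype="float64")
as_int8 = np.ones(n, dtype="int8")

print(as_float64.nbytes)  # 8000000 bytes
print(as_int8.nbytes)     # 1000000 bytes — 8x smaller
```

The catch is the "restrictions" part: int8 silently caps at 127, so the narrower dtype is only safe when you know the value range.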
Lose the strings. 9/10 times strings can be replaced by categoricals (Pandas) or, even better, by Enums (docs.python.org/3/library/enum…). This can reduce the memory footprint of large dataframes by >30%, and improves performance.
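A quick sketch with made-up label data, showing both options side by side:

```python
import pandas as pd
from enum import Enum

# Hypothetical column of repeated string labels
labels = pd.Series(["red", "green", "blue"] * 100_000)
as_category = labels.astype("category")

# The categorical stores each distinct string once plus small integer codes
print(labels.memory_usage(deep=True) > as_category.memory_usage(deep=True))  # True

# The Enum variant: fixed, typo-proof set of allowed values
class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3
```

The Enum buys you something the categorical doesn't: `Color.REd` fails loudly at lookup time, whereas a typo'd string just becomes another category.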
Know how #Python dtypes map to types in other languages, even between different packages ("int64" in NumPy vs "int" in Python). As a #DataScientist your output will go to #SQL databases, @DeltaLakeOSS tables or APIs. Whereas Python is dynamically typed, the integrations are not.
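The sharpest edge of that mapping: Python's built-in int is arbitrary precision, while NumPy's int64 is a fixed-width C integer. A small sketch:

```python
import numpy as np

py_int = 10**30                      # fine: Python ints have no fixed width
arr = np.array([1, 2, 3], dtype="int64")

print(np.iinfo(arr.dtype).max)       # 9223372036854775807 (2**63 - 1)
# np.array([10**30], dtype="int64") would raise OverflowError:
# the value simply doesn't fit in a fixed-width 64-bit integer.
```

The same ceiling exists in the SQL BIGINT column or API field downstream, which is exactly why the integrations care even though Python itself doesn't.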
Think about missing data. Missing data can implicitly transform int columns to float columns in a dataframe, which will give you problems when creating the output (see above). Also, consider whether your output can be "nullable" (i.e. contain missing data).
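The silent upcast in two lines, plus pandas' nullable "Int64" dtype as the way out (sketch with made-up values):

```python
import pandas as pd

ints_with_gap = pd.Series([1, 2, None])
print(ints_with_gap.dtype)           # float64 — the NaN upcast the ints

nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)                # Int64 — integers preserved, gap is <NA>
```

Note the capital I: "Int64" is pandas' nullable extension dtype, distinct from NumPy's "int64", and it's what keeps a 1 from arriving downstream as a 1.0.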
To prevent issues with your output, enforce a schema. The schema package (pypi.org/project/schema/), for example, provides excellent tools when outputting a JSON object to, say, an API.
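A dependency-free sketch of the idea behind the schema package — declare expected types once, check the record before it leaves your pipeline (the field names and values here are made up):

```python
import json

# Hypothetical contract for an output record destined for a JSON API
EXPECTED = {"id": int, "score": float, "label": str}

def validate(record: dict) -> dict:
    """Raise if any field is missing or has the wrong type."""
    for key, expected_type in EXPECTED.items():
        if not isinstance(record[key], expected_type):
            raise TypeError(f"{key!r} should be {expected_type.__name__}")
    return record

payload = json.dumps(validate({"id": 7, "score": 0.93, "label": "churn"}))
print(payload)
```

The schema package generalizes this with nested schemas, coercion and optional keys; the point is the same — fail loudly at your boundary, not in the consumer's.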
Train yourself to consistently use type hints. Add type annotations to, for example, your functions using the typing module. Now, #Python remains dynamically typed, and this won't change that. It does force you to think about typing more carefully though, and your IDE will do so too.
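A minimal sketch of an annotated function (the function itself is made up):

```python
from typing import Sequence

def mean_score(scores: Sequence[float]) -> float:
    # The hints don't change runtime behaviour, but mypy or your IDE
    # will flag a call like mean_score("abc") before it ever runs.
    return sum(scores) / len(scores)

print(mean_score([0.5, 1.0, 1.5]))  # 1.0
```

Run mypy over code like this in CI and the "think about typing" habit becomes automatic rather than optional.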
Finally, these are all small code changes. Doing it right will optimize performance, but don't expect anything drastic. What it will really do is ensure your code can be integrated with downstream dependencies, and not break anything because a 1 turned into a 1.0. #DataScience
Tactical behavior in #Football has a spatial and a temporal component, and results from interaction with the opponent. It’s key to account for all these aspects in data-driven tactical analysis, as well as to respect the complexity of the temporal and spatial dimensions 🧵
Two years ago I published a systematic review in @EurJSportSci on using big data in #soccer for tactical performance analysis that illustrates the associated challenges and provides a data-driven scientific framework. #DataScience tinyurl.com/mrxky6ca
The most common analysis issue is that spatial and/or temporal complexity is not respected — for example by aggregating data over multiple minutes, or by constructing spatial features that collapse 11 player positions into a single variable.
Preparing for a technical interview for a #DataScience position? These are some of the questions that typically allow me as an interviewer to quickly distinguish between juniors and mediors, including some quick tips 🧵. #Python #pythonprogramming #DataScientist #Jobs
All questions about SQL. Not the hardest thing to learn, but many #DataScientists only start to learn the value of SQL when they actually become part of a dev team. I’m not only talking about SELECT * FROM table, but also about joins, truncates, partitions and constraints.
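A self-contained sketch of the join territory those questions probe, using Python's built-in sqlite3 and made-up users/orders tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
con.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 5.0)])

# LEFT JOIN keeps users without orders — the classic junior/medior tell
rows = con.execute("""
    SELECT u.name, COUNT(o.amount) AS n_orders
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()
print(rows)  # [('ada', 2), ('bob', 0)]
```

Knowing why an INNER JOIN here would silently drop bob — and what a NOT NULL constraint or a partition buys you — is exactly the depth the interview is after.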
Interacting with an API. Make sure you know your requests (GET, POST, PUT, DELETE, PATCH), as well as the #Python requests library.
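A dependency-free sketch of the verbs using the stdlib's urllib (the requests library wraps the same idea as requests.get / .post / .put / .delete / .patch); the endpoint and payload here are hypothetical, and nothing is actually sent:

```python
import json
from urllib.request import Request

payload = json.dumps({"name": "widget"}).encode()
req = Request(
    "https://api.example.com/items",  # hypothetical endpoint
    data=payload,
    method="POST",                    # or GET, PUT, DELETE, PATCH
    headers={"Content-Type": "application/json"},
)
print(req.get_method())  # POST
```

Being able to say when PUT (full replace, idempotent) beats PATCH (partial update) is the kind of answer that separates "used an API once" from production experience.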