“Data don’t lie.” But working with data typically requires defining #research questions, hypotheses and methodology, plus interpretation and #dataviz, and each of those steps can introduce subjectivity and #bias. Scientific rigor and objectivity are key in #DataScience. Some #Tips for #DataScientists 🧵
Don’t dive straight into a dataset: domain knowledge is critical. Good #Science requires a theoretical understanding of a topic, while #ignorance introduces bias. Sound domain knowledge enables you to ask the right questions and give relevant answers with #DataScience
Investigate the alternative hypothesis. Business questions asked of #DataScientists are often directive, as there already is a hypothesis in mind. Don’t confirm that hypothesis without properly investigating the alternative.
This is linked to confirmation bias. A dataset often contains so much information that we’re prone to focus on the parts confirming our initial hypothesis, overlooking other results. Don’t fall for this trap and conduct thorough #Analytics and #Science.
Try to identify the question behind the question. People tend to ask relatively narrow questions that are hard to generalize, because they don’t know what answers #DataScience could provide. Such questions are specific, but usually just examples of an overarching question.
Don’t make this a thought experiment though. Wandering off on your own induces bias. Instead, close the communication gap with the business: discuss questions in depth, make #DataScience understandable, and try to really understand the business problems.
With every decision you make, think about how it affects your results: the data you include or exclude, preprocessing steps, the way features are constructed. You make dozens of decisions before training a single model, all potentially impacting results & explainability.
Watch out for the statistical significance trap. In #BigData, significance is easy to find. Look beyond the p-values though, as it’s really about effect sizes and clinical/practical relevance. Know your statistical tests and the related effect size measures.
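A minimal sketch of that point on made-up data: with a large enough sample, even a negligible difference comes out “significant”, while an effect size (here Cohen’s d, one of several options) shows it is practically irrelevant. The group means and sizes below are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.00, 1.0, 100_000)   # group A
b = rng.normal(0.02, 1.0, 100_000)   # group B: negligible true difference

t, p = stats.ttest_ind(a, b)

# Cohen's d: mean difference scaled by the pooled standard deviation
# (simple equal-sample-size version)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value:   {p:.4f}")   # likely < 0.05, driven purely by sample size
print(f"Cohen's d: {d:.3f}")   # ~0.02: a practically irrelevant effect
```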
Make it stupid simple when you present data. Complex terminology and fancy #Datavisualization are great, and people will often trust you at first because it makes you look smart. This won’t last long however, as they will lose interest within minutes if they don’t understand you.
Also, it’s easy to hide bias and ill-designed methodology behind complexity, as it does not invite easy criticism. Making it simple allows people to debate your approach. It makes you vulnerable, but people will respect and reward you for it, and it improves your #research
In the end, it’s about 3 things: awareness, thoroughness and simplicity. Awareness that every decision taken in the process can introduce subjective bias, thoroughness in domain understanding & #research, and presenting results simply to prevent misinterpretation.
Tactical behavior in #Football has a spatial and a temporal component, and results from interaction with the opponent. It’s key to account for all these aspects in data-driven tactical analysis, as well as to respect the complexity of the temporal and spatial dimensions 🧵
Two years ago I published a systematic review in @EurJSportSci on using big data in #soccer for tactical performance analysis that illustrates the associated challenges and provides a data-driven scientific framework. #DataScience tinyurl.com/mrxky6ca
The most common analysis issue is that spatial and/or temporal complexity is not respected, for example by aggregating data over multiple minutes, or by constructing spatial features that collapse 11 player positions into a single variable. A sketch of the contrast is below.
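A minimal sketch of that contrast, assuming a hypothetical long-format tracking table (one row per player per frame, with columns frame, player_id, x, y) sampled at an assumed 25 Hz; the file name, column names and sampling rate are illustrative, not taken from the review.

```python
import pandas as pd

tracking = pd.read_csv("tracking.csv")  # hypothetical tracking export

# The pattern the review warns against: collapsing time and space at once,
# e.g. one mean team position per 5-minute block
frames_per_block = 25 * 60 * 5  # assumed 25 Hz sampling rate
coarse = (tracking
          .assign(block=tracking["frame"] // frames_per_block)
          .groupby("block")[["x", "y"]].mean())

# Respecting the temporal dimension instead: keep one observation per frame,
# with every player's position still available for spatial features
per_frame = tracking.pivot(index="frame", columns="player_id", values=["x", "y"])
```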
Preparing for a technical interview for a #DataScience position? These are some of the questions that typically allow me as an interviewer to quickly distinguish between juniors and mediors, including some quick tips 🧵. #Python #pythonprogramming #DataScientist #Jobs
Anything about SQL. Not the hardest thing to learn, but many #DataScientists only start to see the value of SQL when they actually become part of a dev team. I’m not only talking about SELECT * FROM table, but also about joins, truncates, partitions and constraints.
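As a sketch of the kind of SQL being probed, here is a join plus some constraints, run through Python’s built-in sqlite3 module; the tables, columns and values are made up for illustration (truncates and partitions are engine-specific and not shown here).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teams (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL                       -- constraint: no missing team names
    );
    CREATE TABLE players (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        team_id INTEGER REFERENCES teams(id)     -- foreign-key constraint
    );
    INSERT INTO teams   VALUES (1, 'Ajax');
    INSERT INTO players VALUES (1, 'Player A', 1);
""")

# A join instead of SELECT * FROM a single table
rows = conn.execute("""
    SELECT p.name, t.name
    FROM players AS p
    JOIN teams AS t ON t.id = p.team_id
""").fetchall()
print(rows)  # [('Player A', 'Ajax')]
```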
Interacting with an API. Make sure you know your HTTP methods (GET, POST, PUT, DELETE, PATCH), as well as the #Python requests library.
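A minimal sketch with the requests library; the base URL, endpoints and payloads are hypothetical.

```python
import requests

BASE = "https://api.example.com"  # hypothetical API

# GET with query parameters
resp = requests.get(f"{BASE}/matches", params={"season": 2023}, timeout=10)
resp.raise_for_status()           # fail loudly on 4xx/5xx responses
matches = resp.json()

# POST with a JSON body
resp = requests.post(f"{BASE}/matches", json={"home": "A", "away": "B"}, timeout=10)

# PUT, PATCH and DELETE follow the same pattern
requests.delete(f"{BASE}/matches/1", timeout=10)
```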
#DataScientist in a software dev team, writing #pythonprogramming code for production pipelines? You should think carefully about scalability and integration. One of the things to consider is datatypes; here are some helpful tips 🧵
#Python is a dynamically typed language, but that doesn't mean you shouldn't care about types. Know your dtypes, from "str" to "bool" to "int8" to "float64", and understand their memory footprint and restrictions. Especially when working with larger objects, choose wisely.
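A quick sketch of the footprint difference, assuming a column of small integers stored as NumPy/pandas dtypes; the array size is arbitrary.

```python
import numpy as np
import pandas as pd

n = 1_000_000
as_float64 = pd.Series(np.zeros(n, dtype="float64"))
as_int8 = pd.Series(np.zeros(n, dtype="int8"))

print(as_float64.memory_usage(deep=True))  # ~8 MB: 8 bytes per value
print(as_int8.memory_usage(deep=True))     # ~1 MB: 1 byte per value

# The restriction side of the trade-off: int8 only holds -128..127,
# float64 covers a huge range, so choose per column based on what it must represent
```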
Lose the strings. 9/10 times strings can be replaced by categoricals (Pandas) or even better by Enums (docs.python.org/3/library/enum…). This can reduce the memory footprint of large dataframes by >30% and improve performance.
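A minimal sketch of both options; the position labels and series size are made up for illustration.

```python
from enum import Enum

import pandas as pd

class Position(Enum):
    GK = "goalkeeper"
    DF = "defender"
    MF = "midfielder"
    FW = "forward"

positions = pd.Series(["defender", "midfielder", "forward"] * 100_000)
as_category = positions.astype("category")

print(positions.memory_usage(deep=True))    # object dtype: every string stored
print(as_category.memory_usage(deep=True))  # small integer codes + a few categories

# Enums give typo-safe constants in application code
print(Position.DF.value)  # "defender"
```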
Yesterday I shared a small thread about getting into #DataScience. Today I’ll build on that and share a bit about my own journey into sports analytics, specifically as a #DataScientist in the #football industry. 🧵
My path began with an MSc in Sport & Movement Science @VU_FBW. It’s not computer science or anything, but it does involve quite a bit of #Math, #Statistics and #Physics, as well as a course in programming. Mainly it taught me Science, and gave me a lot of domain knowledge in sports.
I wasn’t planning to become a #DataScientist, but I wanted to work in sports. I did various stints as an embedded sports scientist, mostly internships/part-time, before joining @ZZLEIDENBASKETB. Those jobs involved data & science, but it wasn’t anything close to #DataScience.
Many young people ask me how they can become a #DataScientist, specifically in #football. Lately I have also seen a lot of posts on how to get into #DataScience in (1)50 days or so, which is a joke imo. Here is my realistic take on it. Warning: it will be closer to 1500 days. 🧵
#DataScience is an umbrella of roles & fields that require different competencies. But they all have two things in common: you have to know #Science and you have to be able to work with #data. The first requires learning to do research, the second learning to do #programming.
Go to uni and get a master’s degree that at least requires some #math skills. I’m not saying you need a #PhD and 5 publications before calling yourself a #DataScientist, nor that you can’t be one without an MSc, but it helps a lot in acquiring the right competencies.