Women in Statistics and Data Science

28 Sep, 18 tweets, 9 min read

Missing Data, a thread ⬇️
Missing values are everywhere! We have listed more than 150 R packages in cran.r-project.org/web/views/Miss…
So let us give a few pointers:
The method of handling missing data depends on the purpose of the analysis: estimation, completion, prediction, etc.

1) For inference with missing values, that is, estimating a parameter as well as possible and giving a confidence interval, consider likelihood approaches (using EM algorithms) or multiple imputation.

2) Single imputation/matrix completion aims at completing a dataset, i.e. predicting the missing entries as well as possible. Multiple imputation aims at estimating parameters and their variability, taking into account the uncertainty due to missing values.
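To make "estimating parameters and their variability" concrete, here is a minimal sketch of how estimates from several imputed datasets are usually pooled with Rubin's rules. The function name `pool_rubin` and the numbers are hypothetical and purely illustrative; in practice the estimates and variances would come from fitting the same model on each imputed dataset.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their variances with Rubin's rules.

    estimates[i] is the estimate from imputed dataset i and
    within_variances[i] is its estimated variance (squared standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations and one variance per estimate")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(within_variances)   # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical results from m = 5 imputed datasets (illustrative numbers only).
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
within_variances = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled_estimate, total_variance = pool_rubin(estimates, within_variances)
```

The between-imputation term is what single imputation ignores: it carries the extra uncertainty due to the missing values.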

3) Imputing by the mean is the worst thing you can do for inference, but it can be acceptable for prediction with missing data in the covariates (with lots of data, a powerful learner, and train and test sets imputed by the same constant); see arxiv.org/abs/1902.06931
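A small simulation can illustrate why mean imputation hurts inference. This is only a sketch under assumed settings (two standard Gaussian variables with correlation 0.8, half of one variable missing completely at random); the helper names `pearson` and `simulate` are made up for the example.

```python
import random
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate(n=5000, rho=0.8, missing_rate=0.5, seed=0):
    """Compare variance and correlation before and after mean imputation."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]
    # Assumption: values of y are missing completely at random (MCAR).
    observed = [rng.random() >= missing_rate for _ in range(n)]
    fill = mean(yi for yi, seen in zip(y, observed) if seen)
    y_imputed = [yi if seen else fill for yi, seen in zip(y, observed)]
    return {
        "var_full": pvariance(y),
        "var_imputed": pvariance(y_imputed),
        "corr_full": pearson(x, y),
        "corr_imputed": pearson(x, y_imputed),
    }


result = simulate()
```

In this setting the imputed variable has a clearly smaller variance and a weaker correlation with `x` than the complete data, so standard errors and associations computed after mean imputation are distorted.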

4.1) Should I delete a variable with >40% missing values? NO! Many of my clinical colleagues use this rule. It is not only the percentage that counts, but also the structure of the data

4.2) Imagine data in which all variables are perfectly correlated: even with 80% missing data, you can perfectly predict the missing values from the observed ones. On the other hand, with very poorly correlated variables, even a small percentage of missing data can be problematic.
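The point in 4.1 and 4.2 can be sketched with a toy simulation: impute one variable from another by simple linear regression and compare the error under strong and weak correlation. The settings (Gaussian data, MCAR missingness, correlations of 1.0 and 0.1) and the function name `regression_imputation_rmse` are assumptions chosen for illustration.

```python
import random
from statistics import mean


def regression_imputation_rmse(rho, missing_rate, n=2000, seed=0):
    """RMSE of imputing y from x by simple linear regression.

    x and y are standard Gaussian with correlation rho; a fraction
    missing_rate of y is masked completely at random.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    noise_sd = (1 - rho ** 2) ** 0.5
    y = [rho * xi + noise_sd * rng.gauss(0, 1) for xi in x]
    missing = [rng.random() < missing_rate for _ in range(n)]

    # Fit y = intercept + slope * x on the rows where y is observed.
    x_obs = [xi for xi, m in zip(x, missing) if not m]
    y_obs = [yi for yi, m in zip(y, missing) if not m]
    mx, my = mean(x_obs), mean(y_obs)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x_obs, y_obs))
             / sum((a - mx) ** 2 for a in x_obs))
    intercept = my - slope * mx

    # Compare the imputed values with the true (masked) ones.
    squared_errors = [(intercept + slope * xi - yi) ** 2
                      for xi, yi, m in zip(x, y, missing) if m]
    return mean(squared_errors) ** 0.5


rmse_perfect = regression_imputation_rmse(rho=1.0, missing_rate=0.8)
rmse_weak = regression_imputation_rmse(rho=0.1, missing_rate=0.05)
```

With 80% missing but perfect correlation the imputation error is essentially zero, while with 5% missing and weak correlation it stays close to the standard deviation of the variable itself.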

5) Give me an incomplete dataset and I can easily output a completed one (> 100 imputation methods exist). Its properties depend on the imputation model used: impute with a random forest and you cannot extrapolate; impute with linear models and you cannot capture non-linear relationships; etc.

6) What confidence should be given to an analysis performed on incomplete data? Multiple imputation can help you assess the variability. As in any data analysis, you should also consider descriptive statistics.

7) Missing data in causal inference: be careful, results are very sensitive to the identifiability assumptions and to the distribution of missing values. For pipelines in R to estimate the average treatment effect (ATE) with IPW and AIPW, see my former PhD student @imkemay's notebooks on R-miss-tastic.

8) Missing data in linear and logistic regression: the misaem R package implements estimation of the regression coefficients and prediction with missing values. For now it handles only continuous covariates and, for those familiar with the terms, MCAR or MAR data.

9) Missing data in random forests: everyone says that trees handle missing data, but not all methods are equivalent! For prediction we recommend Missing Incorporated in Attributes (MIA), implemented in the R package grf.
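For readers unfamiliar with missing incorporated in attributes, the idea can be sketched on a single regression split: for every threshold, try sending the missing values to the left child and to the right child, and also try splitting on missingness itself, then keep the best option. The code below is a toy illustration with hypothetical names (`sse`, `best_mia_split`) and squared-error loss; it is not how grf implements it.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_mia_split(x, y):
    """Best MIA split of one feature for a regression stump.

    x may contain None for missing values. Returns (loss, threshold, side):
    side is "left" or "right" when missing values join that child of the
    split x <= threshold, or "own" (threshold None) when the split is
    missing versus observed.
    """
    observed = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    y_missing = [yi for xi, yi in zip(x, y) if xi is None]
    y_observed = [yi for _, yi in observed]

    candidates = []
    # Option 1: split on missingness itself.
    if y_missing and y_observed:
        candidates.append((sse(y_missing) + sse(y_observed), None, "own"))
    # Options 2 and 3: ordinary threshold split, missing sent left or right.
    thresholds = sorted({xi for xi, _ in observed})[:-1]
    for t in thresholds:
        left = [yi for xi, yi in observed if xi <= t]
        right = [yi for xi, yi in observed if xi > t]
        candidates.append((sse(left + y_missing) + sse(right), t, "left"))
        candidates.append((sse(left) + sse(right + y_missing), t, "right"))
    if not candidates:
        raise ValueError("no valid split for this feature")
    return min(candidates, key=lambda c: c[0])


# Missing rows behave like large x: they are sent to the right child.
split_a = best_mia_split([1, 2, 3, 4, None, None], [0, 0, 10, 10, 10, 10])
# Missingness itself is informative: the best split is missing vs observed.
split_b = best_mia_split([1, 2, 3, 4, None, None], [5, 5, 5, 5, 0, 0])
```

Because the split itself decides where missing values go, no imputation step is needed before fitting, and informative missingness can be exploited for prediction.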

10) Missing data in PCA: visualisation with the R package missMDA, which also handles categorical variables, groups of variables, etc.

Amazing women developing methods for missing values: @madeleineudell (low-rank methods), @raziehnabi (identifiability with graphical models), Shu Wang (weighting, multiple imputation, fractional imputation), @LeyClem (causal inference), @N_Erler (longitudinal data, Bayesian approaches and amazing teaching resources), Marine Le Morvan (supervised learning, neural nets) and of course so many others.

Citations:
“One of the ironies of big data is that missing values play an even more significant role” (arxiv.org/abs/1906.12125).
“The idea of imputation is both seductive and dangerous” (Dempster and Rubin, 1983).

These were some first pointers: questions I am often asked and methods I use. Feel free to reach out, as we develop tools for users and are glad if they help other scientists analyze their data.

Today I am presenting my project for a joint team, PreMeDICaL (Precision Medicine by Data Integration and Causal Learning), between @inria_sophia and @Inserm (IDESP). We are keeping our fingers crossed. See you tomorrow.

Sorry, typo: @madeleineudell


