hello #psychology #stats twitter!

a while ago I promised some graphs on why #removing #outliers using a simple cut-off (eg, >2SDs) is a #bad idea

so that I can sleep at night again, here they are

tldr: DON'T (blindly) USE FIXED OUTLIER CUT-OFFS LIKE >2SD. EVER.

1/15
for some reason, it's very common in #psychology to remove 'outliers' from data

most common way: exclude data more than two standard deviations from mean

we spend time & money collecting data, then throw 5% away

🤷

I don't know why, or where it's taught, but there it is
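
(for concreteness, the rule itself is tiny - here's a minimal Octave/Matlab sketch, not my full simulation code, which is linked at the end of the thread:)

```matlab
% the '>2SD' rule, as usually applied: a minimal sketch
x = randn(20, 1);                    % one sample of N=20, M=0, SD=1
z = abs(x - mean(x)) / std(x);       % distance from the SAMPLE mean, in SDs
kept = x(z <= 2);                    % the 'cleaned' data
fprintf('dropped %d of %d points\n', numel(x) - numel(kept), numel(x));
```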

3/15
some arguments in favour are that it:

- 'cleans' the data 🧹
- 'removes noise' 🔊
- 'improves signal to noise' 📶

and these all sound like *good* things to do, right?

4/15
well, not necessarily (see rest of thread)

some justifications for doing it might be:

- everyone does it 👨‍👩‍👧‍👦
- SPSS showed me outliers, so I HAD TO ACT 🚔
- I was taught to do it 🧑‍🎓
- [...insert your justification here...]

5/15
in my first real paper (doi.org/10.1016/j.neul…), I was told by a reviewer (THANK YOU!) that:

"what you're doing is 'fishy'. fixed cut-off outlier elimination biases the data; you should use a principled method, eg, that of van Selst & Jolicoeur (1994)"

6/15
doi.org/10.1080/146407…
since that fateful day, I've been repeating that whenever I can to whichever outlier-excluding author is forced to listen 📢🙉

now, to reassure myself of the truth of this outlier-biasing-data factoid, I've run my own simulations 🧑‍💻

they are enlightening! ⛅️

7/15
long-story short: with a >2SD cut-off, the remaining sample, relative to the *true* population:

- has a lower standard deviation
- has a more 'wandering' mean

& this:
- *increases false-positive differences* between the new sample mean & eg, the true population mean

oops
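
here's a minimal sketch of that check - the exact numbers will wobble from run to run:

```matlab
% sketch: trim >2SD 'outliers' from many N=20 samples & compare with raw
nsim = 10000; n = 20;
m = zeros(nsim, 1); s = zeros(nsim, 1);
for i = 1:nsim
  x = randn(n, 1);                            % true population: M=0, SD=1
  keep = abs(x - mean(x)) / std(x) <= 2;      % the >2SD rule
  m(i) = mean(x(keep));                       % trimmed sample mean
  s(i) = std(x(keep));                        % trimmed sample SD
end
fprintf('mean trimmed SD: %.2f (true SD = 1)\n', mean(s));
fprintf('SD of trimmed means: %.3f (raw would be: %.3f)\n', std(m), 1/sqrt(n));
```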

8/15
this shows simulations of data, M=0, SD=1

half the data has a difference, d, of 0 to 1 added (a 'true effect')

y=how often a p<.05 'result' is found
x=outlier removal criterion
lines=Ns

if outlier removal has no effect, all lines are flat

when d=0, ~10% false-positives w >2SD
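
(for the curious, the core of that simulation is roughly this - a sketch that assumes tcdf from Octave's statistics package / Matlab's stats toolbox:)

```matlab
% sketch of the d=0 case: sweep the cut-off, count p<.05 'results'
% (in Octave: pkg load statistics first, for tcdf)
cuts = 1.5:0.25:3.5; n = 20; nsim = 5000; fp = zeros(size(cuts));
for c = 1:numel(cuts)
  for i = 1:nsim
    x = randn(n, 1);                                  % no true effect
    y = x(abs(x - mean(x)) / std(x) <= cuts(c));      % remove 'outliers'
    t = mean(y) / (std(y) / sqrt(numel(y)));          % one-sample t vs 0
    fp(c) = fp(c) + (2 * (1 - tcdf(abs(t), numel(y) - 1)) < 0.05);
  end
end
plot(cuts, 100 * fp / nsim, '-o');
xlabel('outlier cut-off (SDs)'); ylabel('% of simulations with p<.05');
```
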
too abstract? take 'IQ', which has a nice symmetrical mean=100 & SD=15

- sample 20 people from the population
- remove 'outliers' >2SD from sample mean
- the final mean will differ significantly (lower or higher, p<.05) from the true population mean about 10% of the time

😱
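
in code, the IQ example looks something like this sketch (same tcdf assumption as above):

```matlab
% 'IQ' example: N=20 from Normal(100, 15), t-test vs 100, raw vs trimmed
nsim = 10000; n = 20; fp_raw = 0; fp_trim = 0;
for i = 1:nsim
  x = 100 + 15 * randn(n, 1);
  y = x(abs(x - mean(x)) / std(x) <= 2);          % remove >2SD 'outliers'
  t_raw  = (mean(x) - 100) / (std(x) / sqrt(numel(x)));
  t_trim = (mean(y) - 100) / (std(y) / sqrt(numel(y)));
  fp_raw  = fp_raw  + (2 * (1 - tcdf(abs(t_raw),  numel(x) - 1)) < 0.05);
  fp_trim = fp_trim + (2 * (1 - tcdf(abs(t_trim), numel(y) - 1)) < 0.05);
end
fprintf('false positives: raw %.1f%%, trimmed %.1f%%\n', ...
        100 * fp_raw / nsim, 100 * fp_trim / nsim);
```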

9/15
that's right, by removing 'outliers' >2SD, you've ~DOUBLED the false-positive probability of your sample differing from the true population mean.

doubled. x2

and, you can throw away all subsequent analysis, because the data are now biased.

not good. not 'cleaner' 🧹

10/15
THAT'S JUST FOR SYMMETRICAL DISTRIBUTIONS!

if your distribution is asymmetrical (eg reaction times, percentage correct), outlier removal *also changes the means*

eg, human RTs are positively-skewed, approximately log-normal, so log(RT) gives an approximately normal distribution...

11/15
(in these graphs, the red lines are the means and the blue the medians, showing the skews - positive skew when red is bigger than blue, negative skew when red is smaller than blue)
percentage correct is likely negatively-skewed, with a ceiling effect at 100%, so you could do, eg, a logistic (logit) transform on proportions: log(p/(1-p))
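
both transforms are one-liners (the example values here are made up, just for illustration):

```matlab
% two common 'normalising' transforms (example values are hypothetical)
rt      = [320 450 510 640 1200];      % positively-skewed RTs, in ms
log_rt  = log(rt);                     % ~normal if RTs are ~log-normal

pc      = [0.80 0.90 0.95 0.99];       % proportions correct, ceiling near 1
logit_p = log(pc ./ (1 - pc));         % logistic (logit) transform
```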

if you don't 'normalise' asymmetrical distributions, then removing outliers using fixed cut-offs can push the means up or down

12/15
in quite unpredictable ways!

here's the same graph as before - how many simulations show p<.05 - this time for positively-skewed (simulated) RT data. it's wild!

remember: flat lines are good lines...
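
to see the mean-shifting in isolation, here's a sketch with log-normal 'RTs' - the parameters are my assumption, chosen to roughly match the RT distribution in the appendices below (M~690ms, SD~215ms):

```matlab
% skew demo: with log-normal 'RTs', a fixed cut-off trims mostly the long
% right tail, dragging the mean down (parameters assumed, see above)
rt = exp(6.5 + 0.3 * randn(100000, 1));      % big positively-skewed sample
keep = abs(rt - mean(rt)) / std(rt) <= 2;
fprintf('mean RT before: %.0f ms, after trimming: %.0f ms\n', ...
        mean(rt), mean(rt(keep)));
```
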
WHAT ABOUT *REAL DIFFERENCES* BETWEEN CONDITIONS?

here's where it gets FUN!

assuming a real difference between the first & second half of the data (eg, 2 conditions or groups), then:

removing outliers *decreases the probability of detecting that difference*

not good. *noisier*🔊

13/15
that's right - if there is a real difference in your data, and you remove outliers by pooling across the two conditions or groups, then it doesn't 'clean' your data at all, it makes it 'dirtier'!

😱

#StatsShocker
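
a sketch of that simulation (ttest2 is from the statistics package; d=0.8 is just an example effect size, my assumption):

```matlab
% real difference d between 2 groups; 'outliers' removed from POOLED data
% (in Octave: pkg load statistics first, for ttest2)
nsim = 5000; n = 20; d = 0.8; hit_raw = 0; hit_trim = 0;
for i = 1:nsim
  a = randn(n, 1);  b = d + randn(n, 1);       % a real difference of d SDs
  pool = [a; b];
  keep = abs(pool - mean(pool)) / std(pool) <= 2;
  a2 = a(keep(1:n));  b2 = b(keep(n+1:end));   % trimmed groups
  [~, p_raw]  = ttest2(a, b);
  [~, p_trim] = ttest2(a2, b2);
  hit_raw  = hit_raw  + (p_raw  < 0.05);
  hit_trim = hit_trim + (p_trim < 0.05);
end
fprintf('detected the real difference: raw %.1f%%, trimmed %.1f%%\n', ...
        100 * hit_raw / nsim, 100 * hit_trim / nsim);
```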

14/15
Conclusions:

1. don't remove outliers using >2SD from mean. EVER

2. if you MUST* remove 'outliers', then >3SD seems much less biasing**, so use that?

* eg supervisor/reviewer/book/lecturer forces you

** so much less that it's probably not worth doing at all. so maybe don't?

15/15
and don't just take my word for it, here's a true god of reaction times, Jeff Miller, who said, 30 years ago, that outlier removal is:

"very dangerous" 🐯

16/15

psycnet.apa.org/doi/10.1080/14…
there may be other good reasons to remove data:

- the experiment didn't work
- the person didn't understand the task
- the person performed at chance/floor/ceiling
- [...insert interesting cases...]

on these interesting cases, I have no comments at this time.

questions?

17/15
the Matlab/Octave code I used is here:
neurobiography.info/projects/outli…

I'll be happy to make any changes or clarifications or retractions when (proper) statisticians tell me I've got it wrong :-)
a post-hoc addition:

this is NOT a call to just switch from 2 to 3 SD & all is fine!

see various other threads & comments & papers for more nuance.

eg here's a paper in PNAS which removed outliers >3SD in so many different ways it's physically painful!

doi.org/10.1073/pnas.2…
if you liked this, you'll love my podcast 😁

theerrorbar.com
hello #stats #nerds!

after 24 hours, many comments & a long walk, some appendices:

1. these data only apply to between-participants considerations. an extra/different layer would be to do/not do outlier removal first on each individual participant's data, then look at effects on the group...
2. When I talked about 'false positives', it was about how your sample mean may or may not reflect the (true) population mean; eg, for IQ, after removing >2SD outliers, you would conclude, ~10% of the time, that the population mean is NOT 100 (when it really is). it should be 5%!
3. the graphs I showed were only for the 'raw' data and the 'summary' data - I missed out perhaps the most important, intermediate graphs 🤦 - showing what each individual distribution looks like.

So, the next 3 graphs show some data BEFORE & AFTER removing >2SD outliers...
3A. Normal distribution, M=0, SD=1, N=20 per sample, 10K samples

from top: Means, SDs, t-scores, p-values
red line=expected (mean) value

see how the p-values become more likely 'significant' (at both lower and upper tails) 👀
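
(if you want to redraw graph 3A yourself, something like this sketch gets close:)

```matlab
% sketch of graph 3A: means, SDs, t & p across 10K trimmed N=20 samples
nsim = 10000; n = 20;
M = zeros(nsim,1); S = zeros(nsim,1); T = zeros(nsim,1); P = zeros(nsim,1);
for i = 1:nsim
  x = randn(n, 1);                                % Normal, M=0, SD=1
  y = x(abs(x - mean(x)) / std(x) <= 2);          % remove >2SD 'outliers'
  M(i) = mean(y);  S(i) = std(y);
  T(i) = M(i) / (S(i) / sqrt(numel(y)));
  P(i) = 2 * (1 - tcdf(abs(T(i)), numel(y) - 1));
end
subplot(4,1,1); hist(M, 50); title('means');
subplot(4,1,2); hist(S, 50); title('SDs');
subplot(4,1,3); hist(T, 50); title('t-scores');
subplot(4,1,4); hist(P, 50); title('p-values');
```
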
3B. log-normal 'reaction time' distribution, M=690ms, SD=215ms, N=20 per sample, 10K samples

Same effects as in purely-normal data.

Note that even before removing any outliers, the distribution of means is not quite Normal, and removing 'outliers' just makes it worse...
3C. logistic 'proportion correct' distribution, M=.933, SD=.064, N=20 per sample, 10K samples

as above - non-normal distribution of means to start with, made much worse by removing 'outliers'!
(I also found a few relatively small errors / bugs in my code, so that's a bit cleaner now - same link as above)

[that's all folks]

I'm on a twitter and actual holiday now for two weeks - bye!

🚗+🏞️=😎
