Thread by @Heinonmatti on Thread Reader App

Say you want to figure out which beliefs to target in a behaviour change campaign, and as part of the evaluation look at correlations between two self-reports, like beliefs and intentions:

A Tale of Non-linearity 🧵👇

1/

In the process of Confidence-Interval Based Estimation of Relevance (CIBER) you aim to find variables that are both a) correlated with something more "downstream" (such as behaviour or behavioural intentions), and b) changeable (not maxed out already)

2/

ncbi.nlm.nih.gov/pmc/articles/P…
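
As a rough illustration of those two criteria, here is a minimal Python sketch. It is not the CIBER implementation itself: the function names, the toy data and the 95% normal-approximation intervals are all assumptions made for the example, and the Fisher z interval in particular presumes roughly bivariate-normal data.

```python
import numpy as np

def pearson_with_ci(x, y, z_crit=1.96):
    """Pearson r with an approximate CI via the Fisher z-transform."""
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(r)
    half_width = z_crit / np.sqrt(len(x) - 3)
    return r, np.tanh(z - half_width), np.tanh(z + half_width)

def mean_with_ci(x, z_crit=1.96):
    """Sample mean with a normal-approximation CI."""
    x = np.asarray(x, dtype=float)
    half_width = z_crit * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean(), x.mean() - half_width, x.mean() + half_width

# Toy data (hypothetical): a 1-7 belief item and a noisy 1-7 intention item.
rng = np.random.default_rng(1)
belief = rng.integers(1, 8, size=300)
intention = np.clip(belief + rng.integers(-2, 3, size=300), 1, 7)

r, r_lo, r_hi = pearson_with_ci(belief, intention)
m, m_lo, m_hi = mean_with_ci(belief)
print(f"a) association: r = {r:.2f} [{r_lo:.2f}, {r_hi:.2f}]")
print(f"b) room to change: mean = {m:.2f} [{m_lo:.2f}, {m_hi:.2f}] on a 1-7 scale")
```

A belief would look like a candidate when the first interval sits clearly away from zero and the second sits clearly below the top of the scale.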

It's not uncommon to end up with highly skewed distributions. Of course this doesn't always happen, but it does sometimes, even though people try to craft their questions so that the middle answer is the most common and the rest are symmetrically less so.

Real data:

3/

Now, what happens when you compute a correlation between two variables with a disproportionate number of people answering "7" on a scale of 1-7 (i.e. "extremists"), and everyone else answering randomly?

Something @nntaleb called "Dead Man Bias".

Simulation:

4/
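
A setup like this can be sketched in a few lines of Python. This is not the code behind the thread's figure: the sample size, the 30% share of extremists, the uniform "random" answers and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_extremists(n=1000, p_extreme=0.3, seed=0):
    """Two 1-7 items: a share of respondents answers {7, 7}, the rest at random."""
    rng = np.random.default_rng(seed)
    x = rng.integers(1, 8, size=n)          # independent uniform answers, 1..7
    y = rng.integers(1, 8, size=n)
    extreme = rng.random(n) < p_extreme     # who is forced to answer {7, 7}
    x[extreme] = 7
    y[extreme] = 7
    return x, y, extreme

x, y, extreme = simulate_extremists()

# Correlation in the whole sample vs. among the people who answered at random.
r_all = np.corrcoef(x, y)[0, 1]
r_rest = np.corrcoef(x[~extreme], y[~extreme])[0, 1]
print(f"everyone:        r = {r_all:.2f}")
print(f"non-extremists:  r = {r_rest:.2f}")
```

The whole-sample correlation comes out clearly positive even though, by construction, the answers of everyone outside the {7, 7} group carry no relationship at all.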

In the case of the real data presented earlier, the authors ended up choosing the underlined variable, as it was both correlated and changeable. About 30% of people answered 7.

The regression line shows you how well the sample is described by the correlation...

5/

You can see that only the {7, 7} folks are well described by the correlation. Positive correlation is seen in the upward slope of the line. The left panel shows the real data; the right panel shows data where {7, 7} pairs are kept as is and everyone else's answers are shuffled.

6/

The original correlation of 0.31 remains as it is, even if all non-extremists answer randomly!

You can make a naive demonstration by removing all pairs with a 7 (right), or just the {7, 7} extremists (left).

7/
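
The shuffle-and-remove checks can be sketched like this in Python. The real survey data is not reproduced here, so the block runs on a synthetic stand-in (an assumption: roughly 30% forced {7, 7} answers, uniform random answers otherwise), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the survey data (assumption, not the real sample).
n = 1000
x = rng.integers(1, 8, size=n)
y = rng.integers(1, 8, size=n)
forced = rng.random(n) < 0.3
x[forced], y[forced] = 7, 7

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

both_seven = (x == 7) & (y == 7)   # the {7, 7} "extremists"
any_seven = (x == 7) | (y == 7)    # any pair containing a 7

# Keep the {7, 7} pairs as they are; shuffle y among everyone else.
y_shuffled = y.copy()
y_shuffled[~both_seven] = rng.permutation(y[~both_seven])

r_original = pearson(x, y)
r_shuffled = pearson(x, y_shuffled)
r_without_77 = pearson(x[~both_seven], y[~both_seven])
r_without_any_7 = pearson(x[~any_seven], y[~any_seven])

print(f"original:            {r_original:.2f}")
print(f"others shuffled:     {r_shuffled:.2f}")
print(f"{{7, 7}} removed:      {r_without_77:.2f}")
print(f"all 7s removed:      {r_without_any_7:.2f}")
```

In this synthetic setup the first two values should land close to each other and the last two close to zero, which is the pattern the panels describe.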

Maybe it's still an adequate description of the data generation process. Still, correlation doesn't seem the right tool for the job.

8/

In samples of 1000 (left), the effect is clearer than in samples of 250 (right). But information-based measures still outperform correlations. Surprised to see Spearman perform even worse, although I should've believed Nassim.

9/
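
For anyone who wants to try the comparison, here is a minimal Python sketch that computes Pearson, Spearman and mutual information on the same data. Mutual information is used as one example of an information-based measure; the thread does not say which estimator was used, so the plug-in estimator, the helper names and the synthetic data below are assumptions.

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def average_ranks(v):
    """Ranks starting at 1, with ties given their average rank."""
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank in each tie group
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))

def mutual_information(a, b):
    """Plug-in mutual information (in bits) for two discrete variables."""
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1)               # contingency table of counts
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0                           # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))

# Synthetic data (assumption): ~30% {7, 7} answers, everyone else random.
rng = np.random.default_rng(7)
n = 1000
x = rng.integers(1, 8, size=n)
y = rng.integers(1, 8, size=n)
forced = rng.random(n) < 0.3
x[forced], y[forced] = 7, 7
y_independent = rng.permutation(y)              # baseline: all structure destroyed

mi_mixed = mutual_information(x, y)
mi_independent = mutual_information(x, y_independent)
print(f"Pearson:  {pearson(x, y):.2f}")
print(f"Spearman: {spearman(x, y):.2f}")
print(f"MI (bits): {mi_mixed:.3f} vs. shuffled baseline {mi_independent:.3f}")
```

Note that the plug-in estimate is biased upwards in small samples, so comparing against a shuffled baseline is more informative than reading the raw number.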

There's a nice blog post on the topic by @DavidSalazarVir, with #rstats code. Look under "Correlation under non linearities".

10/

david-salazar.github.io/2020/05/22/cor…

As a general note, avoiding skewed distributions with subgroups is a good idea if you need to use linear tools made for homogeneous populations.

But maybe you want to do stuff with diverse types of data 🤷‍♂️

Quick demo based on NNT's recommendation:
