How clean is too clean?
Presenting...Molecular Cross-Validation
biorxiv.org/content/10.110…
How big do you make the neighborhood? How long do you diffuse?
But why 20 PCs? Why not 3? Or 50?
![](https://pbs.twimg.com/media/EFxSCAJVUAABer_.jpg)
Parameters matter here, and it’s hard to tell how tweaking things like learning rate, bottleneck width, or the random seed will affect the denoised output.
![](https://pbs.twimg.com/media/EFxSGfPUUAIapGW.png)
In classic cross-validation, you would split the cells into two groups, fit the model on the first group (training), and evaluate its accuracy on the second group (validation).
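A minimal sketch of that classic setup, assuming a toy cells-by-genes count matrix `X` and PCA as the model (the variable names and the 100/100 split are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 50)).astype(float)  # toy cells-by-genes counts

# Classic CV: split the *cells* into training and validation groups.
idx = rng.permutation(X.shape[0])
train, valid = X[idx[:100]], X[idx[100:]]

# Fit PCA (via SVD) on the training cells only...
mu = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mu, full_matrices=False)

def reconstruction_error(data, k):
    """Project onto the top-k training PCs and measure squared error."""
    V = Vt[:k].T
    recon = (data - mu) @ V @ V.T + mu
    return ((data - recon) ** 2).mean()

# ...and evaluate accuracy on the held-out cells.
err = reconstruction_error(valid, k=20)
```

The catch for denoising: the held-out cells are still noisy counts, so scoring a model against them is not the same as scoring it against the true expression.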
![](https://pbs.twimg.com/media/EFxSjASUYAEvE0_.jpg)
![](https://pbs.twimg.com/media/EFxSqyTU4AAGzNk.jpg)
The proof is 8 lines; check out the methods section if you’re interested.
![](https://pbs.twimg.com/media/EFxSwl1U8AAFQZq.jpg)
MCV lets you calibrate any denoising model (we do PCA, diffusion, and a deep autoencoder) and pick the best one for your dataset.
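The core move can be sketched in a few lines: split each cell's *molecules* (not the cells) binomially into two disjoint halves, denoise one half, and score against the other. This toy version uses PCA and ignores the count-depth rescaling derived in the paper's methods; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(300, 100)).astype(int)  # toy counts

# Molecular split: each molecule lands in one half or the other,
# so X_train + X_valid == X exactly.
X_train = rng.binomial(X, 0.5).astype(float)
X_valid = X - X_train

# Calibrate PCA rank: denoise the training half, compare to the
# validation half, and keep the k with the lowest held-out loss.
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)

def mcv_loss(k):
    V = Vt[:k].T
    denoised = (X_train - mu) @ V @ V.T + mu
    return ((denoised - X_valid) ** 2).mean()

best_k = min(range(1, 50), key=mcv_loss)
```

Because the two halves carry independent noise, the held-out loss is minimized near the right amount of smoothing instead of always rewarding more capacity.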
![](https://pbs.twimg.com/media/EFxS17aVUAAqX80.jpg)
![](https://pbs.twimg.com/media/EFxTNNeU8AAXwBu.png)
If you make software that does denoising, we’d love to see MCV built into it. Looking at you @satijalab, @theislab.
It was fun to watch this story fill out and mature from the sketch in #Noise2Self, and to see how far the pixel:molecule analogy could take us.
arxiv.org/abs/1901.11365