So what's up with the Russian election two weeks ago? Was there fraud?
Of course there was fraud. Widespread ballot stuffing was caught on video, among other things. But we can also prove the fraud using statistics.
See these *integer peaks* in the histograms of the polling station results? 🕵️‍♂️ [1/n]
These peaks are formed by polling stations that report an integer turnout percentage or an integer United Russia percentage. E.g. 1492 ballots cast at a station with 1755 registered voters: 1492/1755 = 85.0%. Important: 1492 is not a suspicious number! It's the 85.0% that is suspicious. [2/n]
We can use binomial Monte Carlo simulation to find how many polling stations with integer percentages there should be by chance. Then we can compute the number of EXCESS integer polling stations (roughly the summed heights of all INTEGER PEAKS).
The resulting excess is 1300 stations. [3/n]
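A minimal sketch of this excess computation on synthetic, fraud-free data, where the excess should come out near zero. The station sizes, turnouts, and the 0.05-point tolerance for "integer percentage" are hypothetical stand-ins; the real analysis in the linked notebook is more careful about rounding precision:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_integer_pct(k, n, tol=0.05):
    # True where 100*k/n is within tol of a whole number of percent
    pct = 100 * k / n
    return np.abs(pct - np.round(pct)) < tol

def expected_integer_stations(n_voters, p, n_sims=1_000):
    """Monte Carlo estimate of how many stations would show an integer
    turnout percentage by chance, simulating the ballots cast at each
    station as Binomial(registered voters, turnout probability)."""
    counts = [is_integer_pct(rng.binomial(n_voters, p), n_voters).sum()
              for _ in range(n_sims)]
    return np.mean(counts)

# Hypothetical fraud-free toy data: 10,000 polling stations
n_voters = rng.integers(500, 3000, size=10_000)
turnout = rng.uniform(0.3, 0.9, size=10_000)
ballots = rng.binomial(n_voters, turnout)

observed = is_integer_pct(ballots, n_voters).sum()
expected = expected_integer_stations(n_voters, turnout)
excess = observed - expected  # near 0 here; ~1300 in the real 2021 data
```

On real election data one would plug in the reported ballots and registered voters per station instead of the simulated `ballots`; any large positive excess is then unexplainable by chance.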
1300 clearly fraudulent stations is a lot! But it's fewer than in previous years, especially 2020 (the constitutional referendum). [4/n]
Does this mean there was less fraud this time? Not at all! But it seems the fraud was done less stupidly.
Here is a 2D scatter plot of turnout vs. United Russia result. This suggests the actual result was ~30%, possibly a few % more, instead of the official 49.8%. [5/n]
Here is how this "comet" compares to the previous federal elections over the Putin era.
In terms of how many % points were added to the leader's result during counting, this election may actually have been the worst ever (but it's a close call with 2011). [6/n]
See our series of papers (with Sergey Shpilkin and @MPchenitchnikov) regarding the methodology of integer peak calculations:
Here is an example of how stupidly it sometimes _was_ done. This entire 2D integer peak, with 75.0% turnout and 75.0% United Russia result (back in 2011), was due to a single city: Sterlitamak (in Bashkortostan). Obviously they did not even count the ballots. [8/n]
You can find all the data (in CSV) and my analysis code (as a Python notebook) at github.com/dkobak/electio…. The data have been scraped by Sergey Shpilkin. [9/n]
Scraping the data was much more difficult this time, because it was deliberately obfuscated (see below). Of course eventually people wrote several de-obfuscators, e.g. see this very detailed write-up by Alexander Shpilkin: purl.org/cikrf/un/unfuc…. [10/10]
Update: here is my new favourite plot on this topic. I pooled the data from all 11 federal elections from 2000 to 2021 and made a scatter plot of all 1+ million polling stations together. Just look at the periodic integer pattern in the top-right (i.e. fraudulent) corner! [11/10]
Chari et al. (@lpachter) have updated their preprint and doubled down on their claim that an 🐘-looking embedding, a random (!) embedding, and 2D PCA, all preserve data structure "similar or better" than t-SNE.
They literally say: "Picasso can quantitatively represent [local and global properties] similarly to, or better, than the respective t-SNE/UMAP embeddings".
In my thread below I argued that this conclusion does not follow from their Fig 2, because the metrics used there are insufficient. [2/n]
I argued that they should also consider metrics like kNN recall or kNN classification accuracy, where t-SNE would fare much better than these other methods.
I thought it should be obvious from this figure (using MNIST). But now @lpachter says it's a "mirage".
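To make the two metrics concrete, here is a minimal sketch of kNN recall and kNN classification accuracy (my hypothetical implementation, not any paper's exact code), comparing a 2D PCA embedding of synthetic clustered data against a purely random embedding:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def knn_recall(X_high, X_low, k=10):
    """Fraction of each point's k nearest neighbors in the original
    space that are also among its k neighbors in the embedding."""
    idx_high = NearestNeighbors(n_neighbors=k).fit(X_high).kneighbors(return_distance=False)
    idx_low = NearestNeighbors(n_neighbors=k).fit(X_low).kneighbors(return_distance=False)
    overlap = [len(set(a) & set(b)) for a, b in zip(idx_high, idx_low)]
    return np.mean(overlap) / k

def knn_accuracy(X_low, labels, k=10):
    """Cross-validated kNN classification accuracy in the embedding."""
    return cross_val_score(KNeighborsClassifier(n_neighbors=k), X_low, labels, cv=5).mean()

# Toy demo: clustered 10D data, PCA embedding vs. random embedding
X, y = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)
emb_pca = PCA(n_components=2).fit_transform(X)
emb_rand = np.random.default_rng(0).normal(size=(len(X), 2))

recall_pca = knn_recall(X, emb_pca)
recall_rand = knn_recall(X, emb_rand)   # roughly k/(n-1) by chance
acc_pca = knn_accuracy(emb_pca, y)
```

A random embedding scores at chance level on both metrics, which is exactly why they discriminate between methods where purely distance-based metrics may not.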
I am late to the party (was on holidays), but have now read @lpachter's "Specious Art" paper as well as ~300 quote tweets/threads, played with the code, and can add my two cents.
Spoiler: I disagree with their conclusions. Some claims re t-SNE/UMAP are misleading. Thread. 🐘
The paper has several parts and I have too many comments for a twitter thread, so here I will only focus on the core of the authors' argument against t-SNE/UMAP, namely Figures 2 and 3. We can discuss the rest some other time. [2/n]
In this part, Chari et al. claim that:
* t-SNE/UMAP preserve global and local structure very poorly;
* A purposefully silly embedding that looks like an elephant performs as well or even better;
* Even an *untrained* neural network performs about as well.
PHATE finds the same 4/7/9 and 8/5/3 mega-clusters that are also emphasized by UMAP, but fails to separate some of the digits within mega-clusters, e.g. green & red (3 and 5) overlap a lot.
IMHO that's a clearly worse performance than t-SNE or UMAP. [2/7]
Of course PHATE was designed for continuous data, and that's where it's supposed to shine. But the original paper, and tweets like this one and the one above, make it look as if it hands-down outperforms t-SNE/UMAP on clustered data as well.
In a new paper with @JanLause & @CellTypist we argue that the best approach for normalization of UMI counts is *analytic Pearson residuals*, using an NB model with an offset term for sequencing depth. We also analyze the related 2019 papers by @satijalab and @rafalab. /1
Our project began when we looked at Fig 2 in Hafemeister & Satija 2019 (genomebiology.biomedcentral.com/articles/10.11…), who suggested using NB regression (w/ smoothed params), and wondered:
1) Why does smoothed β_0 grow linearly? 2) Why is smoothed β_1 ≈ 2.3?? 3) Why does smoothed θ grow too??? /2
The original paper does not answer any of that.
Jan figured out that: (1) is trivially true when assuming UMI ~ NB(p_gene * n_cell); (2) simply follows from HS2019 parametrization & the magic constant is 2.3=ln(10); (3) is due to bias in estimation of overdispersion param θ! /3
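Point (2) can be verified in a few lines: if the expected count is mu = p_gene * n_cell, then the HS2019-style regression of log(mu) on log10(n_cell) has slope exactly ln(10), since log(p*n) = log(p) + ln(10)*log10(n). A toy check (the depths and the expression fraction are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cell = rng.integers(1_000, 50_000, size=500)  # sequencing depth per cell
p_gene = 1e-4                                   # hypothetical expression fraction
mu = p_gene * n_cell                            # expected UMI count per cell

# Regress log(mu) on log10(n_cell), as in the HS2019 parametrization:
# the fitted slope is the "magic constant" ln(10) ~= 2.3026
beta1, beta0 = np.polyfit(np.log10(n_cell), np.log(mu), 1)
print(round(beta1, 4))  # 2.3026
```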
The input here is a 1,000,000 x 78,628 matrix X with X_ij = 1 if integer i is divisible by the j'th prime number, and 0 otherwise. So columns correspond to 2, 3, 5, 7, 11, etc. The matrix is large but very sparse: only 0.0036% of entries are 1s. We'll use cosine similarity. [3/n]
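A sketch of how such a divisibility matrix can be built as a scipy sparse matrix; to keep it quick I use integers up to 10,000 instead of 1,000,000 (the full version is the same code with a larger `n_max`):

```python
import numpy as np
from scipy import sparse

def divisibility_matrix(n_max):
    """Sparse binary matrix X of shape (n_max, n_primes) with
    X[i-1, j] = 1 iff integer i is divisible by the j-th prime."""
    # Sieve of Eratosthenes to get all primes <= n_max
    is_prime = np.ones(n_max + 1, dtype=bool)
    is_prime[:2] = False
    for p in range(2, int(n_max**0.5) + 1):
        if is_prime[p]:
            is_prime[p * p :: p] = False
    primes = np.flatnonzero(is_prime)

    # For each prime p, its multiples p, 2p, 3p, ... give the 1-entries
    rows, cols = [], []
    for j, p in enumerate(primes):
        multiples = np.arange(p, n_max + 1, p)
        rows.append(multiples - 1)              # integer i -> row i-1
        cols.append(np.full(multiples.size, j))
    data = np.ones(sum(r.size for r in rows), dtype=np.int8)
    X = sparse.csr_matrix(
        (data, (np.concatenate(rows), np.concatenate(cols))),
        shape=(n_max, primes.size),
    )
    return X, primes

X, primes = divisibility_matrix(10_000)
```

Cosine similarity between rows can then be computed directly on the sparse matrix, e.g. with sklearn's `cosine_similarity`, without ever densifying it.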
We get the spectrum by changing the "exaggeration" in t-SNE, i.e. multiplying all attractive forces by a constant factor ρ. Prior work by @GCLinderman et al. showed that ρ->inf corresponds to Laplacian eigenmaps. We argue that the entire spectrum is interesting. [2/n]
Here is a toy dataset with 20 Gaussians arranged on a line, like a necklace. With LE one sees the string. With t-SNE one sees the individual beads. [3/n]
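A minimal way to generate such a "necklace" dataset; the bead spacing, dimensionality, and sample sizes here are hypothetical choices, not necessarily the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(42)

# 20 Gaussian "beads" with means equally spaced on a line
n_per = 100
centers = np.arange(20) * 10.0  # bead centers at 0, 10, ..., 190
X = np.concatenate([
    rng.normal(loc=[c, 0.0], scale=1.0, size=(n_per, 2))
    for c in centers
])
labels = np.repeat(np.arange(20), n_per)

# One could then sweep the attraction factor rho, e.g. via openTSNE's
# `exaggeration` parameter (assumed here): large rho gives the
# Laplacian-eigenmaps-like string, rho = 1 gives standard t-SNE beads.
#   from openTSNE import TSNE
#   embedding = TSNE(exaggeration=rho).fit(X)
```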