Dmitry Kobak

@hippopedoid

Sep 30, 2021 • 11 tweets • 6 min read • Read on X

So what's up with the Russian election two weeks ago? Was there fraud?

Of course there was fraud. Widespread ballot stuffing was videotaped etc., but we can also prove fraud using statistics.

See these *integer peaks* in the histograms of the polling station results? 🕵️‍♂️ [1/n]

These peaks are formed by polling stations that report integer turnout percentage or United Russia percentage. E.g. 1492 ballots cast at a station with 1755 registered voters. 1492/1755 = 85.0%. Important: 1492 is not a suspicious number! It's 85.0% which is suspicious. [2/n]
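The arithmetic in that example is easy to check. Here is a minimal sketch; the function name and the ±0.05 percentage point tolerance are my assumptions (1492/1755 is 85.014%, so "integer" has to mean "integer after rounding to one decimal"), not something stated in the thread.

```python
def is_integer_percentage(numerator, denominator, tol=0.05):
    """Return True if numerator/denominator lies within `tol`
    percentage points of a whole-number percentage.

    The tolerance of 0.05 is an assumption made for this sketch.
    """
    pct = 100.0 * numerator / denominator
    return abs(pct - round(pct)) <= tol


# The station from the example: 1492 ballots, 1755 registered voters.
print(round(100 * 1492 / 1755, 3))        # 85.014
print(is_integer_percentage(1492, 1755))  # True

# Two ballots fewer and the station is no longer "integer".
print(round(100 * 1490 / 1755, 3))        # 84.9
print(is_integer_percentage(1490, 1755))  # False
```

The point of the check is that neither 1492 nor 1755 is round on its own; only their ratio is.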

We can use binomial Monte Carlo simulation to find how many polling stations with integer percentages there should be by chance. Then we can compute the number of EXCESS integer polling stations (roughly the summed heights of all INTEGER PEAKS).

Resulting excess is 1300. [3/n]
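A rough sketch of the idea, not the actual analysis code (that is in the notebook linked later in the thread). Assumptions made here: only turnout is checked, "integer" means within ±0.05 percentage points of a whole percent, each station's ballot count is resampled as Binomial(voters, observed turnout), and the data in the demo are synthetic. All function names are hypothetical.

```python
import numpy as np


def is_integer_pct(numer, denom, tol=0.05):
    """Elementwise: is 100 * numer / denom within `tol` of a whole percent?"""
    pct = 100.0 * np.asarray(numer) / np.asarray(denom)
    return np.abs(pct - np.round(pct)) <= tol


def excess_integer_stations(voters, ballots, n_sim=200, tol=0.05, seed=0):
    """Observed, expected and excess number of integer-turnout stations.

    The expected count comes from a binomial Monte Carlo: every station
    keeps its number of voters and its observed turnout as the success
    probability, and the ballot count is redrawn `n_sim` times.
    """
    rng = np.random.default_rng(seed)
    voters = np.asarray(voters)
    ballots = np.asarray(ballots)
    turnout = ballots / voters

    observed = int(is_integer_pct(ballots, voters, tol).sum())

    # Shape (n_sim, n_stations): one simulated election per row.
    simulated = rng.binomial(voters, turnout, size=(n_sim, voters.size))
    expected = is_integer_pct(simulated, voters, tol).sum(axis=1).mean()

    return observed, expected, observed - expected


# Synthetic demo (made-up data): 5000 honest stations plus 300 stations
# whose reported ballots were set to exactly 85% of registered voters.
rng = np.random.default_rng(42)
n_honest, n_fake = 5000, 300
voters = rng.integers(1001, 3000, size=n_honest + n_fake)
ballots = rng.binomial(voters, rng.uniform(0.3, 0.9, size=voters.size))
ballots[:n_fake] = np.round(0.85 * voters[:n_fake]).astype(int)

observed, expected, excess = excess_integer_stations(voters, ballots)
print(f"observed={observed}, expected={expected:.1f}, excess={excess:.1f}")
```

In this toy setup roughly one station in ten has integer turnout purely by chance, so the raw count of integer stations is not informative on its own; it is the excess over the simulated baseline that matters.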

https://twitter.com/hippopedoid/status/1278673265075118081

1300 clearly fraudulent stations is a lot! But it's not as many as in recent years, especially in 2020 (constitutional referendum). [4/n]

Does this mean that there was less fraud this time? Not at all! But it seems it was less stupidly done.

Here is a 2D scatter plot of turnout vs. United Russia result. This suggests the actual result was ~30%, possibly a few % more, instead of the official 49.8%. [5/n]

Here is how this "comet" compares to the previous federal elections over the Putin era.

In terms of how many % points were added to the leader's result during counting, this election may actually have been the worst ever (but it's a close call with 2011). [6/n]

See our series of papers (with Sergey Shpilkin and @MPchenitchnikov) regarding the methodology of integer peak calculations:

* projecteuclid.org/journals/annal…
* rss.onlinelibrary.wiley.com/doi/full/10.11…
* rss.onlinelibrary.wiley.com/doi/full/10.11…
* rss.onlinelibrary.wiley.com/doi/abs/10.111…

[7/n]

Just an example of how stupidly it _was_ sometimes done. This entire 2D integer peak with 75.0% turnout and 75.0% United Russia result (back in 2011) was due to one single city: Sterlitamak (in Bashkortostan). Obviously they did not even count the ballots. [8/n]

You can find all the data (in CSV) and my analysis code (as a Python notebook) at github.com/dkobak/electio…. The data have been scraped by Sergey Shpilkin. [9/n]

https://twitter.com/hippopedoid/status/1439897585783803914

Scraping the data was much more difficult this time, because it was deliberately obfuscated (see below). Of course eventually people wrote several de-obfuscators, e.g. see this very detailed write-up by Alexander Shpilkin: purl.org/cikrf/un/unfuc…. [10/10]

Update: here is my new favourite plot on this topic. I pooled the data from all 11 federal elections from 2000 to 2021 and made a scatter plot of all 1+ million polling stations together. Just look at the periodic integer pattern in the top-right (i.e. fraudulent) corner! [11/10]

