Here, p and ε are known parameters (inputs); the goal is to be correct with probability at least 99/100 (over the random coin flips). As we will see, 99/100 is somewhat arbitrary: any constant c with 1/2 < c < 1 would work.
Also, warning: I'm going to give DETAILS.
2/
Let's start with Q1, the "fair" vs. ε-biased coin setting (p=1/2). As more than 66% of you answered, the number of coin flips necessary and sufficient for Q1 is Θ(1/ε²). Why?
One way to gain intuition is to think in terms of "signal to noise."
If you take n coin 🪙 flips, the expectation of the number of Heads you see is either n/2 or n(1/2+ε). So the gap b/w the two is Δ=nε.
The *variance* of that number is np(1-p)=n/4 or n(1/4-ε²)≈n/4. So the standard deviation in both cases is σ≈√n/2.
Now, things will...
4/
... typically fluctuate around their expectation by a few stds, so to be able to distinguish b/w the two cases we expect (!) to need Δ >> σ, that is nε >> √n. Reorganise, and you get n >> 1/ε².
That's the intuition: let's make it formal! For the upper bound, the algorithm...
5/
... just follows the above. Count the number of Heads 🪙, threshold at T=n/2+Δ/2. To prove it works, use Chebyshev's inequality to get what the above argument states: if n >> 1/ε², the std is small compared to the gap Δ, and that thresholding works w/ high probability. Done!
6/
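If you prefer code to prose, here is a minimal sketch of that tester (not from the thread: the constant C and the `flip()` interface are mine, and Chebyshev's inequality is what would pin C down):

```python
import random

def test_fair_vs_biased(flip, eps, C=10):
    """Decide "fair" (p = 1/2) vs. "biased" (p = 1/2 + eps) from n = C/eps^2 flips.

    `flip()` is assumed to return 1 for Heads and 0 for Tails; the constant
    C is illustrative (Chebyshev's inequality gives an explicit value).
    """
    n = int(C / eps**2)
    heads = sum(flip() for _ in range(n))
    threshold = n / 2 + n * eps / 2   # halfway between the two expected counts
    return "biased" if heads > threshold else "fair"

# quick sanity check with a simulated (1/2 + eps)-coin
eps = 0.05
coin = lambda: 1 if random.random() < 0.5 + eps else 0
print(test_fair_vs_biased(coin, eps))   # "biased", w/ high probability
```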
For the lower bound... Well, I have 3 options for you. All start from the standard fact that, given *one* sample from either P or Q, one cannot distinguish b/w the two distributions w/ probability of success greater than (1+TV(P,Q))/2.
Since n i.i.d. coin 🪙 flips from a Bernoulli P correspond to *one* sample from the product distribution Pⁿ (over {0,1}ⁿ), to distinguish we need the # of coin flips n to satisfy
TV(Pⁿ,Qⁿ) > 49/50
(don't focus on the constant) where P=Bern(1/2) and Q=Bern(1/2+ε).
8/
So our new goal: upper bound TV(Pⁿ,Qⁿ). A first idea 💡 is to use subadditivity of TV for product distributions:
TV(Pⁿ,Qⁿ) ≤ n TV(P,Q)
and since TV(P,Q)=ε (easy here, Bernoullis) we get (ignoring some constants) the lower bound n > 1/ε. That's... not great, we want 1/ε². 9/
💡Second attempt: subadditivity of the TV distance is not tight, that's a well-known issue with TV. Let's use something else as a proxy, something which plays well with product distributions (technically, which "tensorizes").
Hell, let's do Hellinger distance: that's H(P,Q)=(1/√2)‖√P-√Q‖₂, which is weird but why not.
10/
Basically, that's the ℓ₂ distance b/w the square roots of the probability mass functions (TV was, up to a factor 1/2, the ℓ₁ distance b/w the pmfs). We conveniently know that TV(P,Q) ≲ H(P,Q), and that the *squared* Hellinger is subadditive* for products: H²(Pⁿ,Qⁿ) ≤ n·H²(P,Q). Since H²(P,Q) = Θ(ε²) here (another easy Bernoulli computation), chaining the two gives 1 ≲ TV(Pⁿ,Qⁿ)² ≲ H²(Pⁿ,Qⁿ) ≤ n·H²(P,Q) = Θ(nε²), and so, again, n = Ω(1/ε²).
*actually, even a bit better: we have an exact expression, but whatever.
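A quick numerical sanity check (mine, not the thread's) that the squared Hellinger distance between the two Bernoullis is indeed Θ(ε²), using the definition above:

```python
from math import sqrt

def hellinger_sq(p, q):
    """Squared Hellinger distance between Bern(p) and Bern(q),
    with the (1/sqrt(2)) normalization used above."""
    return 0.5 * ((sqrt(p) - sqrt(q)) ** 2 + (sqrt(1 - p) - sqrt(1 - q)) ** 2)

for eps in [0.1, 0.01, 0.001]:
    h2 = hellinger_sq(0.5, 0.5 + eps)
    print(eps, h2, h2 / eps**2)   # ratio ≈ 1/2: H²(P,Q) = Θ(ε²)
```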
💡Second option: you don't want to go to Hell(inger), you're more of an information-theorist. So... Kullback–Leibler divergence! Which is (sub)additive for product distributions: go tensorization! 🎊
KL(P₁⊗P₂,Q₁⊗Q₂)=KL(P₁,Q₁)+KL(P₂,Q₂)
13/
So we do the same thing and use Pinsker's inequality:
1 ≲ TV(Pⁿ,Qⁿ)² ≲ KL(Pⁿ,Qⁿ) ≤ n KL(P,Q) = Θ(nε²)
using that KL(P,Q) = 2ε²+o(ε²) – again a relatively simple computation for the KL b/w two Bernoullis: wolframalpha.com/input/?i=Serie…)
Again our n = Ω(1/ε²) lower bound. 🥂
14/
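If you'd rather not trust Wolfram Alpha, here's a short check (mine) of that KL expansion:

```python
from math import log

def kl_bern(p, q):
    """KL(Bern(p) || Bern(q)), in nats (natural log)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

for eps in [0.1, 0.01, 0.001]:
    kl = kl_bern(0.5, 0.5 + eps)
    print(eps, kl, kl / eps**2)   # ratio → 2: KL(P,Q) = 2ε² + o(ε²)
```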
Last method: let's forget proxies and keep TV. The *sum* of the n coin 🪙 flips is a sufficient statistic here: given the sum, you can place that many Heads uniformly at random among the n positions and get something distributed exactly like the original tuple of flips. Since post-processing cannot increase TV distance (data processing inequality), the TV b/w the distributions of the *sum* of the n flips is at least the TV b/w the distributions of the *tuples* of n flips. So..?
15/
The latter is TV(Pⁿ,Qⁿ), but the former is the TV b/w two binomials! So we get
1 ≲ TV(Pⁿ,Qⁿ) ≲ TV(Bin(n,1/2), Bin(n,1/2+ε))
At this point, you pray the TV distance b/w Binomials is known or you use some Gaussian 🔔 or Poisson 🐟 approximation and hope for the best.
16/
Good thing: it *is* known that
TV(Bin(n,1/2), Bin(n,1/2+ε)) ≍ ε√n
in our regime (ε√n ≲ 1, i.e., before the TV saturates at 1), and so we're good! Again, we get our n = Ω(1/ε²) lower bound. 🥂
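For the skeptical: here's a small script (mine; it computes the pmfs in log-space to avoid underflow) that evaluates the TV between the two Binomials exactly and checks the ≍ ε√n behaviour:

```python
from math import exp, lgamma, log, sqrt

def binom_pmf(k, n, p):
    """Binomial pmf, computed in log-space to avoid under/overflow."""
    logc = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(logc + k * log(p) + (n - k) * log(1 - p))

def tv_binomials(n, p, q):
    """TV distance between Bin(n, p) and Bin(n, q) (exact up to float error)."""
    return 0.5 * sum(abs(binom_pmf(k, n, p) - binom_pmf(k, n, q)) for k in range(n + 1))

eps = 0.01
for n in [100, 1000, 10000]:
    tv = tv_binomials(n, 0.5, 0.5 + eps)
    print(n, tv, tv / (eps * sqrt(n)))   # ratio roughly constant while ε√n ≲ 1
```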
So we have matching upper and lower bounds on the sample complexity, Θ(1/ε²), and hopefully a bit of intuition about why it holds. That was the first question.
Yep, we still have 3 more to go! (It'll go faster though, I swear.)
18/
Q2 asked about an extremely biased coin: either always Tails, or *almost* always Tails. Here too, roughly 66% of you got the right answer: now Θ(1/ε) 🪙 flips are necessary and sufficient to distinguish b/w the two cases.
The proof is much easier, too!
Think of it as a Geometric r.v. with parameter ε, for instance: you need roughly 1/ε flips to see even *one* Heads with decent probability. Until then, there's nothing you can say!
(Conversely, as soon as you see a Heads, you know it's not the p=0 coin 🪙.)
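To make the Θ(1/ε) claim concrete, here's a tiny computation (mine) of the probability of seeing at least one Heads in n flips of a Bern(ε) coin:

```python
def prob_at_least_one_heads(eps, n):
    """P[at least one Heads in n independent flips of a Bern(eps) coin]."""
    return 1 - (1 - eps) ** n

eps = 0.01
for n in [10, 100, 500]:   # 0.1/eps, 1/eps, 5/eps
    print(n, prob_at_least_one_heads(eps, n))
# ≈ 0.10, 0.63, 0.99: you need n = Θ(1/ε) flips for a constant success probability
```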
What about Q3, where neither of the two coins is "trivial"? What if we want to distinguish b/w a coin w/ parameter 10ε and one w/ parameter 11ε, and be correct 99% of the time?
This one is tricky: only ≈21% of you selected the right answer, Θ(1/ε) flips.
To see why, let's... not answer the question. Instead, let's discuss the more general statement below:
Claim. To distinguish b/w a coin w/ parameter α and one w/ param α(1+β), Θ(1/(αβ²)) coin flips are necessary and sufficient.
This implies the above for α=10ε and β=1/10.
22/
Let's start with the upper bound: the same type of 𝔼+variance+Chebyshev analysis as in Q1 will do the trick. To see why, consider our signal-to-noise argument:
- the gap in 𝔼 b/w the 2 cases is Δ=nαβ
- the std in both cases (β<<1) is ≈√(nα).
We need Δ >> σ, i.e., nαβ >> √(nα); reorganize, and you get n >> 1/(αβ²). Done.
23/
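Here's the same threshold tester as before, adapted to the α vs. α(1+β) setting (again a sketch of mine; the constant C is what the Chebyshev computation would pin down):

```python
def test_two_biases(flip, alpha, beta, C=10):
    """Distinguish a Bern(alpha) coin from a Bern(alpha*(1+beta)) coin.

    Uses n = C/(alpha*beta^2) flips, matching the claim; `flip()` is assumed
    to return 1 for Heads and 0 for Tails, and C is an illustrative constant.
    """
    n = int(C / (alpha * beta**2))
    heads = sum(flip() for _ in range(n))
    threshold = n * alpha * (1 + beta / 2)   # halfway between n*alpha and n*alpha*(1+beta)
    return "alpha(1+beta)" if heads > threshold else "alpha"
```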
For the lower bound? Let's adapt our KL divergence proof. We again need
1 ≲ TV(Pⁿ,Qⁿ)² ≲ KL(Pⁿ,Qⁿ) ≤ n KL(P,Q)
and now KL(P,Q)=Θ(αβ²), so n must be Ω(1/(αβ²)). Hurray! 🥂
But that's an argument; maybe we can get more intuition?
Think of it this way: with probability 2α, you get to toss a coin 🪙 which has parameter either 1/2, or (1+β)/2. But with probability 1-2α, you don't toss anything, you just get Tails.
In that thought experiment...
25/
You do get (overall) a 🪙 with parameter either 2α×1/2=α or 2α×(1+β)/2=α(1+β). But in this new setting it's clearer why we get Θ(1/(αβ²)): only every 1/α flips or so, you get *one* flip from the "useful" coin, which is either fair or (β/2)-biased.
By Q1, you need Θ(1/β²) of these useful flips, each costing ≈1/α actual flips: so we get (1/α)×Θ(1/β²)!
26/
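The thought experiment is easy to simulate, if that helps (my sketch; the check at the end just verifies the overall bias):

```python
import random

def mixture_flip(alpha, beta):
    """One flip of the 'thought experiment' coin: w/ probability 2*alpha,
    toss a useful coin of parameter (1+beta)/2; otherwise output Tails (0).
    Overall this is a Bern(alpha*(1+beta)) coin."""
    if random.random() < 2 * alpha:
        return 1 if random.random() < (1 + beta) / 2 else 0
    return 0

alpha, beta = 0.05, 0.1
emp = sum(mixture_flip(alpha, beta) for _ in range(10**6)) / 10**6
print(emp, alpha * (1 + beta))   # both ≈ 0.055
```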
Time to wrap up and handle the last one, Q4: "this is all good, but what if 99% success probability isn't good enough for me, and instead I want 1-δ for arbitrarily small δ>0?"
... multiply all the bounds we discussed by a factor log(1/δ), and you're good.
How so, you ask? The upper bound is quite fun and useful: suppose you have an algo taking n samples and succeeding with probability 99%.
Then set t=O(log(1/δ)), repeat that algo independently...
28/
... t times on independent samples (so tn=O(n log(1/δ)) samples overall), and take a majority vote of the t outputs (which are just bits, since we're doing testing).
By Chernoff/Hoeffding, this gives the right answer w/ probability at least 1-δ: the success probability was "amplified." 29/
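A sketch of that amplification wrapper (mine; `tester` is assumed to draw its own fresh samples on each call and output a bit, and the constant c is illustrative):

```python
import math
from collections import Counter

def amplify(tester, delta, c=5):
    """Boost a constant-success tester to success probability 1 - delta by a
    majority vote over t = O(log(1/delta)) independent runs (Chernoff/Hoeffding)."""
    t = max(1, math.ceil(c * math.log(1 / delta)))
    votes = Counter(tester() for _ in range(t))
    return votes.most_common(1)[0][0]
```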
For the lower bound, now... you remember our KL-based lower bound, on tweet #14 of the 🧵? We used Pinsker's inequality to relate TV to KL.
We can use another inequality instead, and we'll get the log(1/δ) in the lower bound "for free." 🤯
Cf. below+one line of algebra...
30/
... specifically, Bretagnolle–Huber states that TV(Pⁿ,Qⁿ) ≤ √(1−e^(−KL(Pⁿ,Qⁿ))); plugging in TV(Pⁿ,Qⁿ) ≥ 1−δ and KL(Pⁿ,Qⁿ) = Θ(nε²), then reorganizing, we get
e^(−Θ(nε²)) ≤ 1−(1−δ)² = 2δ−δ²
hence the final bound, n = Ω(log(1/δ)/ε²).
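To see numerically why swapping Pinsker for Bretagnolle–Huber matters, here's a quick comparison (my sketch) of the two resulting upper bounds on TV as a function of KL:

```python
from math import exp, sqrt

def pinsker(kl):
    """Pinsker: TV ≤ sqrt(KL/2) — vacuous as soon as KL ≥ 2."""
    return min(1.0, sqrt(kl / 2))

def bretagnolle_huber(kl):
    """Bretagnolle–Huber: TV ≤ sqrt(1 - exp(-KL)) — never vacuous."""
    return sqrt(1 - exp(-kl))

for kl in [0.1, 1.0, 2.0, 5.0, 10.0]:
    print(kl, round(pinsker(kl), 4), round(bretagnolle_huber(kl), 4))
```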
More on Pinsker vs. Bretagnolle–Huber: 🧵 this thread (excerpted below):
I hope you enjoyed this week's quiz on 🪙, and the (admittedly, a bit long) discussion. Let me know below if you have any questions, comments, or feedback!
32/end
Note that this is only true for two r.v.'s: if X~X' and Y~Y', then
𝔼[X+Y]=𝔼[X'+Y']
whenever 𝔼[X+Y] is defined (regardless of whether 𝔼[X] and 𝔼[Y] are defined); cf. below, from G. Simons (1977).
Turns out, this fails for 3 r.v.'s (same paper by Simons, and Counterexample 6.1 in Stoyanov's book).
Sometimes 𝔼[X+Y+Z], 𝔼[X'+Y'+Z'] both exist, yet 𝔼[X+Y+Z]≠𝔼[X'+Y'+Z'].
(Again, for 2 r.v.'s the result holds: if 𝔼[X+Y], 𝔼[X'+Y'] both exist, then 𝔼[X+Y]=𝔼[X'+Y'].)
Stuff I wish I had known sooner: "Pinsker's inequality is cancelled," a thread. 🧵
If you want to relate total variation (TV) and Kullback-Leibler divergence (KL), then everyone, from textbooks to Google, will point you to Pinsker's inequality: TV(p,q) ≤ √(½ KL(p||q) )
1/
It's warranted: a great inequality, and tight in the regime where KL→0. Now, the issue is that KL is unbounded, while 0≤TV≤1 always, so the bound becomes vacuous when KL > 2. Oops.
2/
*I'm using it with the natural log, by the way, so no pesky log 2 constants.
This is annoying, because in many situations you want to bound a TV close to 1. For instance, in hypothesis testing: "how many samples to distinguish between an ε-biased and a fair coin, with probability 1-δ?"
Well, you'd like to start bounding 1-δ ≤ TV ≤ ?.
3/
📊 Answers and discussion for yesterday's quiz on sequences (and their names).
Before we go to the first question, again: if you haven't, bookmark oeis.org. It's useful: the solution to some of your problems may be recorded there!
... ascent sequences avoiding the pattern 201 (oeis.org/A202062)? Or the number of set partitions of [n] that avoid 3-crossings (oeis.org/A108304)? Who am I to judge?
A paper with a broken and ill-defined "solution" to privacy in ML gets a $50k prize?! While said solution was already fully broken at the time of the award—on a challenge chosen by the authors? 🤦‍♂️
The presentation slides must have been hella convincing.
Will Carlini et al. get a share of the prize money for breaking the system prior to the award ceremony? arxiv.org/abs/2011.05315
Disclaimer: I am in no way affiliated w/ any of the authors of the attack, nor w/ the authors of the original paper, nor w/ basically anyone involved. I'm just dismayed at the signal sent, & the overall waste—so much good work done elsewhere, and this is what gets the highlight.
📊 It's Thursday, time for our weekly (but weakly so) quiz! This week: integers, and remarkable sequences.
See this as an extended ad for the OEIS, if you will. Also, if you don't know the OEIS: go check it out (and come back)! oeis.org
1/
Let's start with a "fun" one. To be unhelpful, let's say as a hint that I'm expanding exp(e^x-1) as a series around 0, and am interested in the coefficients
Q1: Guess what number comes next—maybe this sequence will ring a bell? [There might be a trap]
1, 1, 2, 5, 15, 52, ??
2/
We won't do too many of those "guess what's next" games though, since they are mostly meaningless. But let's mention that, of course, @xkcdComic has a comic about integer sequences, because XKCD is basically Borges' library for the Internet. xkcd.com/2016/
3/
📊 It's Thursday for many, time for this week's quiz! As alluded earlier, we're going to flip coins. Like, a lot.
Today: you have a biased coin which lands heads with some probability p∊(0,1), and you can toss it as many times as you want. You don't know p, though...
1/10
... which is a real shame, since you want to simulate *another* biased coin which lands heads w/ proba exactly f(p), where f: (0,1)→(0,1) is fixed.
(Also, you'd like to do this efficiently, thank you very much.)
Let's start with a simple function: f(p)=1/2.
Q1: Can you?
2/10
OK, so that was the warmup. Let's formalize a bit more what "efficiently" means here. For us, it will mean (1) using a number of coin tosses that is finite almost surely (no infinite sequence of tosses), and (2) using a finite automaton.
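For the record, one classical answer to that warmup question, the von Neumann trick, satisfies both requirements (finitely many tosses almost surely, and a finite automaton); here's a sketch (mine):

```python
import random

def fair_flip(flip):
    """Von Neumann's trick: simulate f(p) = 1/2 from a Bern(p) coin `flip()`,
    without knowing p. Toss twice; HT -> Heads, TH -> Tails, else retry.
    The number of tosses used is finite almost surely."""
    while True:
        a, b = flip(), flip()
        if a != b:
            return a

p = 0.3
coin = lambda: 1 if random.random() < p else 0
print(sum(fair_flip(coin) for _ in range(10**5)) / 10**5)   # ≈ 0.5
```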