#Python tip: Given inexact data, subtracting nearly equal
values increases relative error significantly more
than absolute error.
  4.6 ± 0.2   Age of Earth   (4.3%)
− 4.2 ± 0.1   Age of Oceans  (2.4%)
─────────────
  0.4 ± 0.3   Huge relative error (75%)
This is called “catastrophic cancellation”.
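Here's that arithmetic as a quick Python sketch (worst-case error bars simply add under subtraction):

```python
earth, earth_err = 4.6, 0.2          # age of Earth, billions of years
oceans, oceans_err = 4.2, 0.1        # age of oceans

diff = earth - oceans
diff_err = earth_err + oceans_err    # worst-case absolute errors add

print(f"{earth_err / earth:.1%}  {oceans_err / oceans:.1%}  {diff_err / diff:.1%}")
# 4.3%  2.4%  75.0% -- the relative error explodes even though the absolute error barely grows
```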
1/
The subtractive cancellation issue commonly arises in floating point arithmetic. Even if the inputs are exact, intermediate values may not be exactly representable and will have an error bar. Subsequent operations can amplify the error.
For example, 1/7 is inexact but within ± ½ ulp.
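A tiny check of that claim (assumes Python 3.9+ for math.ulp):

```python
from fractions import Fraction
import math

x = 1 / 7                                   # nearest double to 1/7
error = abs(Fraction(x) - Fraction(1, 7))   # exact representation error
print(error <= Fraction(math.ulp(x)) / 2)   # True: within ± ½ ulp
```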
2/
A commonly given example arises when estimating derivatives with f′(x) ≈ Δf(x) / Δx.
Intuitively, the estimate improves as Δx approaches zero, but in practice, the large relative error from the deltas can overwhelm the result.
Make Δx small, but not too small. 😟
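Here's a quick sketch using sin (whose derivative is known exactly) and a hypothetical forward_diff helper:

```python
import math

def forward_diff(f, x, h):
    # The subtraction f(x+h) - f(x) cancels catastrophically once h is too small.
    return (f(x + h) - f(x)) / h

x, exact = 1.0, math.cos(1.0)       # d/dx sin(x) = cos(x)
for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-16):
    err = abs(forward_diff(math.sin, x, h) - exact)
    print(f"h={h:.0e}  error={err:.1e}")
# The error shrinks as h shrinks ... until cancellation takes over and it grows again.
```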
3/
While people tend to think of this as a floating point arithmetic problem, it arises anytime data has a range of uncertainty.
Example with integer arithmetic:
(50 ± 2) - (45 ± 2) ⟶ (5 ± 4)
4% relative errors turn into 80%.
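The same bookkeeping as a tiny, hypothetical uncertain-value class:

```python
from dataclasses import dataclass

@dataclass
class Uncertain:
    value: float
    err: float                       # worst-case absolute error

    def __sub__(self, other):
        # Absolute errors add under subtraction.
        return Uncertain(self.value - other.value, self.err + other.err)

    def rel(self):
        return self.err / abs(self.value)

a, b = Uncertain(50, 2), Uncertain(45, 2)
print(a - b)                                          # Uncertain(value=5, err=4)
print(f"{a.rel():.0%} and {b.rel():.1%} --> {(a - b).rel():.0%}")   # 4% and 4.4% --> 80%
```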
4/
Part of the art of numeric computing is algebraically rearranging calculations to avoid subtracting nearly equal values.
A famous example is rewriting the quadratic formula as:
𝑥 = 2𝑐 ÷ (−𝑏 ∓ √(𝑏²−4𝑎𝑐))
This form is used when |4ac| is small relative to b², the case where the standard formula would subtract nearly equal values.
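A sketch of the difference (hypothetical function names): the stable version computes the root that doesn't cancel, then recovers the other from x₁·x₂ = c/a, which is exactly the rearrangement above.

```python
import math

def roots_textbook(a, b, c):
    # (-b ± √(b²-4ac)) / 2a -- one root cancels badly when |4ac| << b²
    d = math.sqrt(b*b - 4*a*c)
    return (-b + d) / (2*a), (-b - d) / (2*a)

def roots_stable(a, b, c):
    d = math.sqrt(b*b - 4*a*c)
    x1 = (-b - d) / (2*a) if b >= 0 else (-b + d) / (2*a)   # no cancellation here
    return x1, c / (a * x1)                                  # 2c ÷ (-b ∓ √(b²-4ac))

print(roots_textbook(1.0, 1e8, 1.0))   # small root is badly off (catastrophic cancellation)
print(roots_stable(1.0, 1e8, 1.0))     # small root ≈ -1e-8, accurate
```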
5/
The “loss of significance” problem also arises with computing time differences:
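Elapsed time computed as the difference of two large, nearly equal timestamps keeps only the few low-order bits that differ. A minimal sketch, assuming time.time()-style timestamps:

```python
import math
import time

t0 = time.time()                 # ~1.7e9 seconds since the epoch
t1 = time.time()
elapsed = t1 - t0                # difference of two huge, nearly equal floats

print(math.ulp(t0))              # ~2.4e-7 s of resolution remains near 1.7e9
print(elapsed)                   # sub-microsecond intervals are mostly rounding noise
```

This is part of why short intervals are usually timed with time.perf_counter(), or with time.perf_counter_ns() to sidestep float rounding entirely.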
With dataclasses, you get nice attribute access, error checking, a name for the aggregate data, and a more restrictive equality test. All good things.
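For instance, with a hypothetical Point dataclass:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

p = Point(3.0, 4.0)
print(p.x, p.y)                    # attribute access
print(p)                           # Point(x=3.0, y=4.0) -- the aggregate has a name
print(p == Point(3.0, 4.0))        # True
print(p == {"x": 3.0, "y": 4.0})   # False -- equality is restricted to other Points
# Point(3.0) raises a TypeError (missing argument) -- the error checking mentioned above
```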
Dicts are at the core of the language and are interoperable with many other tools: json, **kw, …
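A quick sketch of that interoperability (made-up point data):

```python
import json

point = {"x": 3.0, "y": 4.0}
print(json.dumps(point))            # dicts round-trip through JSON directly

def translate(x, y, dx=1.0, dy=0.0):
    return {"x": x + dx, "y": y + dy}

print(translate(**point))           # ...and unpack straight into keyword arguments
```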
@yera_ee Dicts have a rich assortment of methods and operators.
People learn to use dicts on their first day.
Many existing tools accept or return dicts.
pprint() knows how to handle dicts.
Dicts are super fast.
JSON.
Dicts underlie many other tools.
@yera_ee Embrace dataclasses but don't develop an aversion to dicts.
Python is a very dict-centric language.
Mentally rejecting dicts would be like developing an allergy to the language itself. It leads to fighting the language rather than working in harmony with it.
1/ #Python data science tip: To estimate a vector of multiple parameters, you'll get a better result (on average) by analyzing the sample vectors in aggregate than by using the mean of each component separately.
Surprisingly, this works even if the components are unrelated to one another.
2/ One example comes from baseball.
Individual batting averages near the beginning of the season aren't as good a performance predictor as batting averages that have been “shrunk” toward the collective mean.
Shockingly, this also works for unrelated variables.
3/ If this seems unintuitive, then you're not alone. That is why it is called Stein's paradox 😉
The reason it works is that errors in one estimate tend to cancel the errors in the other estimates.
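A rough sketch of the idea using James–Stein-style shrinkage toward the grand mean (assumes equal, known noise σ for every component; not code from the thread):

```python
import random
from statistics import fmean

random.seed(42)
k, sigma = 50, 1.0
truth = [random.gauss(0.0, 1.0) for _ in range(k)]    # 50 unrelated parameters
x = [random.gauss(m, sigma) for m in truth]           # one noisy observation of each

# Shrink every component toward the grand mean of the observed vector.
grand = fmean(x)
ss = sum((xi - grand) ** 2 for xi in x)
shrink = 1.0 - (k - 3) * sigma ** 2 / ss
js = [grand + shrink * (xi - grand) for xi in x]

raw_err = sum((xi - m) ** 2 for xi, m in zip(x, truth))
js_err = sum((ji - m) ** 2 for ji, m in zip(js, truth))
print(f"raw estimates: {raw_err:.1f}   shrunken: {js_err:.1f}")
# Averaged over many such trials, the shrunken vector has lower total squared error.
```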