There seems to be a high correlation between folks who think "open source LLMs will win" & folks who
1) haven't developed any large-scale infra "in the cloud" themselves
or
2) have never priced out the cost of scaling out "in the cloud" (for the most powerful models)
1/6
or
3) underestimate the cost of investing in data infrastructure and tooling
(have you seen why Databricks exists?)
2/6
OSS is needed to justify the value of continuing to push the limits of scale (Sutton's bitter lesson), by enabling quick prototypes and demos of possible applications.
But no single software library will solve the manual integration glue needed with existing systems, and...
3/6
Piling on to the pile-on (sorry - it's always easy to criticize 😛), here's a rant about benchmarks for LLMs that are used to back claims of "stronger" or "better" models.
Let's start with a tour through GPT-3's Appendix G... 1/8
First up: BoolQ. If you download the actual benchmark, it's true/false completions. GPT-3 swaps in yes/no instead. Why? Well, when we did the same swap to yes/no, we saw a +10% accuracy jump on this benchmark.
Wonderful. Clearly on track for a better model already. 2/8
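(For context: benchmarks like BoolQ are usually scored by comparing the log-likelihood a model assigns to each candidate completion, so swapping the answer words changes the comparison itself. Here's a minimal sketch of that scoring, with a stand-in model and a made-up prompt; none of this is GPT-3's actual eval harness.)

```python
# Minimal sketch: score candidate completions by summed token log-likelihood.
# "gpt2" and the prompt below are placeholders, not what GPT-3 actually used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probs the model assigns to `completion` given `prompt`.
    Assumes the prompt's tokens form a prefix of the prompt+completion tokens
    (generally holds for BPE when the completion starts with a space)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # position i predicts token i+1, so shift by one
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(logprobs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

prompt = "Is the sky blue?\nanswer:"  # made-up BoolQ-style prompt
for pair in ([" yes", " no"], [" true", " false"]):
    scores = {c: completion_logprob(prompt, c) for c in pair}
    print(pair, "->", max(scores, key=scores.get))
```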
Next up: formatting. Why does CB get prompted with true/false while RTE gets True/False?
Why does WebQA use "Q/A", WiC use "question/answer", and ARC use "Question/Answer"?
Could it be... that you simply get better results switching it up? 🤔
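Here's what "switching it up" looks like in practice: sweep a few arbitrary format choices and keep whichever scores best. The templates and the one-row dataset below are hypothetical, and the sketch reuses completion_logprob from above.

```python
# Toy sweep over arbitrary prompt-format choices (prefix casing, answer words).
# Everything here is a hypothetical stand-in, just to show how a format sweep works.
FORMATS = [
    ("Q: {q}\nA:", (" yes", " no")),
    ("question: {q}\nanswer:", (" yes", " no")),
    ("Question: {q}\nAnswer:", (" True", " False")),
]

dataset = [{"q": "Is the sky blue?", "label": 0}]  # label 0 = first completion

for template, completions in FORMATS:
    correct = 0
    for row in dataset:
        prompt = template.format(q=row["q"])
        scores = [completion_logprob(prompt, c) for c in completions]
        correct += int(scores.index(max(scores)) == row["label"])
    print(f"{template!r:<30} acc = {correct / len(dataset):.2f}")
```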
After ignoring the details in all these "let's-fit-a-cloud-of-points-to-a-single-line" papers (all likely wrong when you really extrapolate), @stephenroller finally convinced me to work through the math in the Chinchilla paper and, as expected, it was a doozy. [1/7]
First thing to make me eye-roll a bit was this fancy equation (4), which re-parameterizes the key exponent terms (a, b) in terms of (alpha, beta) to define a coefficient term G. Why this level of indirection just to define a scalar coefficient? No idea. [2/7]
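For reference, equation (4) reads roughly as below (reproduced so the (a, b) vs (alpha, beta) indirection is visible; see the paper for the exact statement):

```latex
% Compute-optimal allocation from fitting L(N, D) = E + A/N^alpha + B/D^beta
% under the constraint C ~= 6ND (Chinchilla Eq. (4)):
\begin{align*}
  N_{opt}(C) &= G \left(\frac{C}{6}\right)^{a}, \qquad
  D_{opt}(C) = G^{-1} \left(\frac{C}{6}\right)^{b}, \\
  \text{where } G &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}}, \quad
  a = \frac{\beta}{\alpha + \beta}, \quad
  b = \frac{\alpha}{\alpha + \beta}.
\end{align*}
```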
So then you naturally start wondering what A/B/a/b could be. First stop: (a, b) is set to different values for 3 different "Approaches" in Table 2, each seeming to differ by just a hair: (0.5, 0.5) vs (0.49, 0.51) vs (0.46, 0.54). Ok, sure, why not. [3/7]
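Those hair's-breadth differences stop being small once you extrapolate, though. A quick back-of-the-envelope where only the exponents come from Table 2 (the million-fold compute multiplier and the shared starting point are made up):

```python
# How far the three Table 2 exponent pairs drift apart under extrapolation.
# Only the exponents are from the paper; the 1e6x compute multiplier is arbitrary,
# and all approaches are anchored to agree at the starting compute budget.
APPROACHES = {"Approach 1": (0.50, 0.50),
              "Approach 2": (0.49, 0.51),
              "Approach 3": (0.46, 0.54)}

SCALE = 1e6  # extrapolate a million-fold in compute

for name, (a, b) in APPROACHES.items():
    n_growth = SCALE ** a  # multiplier on N_opt relative to the starting point
    d_growth = SCALE ** b  # multiplier on D_opt relative to the starting point
    print(f"{name}: N_opt x{n_growth:,.0f}, D_opt x{d_growth:,.0f}, "
          f"tokens-per-param ratio shifts x{d_growth / n_growth:.2f}")
```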