Horace He
Working at the intersection of ML and Systems @ PyTorch "My learning style is Horace twitter threads" - @typedfemale
Dec 7, 2023 21 tweets 9 min read
As mentioned previously, I found AlphaCode2 accounts, and through stalking their submission history, I manually performed the AlphaCode2 Codeforces evals.

Overall, very impressive! I arrive at a rating of ~1650, which is the 85-90th percentile of CF users.

(1/19)
I am somewhat concerned about data leakage - see the follow-up thread below. This is an AlphaCode2 contributor's response.

For the purposes of this analysis I'll take the results at face value.

(2/19)
Dec 7, 2023 7 tweets 3 min read
I reverse-engineered AlphaCode2's submission history and manually performed the Codeforces evals.

I'm ... again concerned that data leakage is affecting the results.

For the DP problem highlighted in the AlphaCode2 release, look at AC2's solution vs. the tutorial.

(1/5)

It's admittedly a bit difficult to know for sure, since DP problems sometimes tend to have more formulaic solutions.

But let's look at 3 user solutions from the leaderboard ... and now let's look at 3 solutions from AlphaCode2.

There's much more diversity in the user code.
(2/5)
Nov 30, 2023 17 tweets 7 min read
Happy to OSS gpt-fast, a fast and hackable implementation of transformer inference in <1000 lines of native PyTorch, with support for quantization, speculative decoding, tensor parallelism (TP), Nvidia/AMD GPUs, and more!

Code: github.com/pytorch-labs/g…
Blog: pytorch.org/blog/accelerat…

(1/12)
If we just start out with a naive implementation of transformer inference in PyTorch, the performance is ... not great, at about 25 tok/s. Looking at a trace, the number one reason is that it's heavily *overhead bound*. To fix this, we can apply torch.compile!

(2/12)
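To make that concrete, here's a minimal sketch of that step (a toy MLP standing in for the model - not the actual gpt-fast code): wrap the forward pass in torch.compile with mode="reduce-overhead", which captures the work into CUDA graphs so the many tiny kernels of batch-size-1 decoding stop paying per-launch CPU overhead.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer decode step (not gpt-fast's actual model).
model = nn.Sequential(
    nn.Linear(4096, 4 * 4096, bias=False),
    nn.GELU(),
    nn.Linear(4 * 4096, 4096, bias=False),
).half().cuda().eval()

x = torch.randn(1, 4096, device="cuda", dtype=torch.half)

# mode="reduce-overhead" uses CUDA graphs, removing the per-kernel launch
# overhead that dominates when each kernel does very little work.
compiled = torch.compile(model, mode="reduce-overhead")

with torch.no_grad():
    for _ in range(3):          # the first calls compile and capture the graph
        out = compiled(x)
```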
Mar 14, 2023 9 tweets 3 min read
I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces.

Of the easiest problems on Codeforces, it solved 10/10 pre-2021 problems and 0/10 recent problems.

This strongly points to contamination.

1/4

800-rated problems are the easiest problems on Codeforces, and their ratings are determined automatically based on the ratings of the people who solved them during the contest. Thus, I would expect these problems to be of roughly "equal" difficulty, and my spot check agrees.

2/4
Feb 27, 2023 19 tweets 7 min read
Recently, Karpathy tweeted that *increasing* the size of his matmul made it run faster.

But... why? Many people seem content to leave this as black magic. But luckily, this *can* be understood!

Here's a plot of FLOPs achieved for square matmuls. Let's explain each curve!

1/19

There are 3 concepts needed to explain the above graph - compute intensity, tiling, and wave quantization.

Compute intensity (along with increased parallelism) is the primary factor explaining why matmuls generally perform better as you increase the size.

2/19
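If you want to reproduce a curve like this yourself, here's a rough sketch (my own timing script, not the one used for the plot): time square fp16 matmuls across a range of sizes and convert each measurement into achieved TFLOPS.

```python
import time
import torch

def achieved_tflops(n, iters=20, dtype=torch.half):
    A = torch.randn(n, n, device="cuda", dtype=dtype)
    B = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(3):                       # warm-up
        A @ B
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        A @ B
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters
    return 2 * n**3 / seconds / 1e12         # an n x n matmul is ~2*n^3 FLOPs

for n in range(256, 4097, 256):
    print(f"{n:5d}  {achieved_tflops(n):6.1f} TFLOPS")
```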
Feb 1, 2023 13 tweets 6 min read
Let's talk about a detail that occurs during PyTorch 2.0's codegen - tiling.

In many cases, tiling is needed to generate efficient kernels. Even for something as basic as torch.add(A, B), you might need tiling to be efficient! But what is tiling? And when is it needed?

(1/13)

To explain tiling, we first need to understand hardware memory accesses. Memory doesn't transfer elements one at a time - it transfers large "chunks". That is, even if you only need one element, the GPU will load that element... and the 31 elements next to it.

(2/13)
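As a toy illustration of when this matters (my example, not one from the thread): add a contiguous tensor to a transposed view. Whatever loop order an elementwise kernel picks, one of the two inputs is read with a large stride, so its loads waste most of each memory chunk - unless the kernel is tiled so that both inputs are read in coalesced blocks.

```python
import torch

A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda").t()   # transposed *view*: strides are (1, 4096)

# A is contiguous along its rows, but B's "rows" are strided across memory.
# A naive elementwise kernel gets coalesced reads for one input and scattered
# reads for the other; a tiled kernel processes square blocks so both inputs
# are loaded in full memory chunks.
C = torch.add(A, B)
```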
Jan 21, 2023 9 tweets 4 min read
Another thing PyTorch 2.0 helps speed up - overhead.

Overhead is everything other than the GPU doing work. It can come from Python, the ML framework, CUDA kernel launches, etc. - regardless, it's why your nvidia-smi util is so low!

So... how do we diagnose and resolve it?
(1/8)

Let's look at a simple example - resnet18 inference with a single image. We see that we achieve about 2ms latency - not great for this model.

But what's going on? Let's take a look at a profile trace to find out...

(2/8)
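Here's a sketch of how to capture such a trace yourself with the built-in profiler (not the exact commands from the thread):

```python
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(3):                       # warm-up so we don't profile one-time setup
        model(x)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)

# Open the exported trace in chrome://tracing or Perfetto: if the GPU row is
# mostly empty space between tiny kernels, the model is overhead-bound rather
# than compute-bound.
prof.export_chrome_trace("resnet18_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```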
Dec 10, 2022 10 tweets 3 min read
Eager mode was what made PyTorch successful. So why did we feel the need to depart from eager mode in PyTorch 2.0?

Answer: it's the damn hardware!

Let's tell a story about how the assumptions PyTorch was built on became untrue, and why PyTorch needed to evolve. (1/10)

When PyTorch first came out, the prevailing wisdom was that eager mode sacrificed performance for more flexibility.

But in practice, people didn't really notice performance gaps! In fact, in many cases, people found PyTorch to be *faster* than graph-mode frameworks.

Why? (2/10)
Oct 16, 2022 8 tweets 3 min read
I've found it unexpectedly useful to memorize facts about systems I work with.

Knowing these numbers allows one to 1. sanity check performance, 2. sketch out feasibility of technical solutions, and 3. reason about performance characteristics.

Some examples below: (1/7)

GPT3 sequence length? 2048. GPT3 hidden dim? ~12000.

Attention FLOPs scale with seq^2 * d_model. Feedforward FLOPs scale with seq * d_model^2. Since the hidden dim is >> sequence length, of course attention is a negligible component of GPT3 FLOPs: (2/7)
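A back-of-the-envelope check (my arithmetic, counting only the dominant matmuls and using d_model = 12288 for the largest GPT-3):

```python
seq, d_model = 2048, 12288                     # GPT-3 175B context length and hidden dim

# Per layer, for one full sequence (a multiply-add counted as 2 FLOPs):
attn_scores  = 2 * 2 * seq**2 * d_model        # QK^T plus attention-weights @ V
qkv_and_proj = 2 * 4 * seq * d_model**2        # Q, K, V and output projections
ffn          = 2 * 2 * seq * d_model * (4 * d_model)  # two feedforward matmuls, 4x expansion

total = attn_scores + qkv_and_proj + ffn
print(f"seq^2-scaling attention share: {attn_scores / total:.1%}")   # roughly 3% of layer FLOPs
```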
Sep 7, 2022 10 tweets 4 min read
Ever since the V100, Nvidia has been cramming more and more "tensor cores" into each GPU generation.

But what *are* tensor cores? How can you use them to accelerate deep learning models by >10x?

And ... why does their existence make me somewhat sad :(

(1/9)

Tensor cores, put simply, are "hardware hard-coded for matrix multiplication".

To be clear, that doesn't mean something abstract like "matrix multiplications are embarrassingly parallel" - there's literally a hardware instruction (mma.sync) that allows you to use them.

(2/9)
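To see the gap yourself, here's a rough sketch (my own illustration, not a benchmark from the thread) of how the same matmul lands on different hardware units depending on dtype:

```python
import torch

# Whether a matmul hits the tensor cores is mostly a dtype question.
torch.backends.cuda.matmul.allow_tf32 = False   # full-precision fp32 -> regular CUDA cores
A32 = torch.randn(4096, 4096, device="cuda")
C32 = A32 @ A32                                 # runs on the fp32 pipeline

A16 = A32.half()
C16 = A16 @ A16                                 # cuBLAS dispatches this to tensor-core kernels

# On an A100 the advertised peak is ~19.5 fp32 TFLOPS vs ~312 fp16 tensor-core
# TFLOPS, which is where the ">10x" speedup for matmul-heavy models comes from.
```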
Aug 14, 2022 9 tweets 3 min read
Why is OpenAI's new compiler, Triton, so exciting? And what distinguishes it from other efforts to provide a Python DSL for programming Nvidia GPUs, like Numba?

To answer that, we need to look at the operation behind all of deep learning - matrix multiplication. (1/7)

In neural networks, matmuls usually take up >99% of the computational cost. Everything else is a rounding error.

Correspondingly, modern accelerators (GPUs and otherwise) are built to perform matrix multiplications quickly. (2/7)
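For a sense of what the programming model looks like, here's the canonical Triton vector-add (adapted from Triton's introductory tutorial, not from this thread): you write the kernel at the granularity of blocks of elements, and Triton handles the thread-level details.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(100_000, device="cuda")
y = torch.randn(100_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```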
Jun 27, 2022 4 tweets 3 min read
Do you like einops? Do you like NamedTensors? Do you struggle with Numpy-style positional-based indexing?

Check out first class dimensions (from @Zachary_DeVito )! It unifies einops/named dims under one concept, and allows for even more!

Here are some neat examples (1/4)

Here's a bunch of examples I found cool.

Last week, I saw @ch402 's request for an "einsum for gather", and I realized that first-class dims already handled it.

Moreover, it worked for the "2d version of gather" requested as well (2/4)
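For a flavor of the API, here's roughly what a matmul written with first-class dimensions looks like (reconstructed from memory of the functorch.dim prototype; the import path and method names here are assumptions and may differ from the current API):

```python
import torch
from functorch.dim import dims        # first-class dimensions prototype (path assumed)

i, j, k = dims(3)                      # three first-class dimension objects

A = torch.randn(64, 32)
B = torch.randn(32, 128)

# Indexing with dims names each axis; arithmetic broadcasts over named dims,
# .sum(k) reduces over a named dim, and .order(i, j) lays the result back out
# as an ordinary positional tensor of shape (64, 128).
C = (A[i, k] * B[k, j]).sum(k).order(i, j)

assert torch.allclose(C, A @ B, atol=1e-4)
```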

Oct 6, 2021 4 tweets 1 min read
openreview.net/forum?id=TVHS5…
A very ... interesting 4 page paper at ICLR. I'm curious to see the reviewers' reactions.
Nov 5, 2019 5 tweets 3 min read
Some metadata for those curious about their #ICLR2020 reviews.

1. Histogram of the average reviews.
2. Top x% deciles

Seems like reviews this year at @iclr_conf are substantially lower than previous years. Probably an artifact of the new [1,3,6,8] reviewing system. (1/n)

For reviewer experience:
Out of 7583 total #ICLR2020 reviews:
1078 "do not know much about this area"
2484 "have read many papers in this area"
2604 "have published 1 or 2 papers"
1417 "have published in this field for many years"

47% of reviews come from reviewers who haven't published in this area!