Horace He
Working at the intersection of ML and Systems @ PyTorch "My learning style is Horace twitter threads" - @typedfemale
Dec 7, 2023 21 tweets 9 min read
As mentioned previously, I found AlphaCode2 accounts, and through stalking their submission history, I manually performed the AlphaCode2 Codeforces evals.

Overall, very impressive! I arrive at a rating of ~1650, which is the 85-90th percentile of CF users.

(1/19)
I am somewhat concerned about data leakage - see the follow-up thread below. This is an AlphaCode2 contributor's response.

For the purposes of this analysis I'll take the results at face value.

(2/19)
Dec 7, 2023 7 tweets 3 min read
I reverse-engineered AlphaCode2's submission history and manually performed the Codeforces evals.

I'm ... again concerned that data leakage is affecting the results.

For the DP problem highlighted in the AlphaCode2 release, look at AC2's solution vs. the tutorial.

(1/5)

It's admittedly a bit difficult to know for sure, since DP problems sometimes tend to have more formulaic solutions.

But let's look at 3 user solutions from the leaderboard ... and now let's look at 3 solutions from AlphaCode2.

There's much more diversity in the user code.
(2/5)
Nov 30, 2023 17 tweets 7 min read
Happy to OSS gpt-fast, a fast and hackable implementation of transformer inference in <1000 lines of native PyTorch, with support for quantization, speculative decoding, tensor parallelism (TP), Nvidia/AMD GPUs, and more!

Code: github.com/pytorch-labs/g…
Blog: pytorch.org/blog/accelerat…

(1/12)
If we just start out with a naive implementation of transformer inference in PyTorch, the performance is ... not great, at about 25 tok/s. Looking at a trace, the number one reason is that it's heavily *overhead bound*. To fix this, we can apply torch.compile!

(2/12)
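To make that concrete, here's a minimal sketch of that step (a toy MLP standing in for the model - not the actual gpt-fast code): wrap the forward pass in torch.compile with mode="reduce-overhead", which captures the work into CUDA graphs so the many tiny kernels of batch-size-1 decoding stop paying per-launch CPU overhead.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer decode step (not gpt-fast's actual model).
model = nn.Sequential(
    nn.Linear(4096, 4 * 4096, bias=False),
    nn.GELU(),
    nn.Linear(4 * 4096, 4096, bias=False),
).half().cuda().eval()

x = torch.randn(1, 4096, device="cuda", dtype=torch.half)

# mode="reduce-overhead" uses CUDA graphs, removing the per-kernel launch
# overhead that dominates when each kernel does very little work.
compiled = torch.compile(model, mode="reduce-overhead")

with torch.no_grad():
    for _ in range(3):          # the first calls compile and capture the graph
        out = compiled(x)
```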
Mar 14, 2023 9 tweets 3 min read
I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces.

Of the easiest problems on Codeforces, it solved 10/10 pre-2021 problems and 0/10 recent problems.

This strongly points to contamination.

1/4

800-rated problems are the easiest problems on Codeforces, and their ratings are determined automatically based on the ratings of the people who solved them during the contest. Thus, I would expect these problems to be of roughly "equal" difficulty, and my spot check agrees.

2/4
Feb 27, 2023 19 tweets 7 min read
Recently, Karpathy tweeted that *increasing* the size of his matmul made it run faster.

But... why? Many people seem content to leave this as black magic. But luckily, this *can* be understood!

Here's a plot of FLOPs achieved for square matmuls. Let's explain each curve!

1/19

There are 3 concepts needed to explain the above graph - compute intensity, tiling, and wave quantization.

Compute intensity (along with increased parallelism) is the primary factor explaining why matmuls generally perform better as you increase the size.

2/19
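If you want to reproduce a curve like this yourself, here's a rough sketch (my own timing script, not the one used for the plot): time square fp16 matmuls across a range of sizes and convert each measurement into achieved TFLOPS.

```python
import time
import torch

def achieved_tflops(n, iters=20, dtype=torch.half):
    A = torch.randn(n, n, device="cuda", dtype=dtype)
    B = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(3):                       # warm-up
        A @ B
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        A @ B
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters
    return 2 * n**3 / seconds / 1e12         # an n x n matmul is ~2*n^3 FLOPs

for n in range(256, 4097, 256):
    print(f"{n:5d}  {achieved_tflops(n):6.1f} TFLOPS")
```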
Feb 1, 2023 13 tweets 6 min read
Let's talk about a detail that occurs during PyTorch 2.0's codegen - tiling.

In many cases, tiling is needed to generate efficient kernels. Even for something as basic as torch.add(A, B), you might need tiling to be efficient! But what is tiling? And when is it needed?

(1/13)

To explain tiling, we first need to understand hardware memory accesses. Memory doesn't transfer elements one at a time - it transfers large "chunks". That is, even if you only need one element, the GPU will load that element... and the 31 elements next to it.

(2/13)
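As a toy illustration of when this matters (my example, not one from the thread): add a contiguous tensor to a transposed view. Whatever loop order an elementwise kernel picks, one of the two inputs is read with a large stride, so its loads waste most of each memory chunk - unless the kernel is tiled so that both inputs are read in coalesced blocks.

```python
import torch

A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda").t()   # transposed *view*: strides are (1, 4096)

# A is contiguous along its rows, but B's "rows" are strided across memory.
# A naive elementwise kernel gets coalesced reads for one input and scattered
# reads for the other; a tiled kernel processes square blocks so both inputs
# are loaded in full memory chunks.
C = torch.add(A, B)
```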
Jan 21, 2023 9 tweets 4 min read
Another thing PyTorch 2.0 helps speed up - overhead.

Overhead is everything other than the GPU doing work. It can come from Python, the ML framework, CUDA kernel launches, etc. - regardless, it's why your nvidia-smi util is so low!

So... how do we diagnose and resolve it?
(1/8)

Let's look at a simple example - resnet18 inference with a single image. We see that we achieve about 2ms latency - not great for this model.

But what's going on? Let's take a look at a profile trace to find out...

(2/8)
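Here's a sketch of how to capture such a trace yourself with the built-in profiler (not the exact commands from the thread):

```python
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(3):                       # warm-up so we don't profile one-time setup
        model(x)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)

# Open the exported trace in chrome://tracing or Perfetto: if the GPU row is
# mostly empty space between tiny kernels, the model is overhead-bound rather
# than compute-bound.
prof.export_chrome_trace("resnet18_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```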
Dec 10, 2022 10 tweets 3 min read
Eager mode was what made PyTorch successful. So why did we feel the need to depart from eager mode in PyTorch 2.0?

Answer: it's the damn hardware!

Let's tell a story about how the assumptions PyTorch was built on became untrue, and why PyTorch needed to evolve. (1/10)

When PyTorch first came out, the prevailing wisdom was that eager mode sacrificed performance for more flexibility.

But in practice, people didn't really notice performance gaps! In fact, in many cases, people found PyTorch to be *faster* than graph-mode frameworks.

Why? (2/10)
Oct 16, 2022 8 tweets 3 min read
I've found it unexpectedly useful to memorize facts about systems I work with.

Knowing these numbers allows one to 1. sanity check performance, 2. sketch out feasibility of technical solutions, and 3. reason about performance characteristics.

Some examples below: (1/7)

GPT3 sequence length? 2048. GPT3 hidden dim? ~12000.

Attention FLOPs scale with seq^2 * d_model. Feedforward FLOPs scale with seq * d_model^2. Since the hidden dim is >> sequence length, of course attention is a negligible component of GPT3 FLOPs: (2/7)
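A back-of-the-envelope check (my arithmetic, counting only the dominant matmuls and using d_model = 12288 for the largest GPT-3):

```python
seq, d_model = 2048, 12288                     # GPT-3 175B context length and hidden dim

# Per layer, for one full sequence (a multiply-add counted as 2 FLOPs):
attn_scores  = 2 * 2 * seq**2 * d_model        # QK^T plus attention-weights @ V
qkv_and_proj = 2 * 4 * seq * d_model**2        # Q, K, V and output projections
ffn          = 2 * 2 * seq * d_model * (4 * d_model)  # two feedforward matmuls, 4x expansion

total = attn_scores + qkv_and_proj + ffn
print(f"seq^2-scaling attention share: {attn_scores / total:.1%}")   # roughly 3% of layer FLOPs
```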
Sep 7, 2022 10 tweets 4 min read
Ever since the V100, Nvidia has been cramming more and more "tensor cores" into each GPU generation.

But what *are* tensor cores? How can you use them to accelerate deep learning models by >10x?

And ... why does their existence make me somewhat sad :(

(1/9)

Tensor cores, put simply, are "hardware hard-coded for matrix multiplication".

To be clear, that doesn't mean something abstract like "matrix multiplications are embarrassingly parallel" - there's literally a hardware instruction (mma.sync) that allows you to use them.

(2/9)
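To see the gap yourself, here's a rough sketch (my own illustration, not a benchmark from the thread) of how the same matmul lands on different hardware units depending on dtype:

```python
import torch

# Whether a matmul hits the tensor cores is mostly a dtype question.
torch.backends.cuda.matmul.allow_tf32 = False   # full-precision fp32 -> regular CUDA cores
A32 = torch.randn(4096, 4096, device="cuda")
C32 = A32 @ A32                                 # runs on the fp32 pipeline

A16 = A32.half()
C16 = A16 @ A16                                 # cuBLAS dispatches this to tensor-core kernels

# On an A100 the advertised peak is ~19.5 fp32 TFLOPS vs ~312 fp16 tensor-core
# TFLOPS, which is where the ">10x" speedup for matmul-heavy models comes from.
```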
Aug 14, 2022 9 tweets 3 min read
Why is OpenAI's new compiler, Triton, so exciting? And what distinguishes it from other efforts to provide a Python DSL for programming Nvidia GPUs, like Numba?

To answer that, we need to look at the operation behind all of deep learning - matrix multiplication. (1/7)

In neural networks, matmuls usually take up >99% of the computational cost. Everything else is a rounding error.

Correspondingly, modern accelerators (GPUs and otherwise) are built to perform matrix multiplications quickly. (2/7)
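For a sense of what the programming model looks like, here's the canonical Triton vector-add (adapted from Triton's introductory tutorial, not from this thread): you write the kernel at the granularity of blocks of elements, and Triton handles the thread-level details.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(100_000, device="cuda")
y = torch.randn(100_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```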
Jun 27, 2022 4 tweets 3 min read
Do you like einops? Do you like NamedTensors? Do you struggle with Numpy-style positional-based indexing?

Check out first class dimensions (from @Zachary_DeVito )! It unifies einops/named dims under one concept, and allows for even more!

Here are some neat examples (1/4)

Here's a bunch of examples I found cool.

Last week, I saw @ch402 's request for an "einsum for gather", and I realized that first-class dims already handled it.

Moreover, it worked for the "2d version of gather" requested as well (2/4)
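For a flavor of the API, here's roughly what a matmul written with first-class dimensions looks like (reconstructed from memory of the functorch.dim prototype; the import path and method names here are assumptions and may differ from the current API):

```python
import torch
from functorch.dim import dims        # first-class dimensions prototype (path assumed)

i, j, k = dims(3)                      # three first-class dimension objects

A = torch.randn(64, 32)
B = torch.randn(32, 128)

# Indexing with dims names each axis; arithmetic broadcasts over named dims,
# .sum(k) reduces over a named dim, and .order(i, j) lays the result back out
# as an ordinary positional tensor of shape (64, 128).
C = (A[i, k] * B[k, j]).sum(k).order(i, j)

assert torch.allclose(C, A @ B, atol=1e-4)
```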

Oct 6, 2021 4 tweets 1 min read
openreview.net/forum?id=TVHS5…
A very ... interesting 4 page paper at ICLR. I'm curious to see the reviewers' reactions.
Nov 5, 2019 5 tweets 3 min read
Some metadata for those curious about their #ICLR2020 reviews.

1. Histogram of the average reviews.
2. Top x% deciles

Seems like reviews this year at @iclr_conf are substantially lower than previous years. Probably an artifact of the new [1,3,6,8] reviewing system. (1/n)

For reviewer experience:
Out of 7583 total #ICLR2020 reviews:
1078 "do not know much about this area"
2484 "have read many papers in this area"
2604 "have published 1 or 2 papers"
1417 "have published in this field for many years"

47% of reviews come from reviewers who haven't published in this area!