Shawn Presser
Looking for AI work. DMs open. ML discord: https://t.co/2J63isabrY projects: https://t.co/6XsuoK4lu0
Mar 31, 2023 10 tweets 3 min read
An anonymous donor named L has pledged $200k for llama-dl’s legal defense against Meta. I intend to initiate a DMCA counterclaim on the basis that neural network weights are not copyrightable.

It may seem obvious that NN weights should be subject to copyright. It’s anything but: The US copyright office recently upheld a decision that ML outputs cannot be copyrighted. There are several reasons for this, and as far as I can tell, all of them apply to NN weights as well: reuters.com/world/us/us-co…
Mar 6, 2023 9 tweets 3 min read
Fixed the llama sampler. After turning off top_p, setting top_k to 40, setting temp to 0.7, and adding a repetition penalty of 1/0.85, llama 7B is looking nice.
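
For reference, a minimal sketch of what those settings amount to, applied to a raw logits vector (the function and the exact penalty form are illustrative, not llama-dl's actual sampler):

import numpy as np

def sample_token(logits, prev_tokens, temp=0.7, top_k=40, penalty=1/0.85):
    logits = np.array(logits, dtype=np.float64)
    # Repetition penalty of 1/0.85 (~1.18): shrink the logit of every token
    # already emitted (divide positives, multiply negatives).
    for t in set(prev_tokens):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    logits /= temp                      # temp 0.7 sharpens the distribution
    top = np.argsort(logits)[-top_k:]   # top_k 40: keep only the 40 best candidates
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()                        # softmax over the survivors (top_p is off)
    return int(np.random.choice(top, p=p))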

I'll post 65B next, along with (hopefully) some big text files with lots of outputs.

That's more like it. Hello, 65B.

It's always remarkable to see just how important the settings are.

Taking Pip for a walk, then I'll be back to post more.
Feb 13, 2023 22 tweets 5 min read
@AlexSheng13 proposed an interesting idea in DMs which I think is worth studying. Suppose you could train every layer of your network simultaneously, without waiting for gradients. No backprop. In fact, no forward prop.

This sounds crazy, but there’s a clever way it could work. First, let’s ignore that this sounds impossible, and look at the benefits. What does this get us?

Our scale becomes infinite, because we can place every layer on a different device. In fact, they can be on different continents, and it wouldn’t harm training time.
Feb 6, 2022 12 tweets 3 min read
Being forced to learn Haskell had an upside: I’m able to reason about the type signatures of the functions I use, even in Python. I didn’t think that way before.

I’m less enthusiastic about Haskell than I am about other languages, but I was surprised there was any benefit at all. Why is this useful? And how is it different from my prior mental model?

Before, I thought of functions as little machines. So if I pass a function into map, it was similar to telling a Roomba to clean your house. Whether you ask a Roomba or a maid or clean it yourself, the end result is the same: a clean house.
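
One way to see the shift: map's Haskell signature spells out its shape, and Python's typing module can express the same thing. A hedged illustration (my example, not from the thread):

from typing import Callable, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")

# Haskell's map :: (a -> b) -> [a] -> [b], written as a Python signature.
def map_(f: Callable[[A], B], xs: Iterable[A]) -> Iterator[B]:
    return (f(x) for x in xs)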
Jun 5, 2021 9 tweets 5 min read
So, I'm a huge fan of FF7 speedrunning. There's a certain boss that has an 8% chance of killing you at the start of the fight. But speedrunner Caleb seems to die much more than 8% of the time.

To my delight, @AceZephyr1 made a *fully automated testing harness*. Incredible! The goal is to statistically verify whether Caleb's luck is worse than 8%. There might be something else going on. For example, FF7 uses a separate RNG for enemy encounter rate, and you can manipulate it by walking a certain number of steps in certain rooms.
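
Verifying "worse than 8%" is a one-sided binomial test. A sketch with scipy (the counts here are made up for illustration):

from scipy.stats import binomtest

# Hypothetical tally from the harness: 19 deaths in 120 fights vs. the 8% base rate.
result = binomtest(k=19, n=120, p=0.08, alternative="greater")
print(result.pvalue)  # a small p-value means the deaths are suspiciously frequent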
Jun 4, 2021 6 tweets 4 min read
Wow. I'm SSH'd into a TPU v3-8. It has 96 vCPUs and 335GB of RAM. Incredible. I installed npm:

snap install npm
npm i -g http-server
sudo http-server -p 80

Then I added Cloudflare DNS.

Presto: a 96-core NodeJS website (for the next 3h): tpu-121.gpt4.org/itworks.txt

It was so easy! If you haven't heard about SSH'ing into TPU VMs, it's a new feature that @jekbradbury's team recently released.

They've been working on this for quite some time. And holy moly, it was worth the wait.
Jun 3, 2021 7 tweets 4 min read
Discovery for my notes: I came up with a variant of FFT I call "FST" (for Fast Shawn Transform, ha)

- FST is its own inverse: fst(fst(x)) = x
- FST of an NxM signal returns NxM real numbers. No phase!
- FST is frequency space, just like FFT. Multiplication is convolution.

Code:

import numpy as np
from numpy.fft import fft, fft2

def area(x):
    # total sample count over the last two axes
    # (for a 1D signal, shape[-2:] is just (N,), so this is N)
    return np.prod(x.shape[-2:])

def fst(x):
    # multiply by (1 + 1j), FFT, keep the real part, normalize
    return fft((1 + 1j) * x).real / area(x) ** 0.5

def fst2(x):
    return fft2((1 + 1j) * x).real / area(x) ** 0.5

>>> fst(fst(np.arange(5)))
[0, 1, 2, 3, 4]
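
A quick sanity check that the involution holds in 2D as well (side note: Re(X) - Im(X) of an FFT is the discrete Hartley transform, so fst amounts to the DHT normalized by the square root of the sample count):

>>> img = np.random.rand(4, 4)
>>> np.allclose(fst2(fst2(img)), img)
True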
Jun 2, 2021 5 tweets 1 min read
So this is incredibly strange and cool. For my notes:

It's well-known that if you take the FFT of an NxN image, you only need NxN floats to recover the original image. But usually those floats are packed as (NxN)/2 complex numbers; rfft2's output is complex, for example.

I've discovered a real-only alternative. Here's how it works: suppose you have a picture of a cat. First, multiply the cat by (1 + 1j), so you end up with a complex array where both the .real and .imag parts are the cat image. Then take the FFT of that.
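
The round trip, spelled out (same fst2 construction as the Jun 3 thread above; the "cat" here is just random pixels):

import numpy as np
from numpy.fft import fft2

def fst2(x):
    # multiply by (1 + 1j), FFT, keep only the real part: NxN real numbers out
    return fft2((1 + 1j) * x).real / np.prod(x.shape[-2:]) ** 0.5

cat = np.random.rand(8, 8)         # stand-in for the cat picture
coeffs = fst2(cat)                 # NxN real coefficients, no phase to store
recovered = fst2(coeffs)           # the transform is its own inverse
assert np.allclose(recovered, cat)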
Oct 25, 2020 12 tweets 5 min read
Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data.

Now you do. Now everyone does.

Presenting "books3", aka "all of bibliotik"

- 196,640 books
- in plain .txt
- reliable, direct download, for years: the-eye.eu/public/AI/pile…

thread 👇

I wrote up some details here: github.com/soskek/bookcor…

In OpenAI's papers on GPT-2 and 3, you'll notice references to datasets named "books1" and "books2".

books1 appears to be bookcorpus, or similar.

But OpenAI will not release information about books2, which remains a crucial mystery.
May 28, 2020 14 tweets 4 min read
lol. So, we're doing some image processing with TPUs. We want to save the results directly to our cloud bucket, rather than having the results be transmitted to our VM, saved locally, then uploaded to our cloud bucket. Got a funny idea...

I guess this will be a ramble: TPUs support a limited number of operations, but what you get in exchange is blazing speed.

A TPU consists of 8 cores, plus a CPU. (Yes, the TPU has a CPU -- weird concept, but think of it like a big computer with 8 GPUs. Obviously, a computer with GPUs has a CPU.)
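
The "write straight to the bucket" half is easy on its own: TensorFlow's gfile accepts gs:// paths, so results never have to touch the VM's local disk. A sketch (bucket and filenames are hypothetical):

import numpy as np
import tensorflow as tf

# A stand-in for one processed image coming off the TPU.
image = tf.constant(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
encoded = tf.io.encode_jpeg(image)

# tf.io.gfile understands gs:// URLs; this writes directly into the bucket.
with tf.io.gfile.GFile("gs://my-bucket/results/out_0001.jpg", "wb") as f:
    f.write(encoded.numpy())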
Jan 31, 2020 10 tweets 3 min read
Success: I trained ResNet-50 on imagenet to 75.9% top-1 accuracy in 3.51 minutes using a 512-core TPUv3.

(480,000 images per second. 224x224 res JPG.)

Before you think highly of me, all I did was run Google’s code. It was hard though.

Logs: tensorboard.dev/experiment/jsD…

It uses the code from their official MLPerf imagenet benchmark: mlperf.org/training-resul…

(3.51 minutes for v3-512 is slightly faster than their posted results of 3.85min, too!)
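
Rough arithmetic on that throughput: 480,000 images/s × 3.51 min × 60 s/min ≈ 101M images processed, i.e. about 79 passes over ImageNet's ~1.28M training images.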