Jeremy Howard
Apr 28 · 16 tweets · 3 min read
I'm seeing a lot of people confused about this - asking: what exactly is the problem here? That's a great question!

Let's use this as a learning opportunity and dig in. 🧵
First, I've seen that one of the most common responses is that anyone criticising the original post clearly doesn't understand it and is ignorant of how language models work.

Aidan Gomez is an author of the Transformer paper ("Attention Is All You Need"), and is CEO of Cohere. I think he understands fine.
So why haven't we seen clear explanations of why the "checking for sudden drops in the loss function and suspending training" comment is so ludicrous?

Well, the problem is that it's such a bizarre idea that it's not even wrong. It's nonsensical. Which makes it hard to refute.
To understand why, we need to understand how these models work. There are two key steps: training, and inference.

Training a model involves calculating the derivatives of the loss function with respect to the weights, and using those to update the weights to decrease loss.
Inference involves taking a model that has gone through the above process (called "back propagation") many times and then calculating activations from that trained model using new data.
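The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration (a one-parameter linear model, not a language model): training computes the derivative of the loss with respect to the weight and nudges the weight downhill; inference is just a function evaluation with the learned parameter.

```python
# Minimal sketch of training vs inference, using a one-parameter model y = w * x
# fit with mean squared error. No framework needed; the point is that both steps
# are pure calculation with no effect on the outside world.

def loss(w, xs, ys):
    # mean squared error of the model's predictions
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    # derivative of the loss with respect to the weight w
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr=0.1, steps=100):
    # gradient descent: repeatedly update w to decrease the loss
    # (in a deep network, grad() is computed by back propagation)
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, xs, ys)
    return w

def infer(w, x_new):
    # inference: evaluate the trained function on new data
    return w * x_new

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # toy data generated by y = 2x
w = train(xs, ys)
print(round(w, 3))              # ≈ 2.0
print(round(infer(w, 4.0), 3))  # ≈ 8.0
```

Neither `train` nor `infer` does anything but arithmetic; that is the whole point of the argument that follows.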
Neither training nor inference can have any immediate impact on the world. They are simply calculating the parameters of a mathematical function, or using those parameters to calculate the result of a function.
Therefore, we don't need to check for sudden drops in the loss function and suspend training, because the training process has no immediate impact on the outside world.
The only time that a model can impact anything is when it's *deployed* - that is, it's made available to people or directly to external systems, being provided data, making calculations, and then those results being used in some way.
So in practice, the way that models are *always* deployed is that after training, they are tested, to see how they operate on new data, and how their outputs work when used in some process.
Now of course, if we'd seen during training that our new model has much lower loss than we've seen before, whilst we wouldn't "suspend training", we would of course check the model's practical performance extra carefully. After all, maybe it was a bug? Or maybe it's more capable?
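What "checking extra carefully" could look like in practice is just ordinary monitoring: a hypothetical helper (the function name and thresholds here are mine, purely illustrative) that flags checkpoints whose loss is anomalously low, so they get extra evaluation before any deployment, rather than suspending anything.

```python
# Illustrative sketch (not anyone's actual pipeline): flag training steps where
# the loss drops far below the recent average, marking those checkpoints for
# extra-careful evaluation. Maybe it's a bug; maybe the model is more capable.

def flag_surprising_checkpoints(losses, window=5, threshold=0.5):
    """Return indices where the loss fell below `threshold` times the average
    of the previous `window` values."""
    flagged = []
    for i in range(window, len(losses)):
        recent_avg = sum(losses[i - window:i]) / window
        if losses[i] < threshold * recent_avg:
            flagged.append(i)
    return flagged

history = [2.0, 1.9, 1.85, 1.8, 1.78, 0.6, 1.75]  # step 5 is anomalous
print(flag_surprising_checkpoints(history))  # [5]
```

Note that the output of this check is "evaluate this checkpoint carefully", not "halt the run": the loss curve itself has no effect on the world either way.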
But saying "we should test our trained models before deploying them" is telling no-one anything new whatsoever. We all know that, and we all do that.

Figuring out better ways to test models before deployment is an active and rich research area.
OTOH, "check for sudden drops in the loss function and suspend training" sounds much more exciting.

Problem is, it's not connected with the real world at all.
Some folks have pointed out that "drops in the loss function" is a pretty odd way to phrase things. It's actually just "drops in the loss".

An AI researcher saying "drops in the loss function" is a bit like a banker saying "ATM machine" - maybe a slip, maybe incompetence.
PS: please don't respond to this thread with "OK the exact words don't make sense, but if we wave our hands we can imagine he really meant some different set of words that if we squint kinda do make sense".

I don't know why some folks respond like this *every* *single* *time*.
PPS: None of this is to make any claim as to the urgency or importance of working on AI alignment. However, if you believe AI alignment is important work, I hope you'll agree that it's worth discussing with intellectual rigor and with a firm grounding of basic principles.


More from @jeremyphoward

Apr 28
Sometimes it feels like NLP papers prior to 2020 don't exist...

(Bidirectional autoregressive models have been common for many years, and were for instance used in ULMFiT.)
AFAIK the first bidirectional RNN was from 1997. (Although it was popularised in Alex Graves's classic 2013 paper "Generating Sequences With Recurrent Neural Networks" I think.)
ieeexplore.ieee.org/document/650093
@NguynTu24128917 might be worth updating your paper with some extra citations and background around this?
Apr 5
Our new course, "From Deep Learning Foundations to Stable Diffusion", is finally done after 8 months of work!!!

With >30 hours of video content (all free, no ads!), you'll learn how to create and train a Stable Diffusion model starting from pure Python 🧵
fast.ai/posts/part2-20…
This field was developing rapidly as we were developing and teaching the course, so many lessons include a walk-through of a paper that had just been released.

We also implement key papers that aren't in Stable Diffusion, such as Karras et al (2022)
arxiv.org/abs/2206.00364
I wouldn't have been able to keep up with all this research without the fantastic help of folks from @StabilityAI, @huggingface, and the generative AI community. @iScienceLuvr and @johnowhitaker even joined me to teach some lessons together, which was a blast!
Apr 3
There's a lot of folks under the misunderstanding that it's now possible to run a 30B param LLM in <6GB, based on this GitHub discussion.

This is not the case. Understanding why gives us a chance to learn a lot of interesting stuff! 🧵
github.com/ggerganov/llam…
The background is that the amazing @JustineTunney wrote this really cool commit for @ggerganov's llama.cpp, which modifies how llama models are loaded into memory to use mmap
github.com/ggerganov/llam…
Prior to this, llama.cpp (and indeed most deep learning frameworks) load the weights of a neural network by reading the file containing the weights and copying the contents into RAM. This is wasteful since a lot of bytes are moving around before you can even use the model
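The difference between the two loading strategies can be sketched in Python's standard library (this is illustrative only, not llama.cpp's actual C++ code): `read()` copies the whole file into the process's memory up front, while `mmap` maps the file into the address space and lets the OS page bytes in lazily, only when they are touched.

```python
# Sketch: eager copy vs lazy memory-mapping of a stand-in "weights" file.
import mmap
import os
import tempfile

# Create a 1 MiB file of zeros as a stand-in for a model weights file.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * (1 << 20))
os.close(fd)

# Eager approach: read() allocates a full-size buffer and copies every byte
# before the "model" can be used at all.
with open(path, "rb") as f:
    weights_copy = f.read()

# Lazy approach: mmap just maps the file; pages are faulted in on first
# access, and the same mapping can be shared between processes.
with open(path, "rb") as f:
    weights_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_byte = weights_map[0]  # touching one byte pages in only that region
    weights_map.close()

os.remove(path)
print(len(weights_copy), first_byte)  # 1048576 0
```

With mmap, file-backed pages also don't count against memory the same way an anonymous copy does, which is why memory-usage numbers can look misleadingly small.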
Nov 21, 2022
Intriguing new study from the amazing Adriaan Bax and team suggests that most covid deaths resulted from (preventable) snoring droplets rather than (unpreventable) microaspiration. This could be a game changer.

No time for the paper? Then read this 🧵!
sciencedirect.com/science/articl…
Infection of the lung with SARS-CoV-2 is a two-step process: first the nose / throat, then the lungs. Postulated, but physically implausible, mechanism for step 2 involves “microaspiration”
Microaspiration during sleep is the accepted “hand-waving” mechanism for transfer of microbes from the oral cavity into the lung
Oct 24, 2022
After just 2 weeks of the new @fastdotai course, our students are already making research advances in Stable Diffusion.

@sebderhy developed a novel yet simple modification to classifier-free guidance that gives better results (previous approach on left, new approach on right)
@fastdotai @sebderhy I think in this case there's room to improve the results even further. The basic idea being tackled is that the "old way" of doing guidance actually increased the scale of the update (especially if the difference between conditional and unconditional embeddings is large)
So the trick is to add the guidance without changing the scale.
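One way that idea might look in code (a hypothetical illustration of the general principle, not @sebderhy's exact method): compute the usual classifier-free guidance update, then rescale it so its magnitude matches the conditional prediction, so guidance changes direction without inflating scale.

```python
# Illustrative sketch of rescaled classifier-free guidance on plain vectors.
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cfg(uncond, cond, scale):
    # standard classifier-free guidance: uncond + scale * (cond - uncond);
    # a large scale inflates the magnitude of the result
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

def cfg_rescaled(uncond, cond, scale):
    # hypothetical fix: apply guidance, then rescale so the result keeps the
    # conditional prediction's norm (guidance steers direction only)
    g = cfg(uncond, cond, scale)
    k = norm(cond) / norm(g)
    return [x * k for x in g]

uncond, cond = [0.0, 1.0], [1.0, 1.0]
print(round(norm(cfg(uncond, cond, 7.5)), 3))           # inflated: 7.566
print(round(norm(cfg_rescaled(uncond, cond, 7.5)), 3))  # matches norm(cond): 1.414
```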
Oct 20, 2022
I got a special surprise for you all...

We just released the first 5.5 hours of our new course "From Deep Learning Foundations to Stable Diffusion", for free!
fast.ai/posts/part2-20…
Lesson 9 starts with a tutorial on how to use pipelines in the Diffusers library to generate images. We show some nifty tweaks like guidance scale and textual inversion.

The second half of the lesson shows the key concepts involved in Stable Diffusion.
Lesson 9A (by @johnowhitaker) shows what is happening behind the scenes, looking at the components and processes and how each can be modified for control over generation.
