First, I've seen that one of the most common responses is that anyone criticising the original post clearly doesn't understand it and is ignorant of how language models work.
Aidan Gomez is an author of the Transformers paper, and is CEO of Cohere. I think he understands fine.
So why haven't we seen clear explanations of why the "checking for sudden drops in the loss function and suspending training" comment is so ludicrous?
Well, the problem is that it's such a bizarre idea that it's not even wrong. It's nonsensical. Which makes it hard to refute.
To understand why, we need to understand how these models work. There are two key steps: training, and inference.
Training a model involves calculating the derivatives of the loss function with respect to the weights, and using those to update the weights to decrease loss.
Inference involves taking a model that has gone through the above process (called "back propagation") many times and then calculating activations from that trained model using new data.
Neither training nor inference can have any immediate impact on the world. They are simply calculating the parameters of a mathematical function, or using those parameters to calculate the result of a function.
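To make that concrete, here's a minimal PyTorch sketch of the two steps, using a toy model and random data. Nothing in it touches anything outside the process:

```python
import torch
from torch import nn

# Toy model: both "training" and "inference" are just maths on tensors.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Training step: compute the loss, backpropagate to get the derivatives of
# the loss w.r.t. the weights, then update the weights to decrease the loss.
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = loss_fn(model(x), y)
loss.backward()    # derivatives of the loss w.r.t. the weights
opt.step()         # update the weights
opt.zero_grad()

# Inference: use the trained weights to calculate activations on new data.
with torch.no_grad():
    preds = model(torch.randn(5, 10))
```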
Therefore, we don't need to check for sudden drops in the loss function and suspend training, because the training process has no immediate impact on the outside world.
The only time that a model can impact anything is when it's *deployed* - that is, it's made available to people or directly to external systems, being provided data, making calculations, and then those results being used in some way.
So in practice, models are *always* tested after training and before deployment: we check how they operate on new data, and how their outputs behave when used in some process.
Now of course, if we'd seen during training that our new model has much lower loss than we've seen before, whilst we wouldn't "suspend training", we would of course check the model's practical performance extra carefully. After all, maybe it was a bug? Or maybe it's more capable?
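For what it's worth, here's a toy sketch of what that looks like in practice. The threshold is arbitrary, and the point is to flag a checkpoint for extra testing later, not to suspend anything:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

prev_loss, flagged_steps = None, []
for step in range(1000):
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    loss = loss_fn(model(x), y)
    loss.backward(); opt.step(); opt.zero_grad()
    # An unusually large drop in the loss? Don't "suspend training" --
    # just note the step so the resulting checkpoint gets extra testing.
    if prev_loss is not None and loss.item() < 0.5 * prev_loss:  # arbitrary threshold
        flagged_steps.append(step)
    prev_loss = loss.item()
```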
But saying "we should test our trained models before deploying them" is telling no-one anything new whatsoever. We all know that, and we all do that.
Figuring out better ways to test models before deployment is an active and rich research area.
OTOH, "check for sudden drops in the loss function and suspend training" sounds much more exciting.
Problem is, it's not connected with the real world at all.
Some folks have pointed out that "drops in the loss function" is a pretty odd way to phrase things. It's actually just "drops in the loss".
An AI researcher saying "drops in the loss function" is a bit like a banker saying "ATM machine" - maybe a slip, maybe incompetence.
PS: please don't respond to this thread with "OK the exact words don't make sense, but if we wave our hands we can imagine he really meant some different set of words that if we squint kinda do make sense".
I don't know why some folks respond like this *every* *single* *time*.
PPS: None of this is to make any claim as to the urgency or importance of working on AI alignment. However, if you believe AI alignment is important work, I hope you'll agree that it's worth discussing with intellectual rigor and with a firm grounding of basic principles.
Sometimes it feels like NLP papers prior to 2020 don't exist...
(Bidirectional autoregressive models have been common for many years, and were for instance used in ULMFiT.)
AFAIK the first bidirectional RNN was from 1997. (Although it was popularised in Alex Graves' classic 2013 paper "Generating Sequences With Recurrent Neural Networks" I think.) ieeexplore.ieee.org/document/650093
@NguynTu24128917 might be worth updating your paper with some extra citations and background around this?
Our new course, "From Deep Learning Foundations to Stable Diffusion", is finally done after 8 months of work!!!
With >30 hours of video content (all free, no ads!), you'll learn how to create and train a Stable Diffusion model starting from pure Python 🧵 fast.ai/posts/part2-20…
The field was moving rapidly as we were developing and teaching the course, so many lessons include a walk-through of a paper that had just been released.
We also implement key papers that aren't in Stable Diffusion, such as Karras et al (2022) arxiv.org/abs/2206.00364
I wouldn't have been able to keep up with all this research without the fantastic help of folks from @StabilityAI, @huggingface, and the generative AI community. @iScienceLuvr and @johnowhitaker even joined me to teach some lessons together, which was a blast!
There are a lot of folks under the misapprehension that it's now possible to run a 30B param LLM in <6GB, based on this GitHub discussion.
This is not the case. Understanding why gives us a chance to learn a lot of interesting stuff! 🧵 github.com/ggerganov/llam…
The background is that the amazing @JustineTunney wrote this really cool commit for @ggerganov's llama.cpp, which modifies how llama models are loaded into memory to use mmap github.com/ggerganov/llam…
Prior to this, llama.cpp (and indeed most deep learning frameworks) loaded the weights of a neural network by reading the file containing the weights and copying the contents into RAM. This is wasteful, since a lot of bytes are moved around before you can even use the model.
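Roughly, in Python terms (llama.cpp itself is C/C++, and "weights.bin" here is just a made-up file name):

```python
import mmap
import numpy as np

# The "copy into RAM" approach: every byte of the weights file is read and
# duplicated into a fresh buffer before the model can be used.
with open("weights.bin", "rb") as f:
    buf = f.read()                                       # full copy into process memory
    weights_copy = np.frombuffer(buf, dtype=np.float32)

# The mmap approach: map the file into the address space and let the OS page
# data in lazily as it's touched; pages can also be shared between processes.
with open("weights.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    weights_view = np.frombuffer(mm, dtype=np.float32)   # no upfront copy
```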
Intriguing new study from the amazing Adriaan Bax and team suggests that most covid deaths resulted from (preventable) snoring droplets rather than (unpreventable) microaspiration. This could be a game changer.
Infection of the lung with SARS-CoV-2 is a two-step process: first the nose / throat, then the lungs. Postulated, but physically implausible, mechanism for step 2 involves “microaspiration”
Microaspiration during sleep is the accepted “hand-waving” mechanism for transfer of microbes from the oral cavity into the lung
After just 2 weeks of the new @fastdotai course, our students are already making research advances in Stable Diffusion.
@sebderhy developed a novel yet simple modification to classifier-free guidance that gives better results (previous approach on left, new approach on right)
@fastdotai @sebderhy I think in this case there's room to improve the results even further. The basic idea being tackled is that the "old way" of doing guidance actually increases the scale of the update (especially if the difference between conditional and unconditional embeddings is large).
So the trick is to add the guidance without changing the scale.
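Here's a rough sketch of the general idea -- not necessarily @sebderhy's exact formulation; rescaling back to the conditional prediction's norm is just one way to keep the update scale fixed:

```python
import torch

def guided_pred(cond, uncond, g=7.5):
    # Standard classifier-free guidance: when g is large (or cond and uncond
    # differ a lot), the guided prediction can end up with a much larger norm.
    return uncond + g * (cond - uncond)

def guided_pred_rescaled(cond, uncond, g=7.5):
    # Apply the guidance direction, then rescale the result back to the norm
    # of the conditional prediction, so the overall scale is unchanged.
    guided = uncond + g * (cond - uncond)
    return guided * (cond.norm() / guided.norm())
```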
We just released the first 5.5 hours of our new course "From Deep Learning Foundations to Stable Diffusion", for free! fast.ai/posts/part2-20…
Lesson 9 starts with a tutorial on how to use pipelines in the Diffusers library to generate images. We show some nifty tweaks like guidance scale and textual inversion.
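If you haven't used Diffusers pipelines before, a call looks something like this (the model id and prompt are just examples):

```python
import torch
from diffusers import StableDiffusionPipeline

# Example model id and prompt -- swap in whatever you like.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# guidance_scale controls how strongly the image follows the prompt.
image = pipe("an astronaut riding a horse", guidance_scale=7.5).images[0]
image.save("astronaut.png")
```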
The second half of the lesson shows the key concepts involved in Stable Diffusion.
Lesson 9A (by @johnowhitaker) shows what is happening behind the scenes, looking at the components and processes and how each can be modified for control over generation.