Python is the language that I have used for nearly all my work over the last few years. It is a beautiful language. It has an elegant core on which everything else is built.
But it comes with a downside: performance. It can be thousands of times slower than C.
Python programmers learn to avoid Python for performance-critical sections, instead using Python wrappers over code written in C, Fortran, Rust, etc.
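To make the gap concrete, here's a toy comparison (my own illustration, not from the thread): summing squares with a pure-Python loop versus handing the same loop to numpy's compiled C internals.

```python
import time
import numpy as np

xs = list(range(10_000_000))
arr = np.arange(10_000_000, dtype=np.int64)

t0 = time.perf_counter()
total = sum(x * x for x in xs)      # every multiply goes through the interpreter
t1 = time.perf_counter()
total_np = int((arr * arr).sum())   # the loop runs inside numpy's compiled C code
t2 = time.perf_counter()

assert total == total_np
print(f"pure Python: {t1 - t0:.2f}s   numpy: {t2 - t1:.2f}s")
```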
But this “two-language” approach has serious downsides. It's complex, hard to debug or profile, and hard to deploy.
Also, it leaves a lot of performance on the table. That's why @PyTorch, @TensorFlow, and #jax don't use Python for anything fast: they use separate compilers for Python DSLs or subsets. pytorch.org/tutorials/inte…
Mojo is "syntax sugar for MLIR". It has a small foundation which basically provides a simple way to access MLIR from a Python-like language, and then everything else is written on top of that. mlir.llvm.org
A Mojo trick is to opt in to a faster “mode” by using “fn” instead of “def”; as a result, Mojo can create optimised machine code to implement your function. Similarly, use “struct” instead of “class” to pack your attributes tightly in memory and avoid pointer chasing.
Mojo isn't finished - but what's there is already mind-blowing, and it has been created by a very small team in a very short time. This shows the benefits of using carefully architected foundations, based on @clattner_llvm's years of experience with Clang, LLVM, and Swift.
There are lots of other great approaches to getting high performance along with the benefits of an elegant programming language, including @JuliaLanguage, #cython, and @numba_jit.
These all have their place, but they're not perfect. For example, here are my thoughts on Julia.
In particular, Mojo is the first to solve deployment.
A Mojo app can be compiled into a small, standalone, fast-starting binary. This is a game changer! Think about what you could do if you could create small, fast tools quickly and easily, and distribute them in a single file.
Mojo is *far more* than a language for AI/ML applications. It’s actually a version of Python that allows us to write fast, small, easily-deployed applications that take advantage of all available cores and accelerators! modular.com/mojo
If you want to know more or have any questions, check out the full blog post, which has much more detail than this thread: fast.ai/posts/2023-05-…
First, one of the most common responses I've seen is that anyone criticising the original post clearly doesn't understand it and is ignorant of how language models work.
Aidan Gomez is an author of the Transformer paper and the CEO of Cohere. I think he understands fine.
So why haven't we seen clear explanations of why the "checking for sudden drops in the loss function and suspending training" comment is so ludicrous?
Well, the problem is that it's such a bizarre idea that it's not even wrong. It's nonsensical, which makes it hard to refute.
Sometimes it feels like NLP papers prior to 2020 don't exist...
(Bidirectional autoregressive models have been common for many years, and were for instance used in ULMFiT.)
AFAIK the first bidirectional RNN was from 1997. (Although I think it was popularised by Alex Graves's classic 2013 paper "Generating Sequences With Recurrent Neural Networks".) ieeexplore.ieee.org/document/650093
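For readers who haven't met the idea: a bidirectional RNN runs over the sequence in both directions and concatenates the two hidden states at each step. A minimal PyTorch sketch (my own illustration, with arbitrary sizes):

```python
import torch
import torch.nn as nn

# bidirectional=True runs one LSTM left-to-right and another
# right-to-left, concatenating their hidden states per timestep.
rnn = nn.LSTM(input_size=16, hidden_size=32, bidirectional=True, batch_first=True)
x = torch.randn(4, 10, 16)   # (batch, seq_len, features)
out, _ = rnn(x)
print(out.shape)             # torch.Size([4, 10, 64]) == 2 * hidden_size
```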
@NguynTu24128917 might be worth updating your paper with some extra citations and background around this?
Our new course, "From Deep Learning Foundations to Stable Diffusion", is finally done after 8 months of work!!!
With >30 hours of video content (all free, no ads!), you'll learn how to create and train a Stable Diffusion model starting from pure Python 🧵 fast.ai/posts/part2-20…
The field was moving rapidly as we developed and taught the course, so many lessons include a walk-through of a paper that had just been released.
We also implement key papers that aren't part of Stable Diffusion itself, such as Karras et al. (2022) arxiv.org/abs/2206.00364
I wouldn't have been able to keep up with all this research without the fantastic help of folks from @StabilityAI, @huggingface, and the generative AI community. @iScienceLuvr and @johnowhitaker even joined me to teach some lessons together, which was a blast!
There are a lot of folks under the misunderstanding that it's now possible to run a 30B-param LLM in <6GB of RAM, based on this GitHub discussion.
This is not the case. Understanding why gives us a chance to learn a lot of interesting stuff! 🧵 github.com/ggerganov/llam…
The background is that the amazing @JustineTunney wrote this really cool commit for @ggerganov's llama.cpp, which modifies how llama models are loaded into memory to use mmap github.com/ggerganov/llam…
Prior to this, llama.cpp (like most deep learning frameworks) loaded the weights of a neural network by reading the weights file and copying its contents into RAM. This is wasteful, since a lot of bytes have to move around before you can even use the model.
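llama.cpp is C++, but the same idea is easy to sketch in Python (weights.bin is a hypothetical file of float16 weights, not an actual llama.cpp artifact):

```python
import mmap
import numpy as np

# The old approach: read() copies every byte out of the page cache
# into a fresh buffer before the weights can be used.
with open("weights.bin", "rb") as f:
    weights_copy = np.frombuffer(f.read(), dtype=np.float16)

# The mmap approach: map the file into the process's address space;
# the OS faults pages in lazily, with no up-front copy, and the same
# physical pages can be shared between processes. (The file and map
# must stay open for as long as the array view is in use.)
f = open("weights.bin", "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
weights_view = np.frombuffer(mm, dtype=np.float16)  # zero-copy view
```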
Intriguing new study from the amazing Adriaan Bax and team suggests that most covid deaths resulted from (preventable) snoring droplets rather than (unpreventable) microaspiration. This could be a game changer.
Infection of the lung with SARS-CoV-2 is a two-step process: first the nose and throat, then the lungs. The postulated, but physically implausible, mechanism for step 2 involves "microaspiration".
Microaspiration during sleep is the accepted "hand-waving" mechanism for the transfer of microbes from the oral cavity into the lung.
After just 2 weeks of the new @fastdotai course, our students are already making research advances in Stable Diffusion.
@sebderhy developed a novel yet simple modification to classifier-free guidance that gives better results (previous approach on left, new approach on right)
@fastdotai @sebderhy I think in this case there's room to improve the results even further. The basic idea being tackled is that the "old way" of doing guidance actually increases the scale of the update (especially if the difference between the conditional and unconditional embeddings is large).
So the trick is to add the guidance without changing the scale.
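To make that concrete, here's a minimal numpy sketch of the idea. Note this is my own illustration of norm-preserving guidance, not necessarily @sebderhy's exact modification; cond and uncond stand for the conditional and unconditional model predictions.

```python
import numpy as np

def cfg_standard(cond, uncond, scale=7.5):
    # Classic classifier-free guidance. When (cond - uncond) is large,
    # this update also inflates the magnitude of the prediction.
    return uncond + scale * (cond - uncond)

def cfg_norm_preserving(cond, uncond, scale=7.5, eps=1e-8):
    # One way to "add the guidance without changing the scale":
    # apply the guided update, then rescale the result so its overall
    # norm matches the conditional prediction's norm.
    guided = uncond + scale * (cond - uncond)
    return guided * np.linalg.norm(cond) / (np.linalg.norm(guided) + eps)
```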