Taco Cohen
May 14 21 tweets 3 min read
Agree with Nando that scaling will get us very far and will produce lots of useful tech, but there is at least one thing that may require some new ideas and insights... 1/n
That thing is getting to a high level of competence in a new or modified domain *quickly*, using relatively little (or sometimes no) labelled data/human-produced text/real-world interactive experience 2/n
People will point to transfer capabilities as evidence that this ability can come from scaling, and indeed it might just work. So far at least it has worked better than most people believed just a few years ago. Let's try and see. 3/n
But the argument that human evolution is similar to pre-training (so your life is just a matter of fine-tuning) and thus provides an existence proof that large-scale DL alone can bring human-level generalisation capabilities doesn't fly, for a very simple reason... 4/n
The information that evolution has equipped us with can be encoded in ~350 MB of DNA, whereas pre-trained nets are far, far bigger these days. Weights can be compressed, but not by many orders of magnitude. 5/n
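The gap is easy to check with back-of-the-envelope arithmetic. A sketch in Python, where the 175B-parameter fp16 model is an illustrative assumption of mine, not a figure from the thread:

```python
# Rough size comparison: genome-encoded prior vs. a large pretrained net.
# The model size is a hypothetical example: 175B parameters stored in fp16.
model_bytes = 175e9 * 2          # 175B params x 2 bytes each = 350 GB
dna_bytes = 350e6                # ~350 MB, the thread's estimate for DNA

ratio = model_bytes / dna_bytes  # how many times larger the weights are
print(ratio)                     # 1000.0 -> three orders of magnitude
```

Even a 10x lossless compression of the weights would leave a two-orders-of-magnitude gap, which is the point of the argument.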
So it seems that what evolution has equipped us with is some very general prior knowledge and learning algorithms that work robustly in a huge range of environments 6/n
These algorithms are data efficient, at least when it comes to "expensive data", like being taught by another person, trying dangerous things (climbing trees, driving cars), etc. They still end up training a huge model, the brain, but mostly from "cheap" data. 7/n
One could argue that we can find that 350 MB of prior+algo by meta-learning or something, and again it might just work. But so far I have not seen any overwhelming successes from e.g. neural architecture search or learned optimizers. 8/n
Everyone still uses handcrafted transformer/CNN architectures and Adam, so these learned architectures and optimizers can't be that much better, if they are indeed better at all. 9/n
More generally, it seems like the current DL approach is good at absorbing tons of information from data, but not that great at extracting "the essence". 10/n
By essence I mean something that takes few bits to describe, but explains a lot, and generalizes very far even if it cannot always be used to predict every detail. Grammars, laws of physics, causal mechanisms, symmetries, etc. 11/n
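A toy illustration of such a few-bit "essence": the statement "convolution commutes with translation" takes one line to write down, yet it holds for every input and every filter. A quick NumPy check (my choice of example, not the thread's), using circular convolution so the symmetry is exact:

```python
import numpy as np

def circ_conv(x, k):
    """Circular convolution via the FFT (convolution theorem)."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, n=len(x))))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # arbitrary signal
k = rng.standard_normal(16)   # arbitrary filter
shift = 5

# Translation equivariance: shifting the input and then convolving
# gives the same result as convolving first and shifting the output.
lhs = circ_conv(np.roll(x, shift), k)
rhs = np.roll(circ_conv(x, k), shift)
assert np.allclose(lhs, rhs)
```

The symmetry is a tiny description that constrains and generalizes across infinitely many concrete cases, which is what distinguishes it from a memorized table of input-output pairs.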
I'm currently exploring the hypothesis that something that could be called causality might help us address some of these limitations 12/n
So I am not a religious believer in scaling (I think it will be necessary but probably not sufficient for AGI / HLI), but I also disagree with the DL haters. 13/n
It is pretty clear that models will have to store tons of details/facts/regularities about the world, and there is no way we're going to program that by hand. DL does the job, and I don't see why we'd need something else for this. So new ideas had better be DL-compatible 14/n
Current causality frameworks & algos generally aren't. They typically require expert input about which variables to use, which causal relations might be present, etc. That's fine for applications in science, but not for autonomously learning AI agents. 15/n
Furthermore, many algorithms for e.g. causal discovery / graph learning rely on conditional independence testing and have terrible scaling behaviour (computationally and statistically) 16/n
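To make the combinatorics concrete: a naive constraint-based search over n variables tests every pair for independence conditioned on every subset of the remaining n-2 variables. The worst-case bound below is the standard one for this naive scheme, not a figure from the thread:

```python
from math import comb

def worst_case_ci_tests(n):
    """Worst-case number of conditional independence tests for naive
    constraint-based causal discovery: C(n, 2) variable pairs, each
    conditioned on any subset of the other n - 2 variables."""
    return comb(n, 2) * 2 ** (n - 2)

print(worst_case_ci_tests(10))   # 11520
print(worst_case_ci_tests(20))   # 49807360 -- already hopeless
```

Practical algorithms like PC prune the conditioning sets, but the cost still grows exponentially with graph degree, and each test needs enough samples to be reliable, hence the poor computational and statistical scaling.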
However, I think that it should be possible to make DL causality-aware using many of the primitives that have proven successful in modern DL/AI, i.e. no need to throw everything out as some causalists claim. 17/n
But we probably do need some shift in perspective and a couple new ideas. In particular, we need to do away with the idea that the ultimate goal of learning is just to fit a particular conditional probability distribution. 18/n
The distribution of internet text/images and even an agent's "life experience" is in some sense arbitrary (unless the experience is very deliberately sought out, as in the scientific method), and more importantly distributions will always continue to shift... 19/n
... but some underlying regularities that might be called laws or mechanisms appear to be more stable. Finding scalable algorithms that can autonomously discover and harness them seems like a wonderful challenge to me.
Curious to hear perspectives from anyone: are there any obvious flaws in this reasoning, or do you estimate the chances differently? But let's be dispassionate scientists and not ideologues :)