One takeaway for me from (#dalle2, #imagen, #flamingo) is there's no one "golden algorithm" to unlock these new transfer learning capabilities. Contrastive, AR, Freezing, Priors, they all can work. You almost can't stop these models from exhibiting these new types of behavior...
...It reminds me a lot of early DL days, when people used to think you needed sparsity regularization to learn nice Gabor filters in NNs, but then it turned out that almost any model with convolution and enough natural data would learn them on its own...
...We shifted our attention to different parts of the problem, as it became just "a given" that pretrained convnets would yield nice representations for visual transfer learning, regardless of the architecture and dataset (kind of crazy when you think about it)...
...The past month has felt a lot like those 2012-2016 days of just seeing the tip of the iceberg of a new transfer learning paradigm and a new set of things that we start to take for granted ("of course" a LLM works fine for multimodal transfer to a wildly different domain...)
Rather than training domain-specific models for each dataset, we show that a seq2seq approach can jointly train on many different datasets with arbitrary combinations of instruments. This is an important step towards general purpose music transcription.
Why are we doing this if we're supposed to be working on machine learning for creativity? Transcription extracts notes from audio, which are useful both for human control and for training powerful language models on symbolic music from real audio (e.g. Music Transformer)
1/4 Sorry for another AI rant, I'm just reminded on a daily basis of how harmful the term really is. Almost all of these technologies could be much better described by saying what they actually do, where the "A" stands for "automation" and/or "augmentation", and is hardly "artificial".
It gives a much clearer picture of what a technology does, how it changes power dynamics of society, and who's responsible for its creation and use.
The distinction between augmentation and automation is really a subjective one, depending on whether people feel the process being automated still has value when done manually by a person. There's nothing new about that; machine learning just accelerates it.
A lot of folks have been asking me my thoughts about the recent Jukebox work by @OpenAI, so I thought a thread might help. I feel like I have separate reactions from three different parts of my identity:
1) ML researcher 2) ML researcher of music 3) Musician
Long thread :)
1/17
1) As an ML researcher, I think the results are really impressive! The model builds directly off of the VQ-VAE2 work of @avdnoord, hierarchically modeling discrete codes with transformer priors, and autoregressive audio approaches of @sedielem.
2/17
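To make the "discrete codes + transformer priors" idea concrete, here's a minimal sketch (my own illustration in JAX, not Jukebox's actual code) of the vector-quantization step: each continuous latent frame is snapped to its nearest codebook entry, and the resulting index sequences, at several temporal resolutions, are what the transformer priors model autoregressively.

```python
import jax.numpy as jnp

def vector_quantize(latents, codebook):
    """latents: [time, dim] continuous latents; codebook: [n_codes, dim].

    Returns the nearest-codebook index per timestep and the quantized latents.
    """
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = jnp.argmin(dists, axis=-1)   # one discrete token per timestep
    quantized = codebook[codes]          # the code sequences are what the
                                         # transformer priors then model
    return codes, quantized
```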
This work shows that with meticulous engineering and TONS of data (more on that later) these models can really scale! Sander and I have had a friendly back and forth about this approach for years, and I was truly amazed by the output quality. It’s really impressive research!
3/17
2/ tl;dr: We've made a library of differentiable DSP components (oscillators, filters, etc.) and show that it enables combining strong inductive priors with expressive neural networks, resulting in high-quality audio synthesis with less data, less compute, and fewer parameters.
3/ An example DDSP module is an Additive Synthesizer (sum of time-varying sinusoids). A network provides controls (frequencies, amplitudes), the synthesizer renders audio, and the whole op is differentiable. Here's a simple example with harmonic (integer multiple) frequencies.
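As a rough sketch of that idea (my own toy illustration, not the DDSP library's API), here's a harmonic additive synthesizer written with JAX so every op stays differentiable; `f0_hz` and `harmonic_amps` stand in for the controls a network would output.

```python
import jax.numpy as jnp

def harmonic_synth(f0_hz, harmonic_amps, sample_rate=16000):
    """f0_hz: [n_samples] fundamental in Hz; harmonic_amps: [n_samples, n_harmonics]."""
    n_harmonics = harmonic_amps.shape[-1]
    # Harmonic (integer-multiple) frequencies: f0, 2*f0, 3*f0, ...
    freqs = f0_hz[:, None] * jnp.arange(1, n_harmonics + 1)[None, :]
    # Accumulate phase per harmonic: 2*pi * cumulative sum of (f / sample_rate).
    phases = 2.0 * jnp.pi * jnp.cumsum(freqs / sample_rate, axis=0)
    # Sum of time-varying sinusoids; all ops are differentiable, so gradients
    # can flow back to whatever network produced the controls.
    return jnp.sum(harmonic_amps * jnp.sin(phases), axis=-1)
```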
2/ tl;dr: We show that for musical instruments, we can generate audio ~50,000x faster than a standard WaveNet, with higher quality (in both quantitative metrics and listener tests), and with independent control of pitch and timbre, enabling smooth interpolation between instruments.
3/ We explore a range of architectures and audio representations and find that the best results come from generating in the spectral domain, with large FFT sizes to allow for better frequency resolution (H), and generating the instantaneous frequency (IF) instead of the phase directly.
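For illustration, here's a minimal sketch (my own, not the paper's code) of what "instantaneous frequency" means here: unwrap the STFT phase along the time axis and take its frame-to-frame difference, which is a much smoother target for a generator than raw phase.

```python
import jax.numpy as jnp

def instantaneous_frequency(phase):
    """phase: [n_frames, n_bins] STFT phase angles in radians."""
    unwrapped = jnp.unwrap(phase, axis=0)   # remove 2*pi jumps over time
    dphase = jnp.diff(unwrapped, axis=0)    # per-frame phase derivative
    # Keep the derivative in [-pi, pi); the generator models this quantity
    # rather than the rapidly wrapping raw phase.
    return (dphase + jnp.pi) % (2.0 * jnp.pi) - jnp.pi
```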