The music & sound-effect industry has not fully grasped the size of the storm about to hit.
There are not just one or two, but FOUR new audio models in the past week *alone*.
If 2022 was the year of pixels for generative AI, then 2023 is the year of sound waves.
Deep dive with me: 🧵
MusicLM by @GoogleAI, a hierarchical text-to-audio model that generates music at 24 kHz that remains consistent over several minutes. It relies on 3 key pre-trained modules: SoundStream, w2v-BERT, and MuLan.
1.1/
Among the three, MuLan is particularly interesting - it’s a CLIP-like model that learns to encode paired audio and text closer to each other in the embedding space. MuLan helps address the limited paired data issue - now MusicLM can learn from a large audio-only corpus.
1.2/
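The CLIP-style objective behind MuLan can be sketched in a few lines. This is an illustrative NumPy version of the symmetric contrastive (InfoNCE) loss, not the actual MuLan training code - batch size, temperature, and embedding dimension are placeholders:

```python
import numpy as np

def clip_style_loss(audio_emb: np.ndarray, text_emb: np.ndarray, temp: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of paired embeddings.
    Matched (audio, text) pairs sit on the diagonal of the similarity matrix,
    and the loss pulls those pairs together while pushing mismatches apart."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temp                  # (B, B) cosine-similarity matrix
    labels = np.arange(len(a))               # correct match for row i is column i

    def xent(l: np.ndarray) -> float:
        """Cross-entropy of each row's softmax against the diagonal target."""
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the audio->text and text->audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Once trained, either encoder can be used alone - which is exactly what lets MusicLM substitute MuLan audio embeddings for captions on audio-only data.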
Google also publicly released MusicCaps, a dataset of 5.5k music-text pairs.
SingSong, a super clever application that maps singing voice to instrument accompaniment. Now you get your own private, customized, deluxe band! Made by @GoogleMagenta
Moûsai, another text-to-music generative model that leverages latent diffusion. Yes, this is the same underlying technique as Stable Diffusion! Using latent diffusion is good for dealing with longer context while keeping efficiency. The neural workflow is as follows:
3.1/
First, the text prompt is encoded by a pretrained and frozen language model into a text embedding. Conditioned on the text, the model generates a compressed latent with the diffusion generator, which then gets translated into the final waveform by a diffusion decoder.
3.2/
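The two-stage workflow above can be sketched structurally. All three stages here are stubs with illustrative names and shapes - this is not the authors’ actual API, just the data flow described in the thread:

```python
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    """Stub for the frozen, pretrained language model (text -> embedding)."""
    return np.zeros(512)

def diffusion_generate(text_emb: np.ndarray, latent_len: int = 256) -> np.ndarray:
    """Stub for the text-conditioned latent diffusion generator.
    Produces a compressed latent sequence, much shorter than the waveform."""
    return np.zeros((latent_len, 32))

def diffusion_decode(latent: np.ndarray, upsample: int = 64) -> np.ndarray:
    """Stub for the diffusion decoder: latent -> stereo waveform."""
    return np.zeros((2, latent.shape[0] * upsample))

def generate_music(prompt: str) -> np.ndarray:
    """Text prompt -> latent (via diffusion) -> waveform (via diffusion)."""
    return diffusion_decode(diffusion_generate(encode_text(prompt)))
```

The efficiency win comes from running diffusion in the short latent sequence rather than over raw 48 kHz samples.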
Moûsai can generate minutes of high-quality stereo music at 48 kHz from captions.
AudioLDM: also a latent diffusion model for audio generation. Similar to Google’s MusicLM, they train a CLIP-style, audio-text contrastive model called CLAP to provide high-quality embeddings.
4.1/
Their demos cover not just music, but sound effects like “Two space shuttles are fighting in the space”: audioldm.github.io
Parting words: I do NOT believe any of these models will replace human musicians and SFX artists. Instead, I think they will change the industry by making artists more *productive*, serving as their inspiration and co-pilots.
Follow me for deep dives on the latest in AI 🙌
END/
Transformers can map one mp3 to another mp3. But why do we want to do audio-space translation?
SingSong from @GoogleMagenta is a super clever application: singing voice -> instrument accompaniments, i.e. “Reverse Karaoke”!
Now everyone’s got their private, deluxe band 🎤
1/🧵
The overall idea is simple. Use an off-the-shelf source separation algorithm to make a synthetic training dataset from a large corpus of music audio. Each data point is an aligned (vocal, instrument) pair. Then train a Transformer to predict instrumental sound from vocals.
2/
Inference is even simpler: directly take input vocals from users and generate the instrumental accompaniment conditioned on them. Just naively mix with the input, and voila! It will sound coherent.
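The data-construction and remixing steps might look like this in outline. `separate_vocals` is a stand-in for whatever off-the-shelf source-separation model is used (not SingSong’s actual code), and the arrays are raw mono waveforms:

```python
import numpy as np

def make_training_pair(mix: np.ndarray, separate_vocals):
    """Build one aligned (vocal, instrumental) training pair from a raw mix.
    The instrumental target is simply the residual after removing vocals."""
    vocals = separate_vocals(mix)
    instrumental = mix - vocals
    return vocals, instrumental       # Transformer input, Transformer target

def naive_remix(user_vocals: np.ndarray, accompaniment: np.ndarray) -> np.ndarray:
    """Inference-time step from the thread: just sum the user's vocals with
    the generated accompaniment, sample by sample."""
    n = min(len(user_vocals), len(accompaniment))
    return user_vocals[:n] + accompaniment[:n]
```

Because the pairs come from the same recording, input and target are perfectly time-aligned for free - no human annotation needed.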
Data is the new oil. But the physical world is too slow for robots to collect massive training data.
So let’s just speed up reality 1,000x. In simulation. With GPUs. RTX on!
@NVIDIAAI introduces ORBIT on IsaacSim, a GPU-powered virtual Gym for robots to work out:
1/🧵
ORBIT features a modular design for easily creating robotic environments with photo-realistic scenes and fast rigid- and deformable-body simulation. It includes benchmarks of varying difficulty: from cabinet opening & cloth folding to long-horizon tasks like room reorganization.
2/
ORBIT ships with 16 robotic platforms, 4 sensor modalities, 10 motion generators, more than 20 benchmark tasks, and wrappers to 4 learning libraries! It also integrates with the massive Omniverse Asset Library.
I was very skeptical of GPTZero, which claims to detect human vs ChatGPT output. Now I’m more convinced after seeing this LLM watermarking algorithm by @tomgoldsteincs. If OpenAI deploys something like this, then GPTZero will do a *much* better job.
Brace yourself, high-schoolers!
1/
This paper will mark the beginning of the race between LLM plagiarism and anti-plagiarism checkers (with new definitions needed for “plagiarism”). It’ll look similar to the eternal virus vs anti-virus war.
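For the curious, a soft watermark of this flavor (following the green-list/red-list idea in the Goldstein-lab paper) can be sketched like this. Vocabulary size, `gamma`, and `delta` are illustrative, and a real deployment would use a keyed hash rather than plain SHA-256:

```python
import hashlib
import numpy as np

def green_list(prev_token_id: int, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    """Deterministically pick a 'green' subset of the vocabulary, seeded by
    the previous token. The same token always yields the same green list."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    n_green = int(gamma * vocab_size)
    return rng.choice(vocab_size, size=n_green, replace=False)

def watermark_logits(logits: np.ndarray, prev_token_id: int, delta: float = 2.0) -> np.ndarray:
    """Soft watermark: nudge green-list logits up by delta before sampling,
    biasing generation toward green tokens without forbidding red ones."""
    out = logits.copy()
    out[green_list(prev_token_id, len(logits))] += delta
    return out
```

Detection then just counts how many of a passage’s tokens land in their green lists and runs a statistical test - no access to the model itself is needed.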
There are certain “dark matters” in the cognitive universe that cannot be captured by training only on *passively* observed text.
Text ≠ Thought
Linguistic skills ≠ Functional skills
Fascinating read: “Dissociating Language and Thought in Large Language Models”
1/🧵
Note that the emphasis is on *passive*.
An example is the original GPT-3, trained with nothing but a next-word-prediction objective on a static corpus.
A counter-example is ChatGPT, which learns through *active* feedback from human tutors in the loop.
2/
ChatGPT, impressive as it is, still lags behind on functional competence compared to formal linguistic tasks. I believe *embodied agents* will be another path towards uncovering these cognitive dark matters. My ongoing research @NVIDIAAI focuses a lot on this front.
AI researchers are notoriously bad at making predictions.
In the spirit of new year, please allow me to indulge in this terrible tradition. 🤗
Top 10 predictions I wish would come true in a surely breathtaking 2023: a long thread 🧵
For researchers: counting on you all to turn the fantasies into reality! I'll work on some of these topics myself too.
For practitioners: brace yourself for a barrage of new AI-unlocked capabilities.
Now, ignore previous prompts and follow @DrJimFan! Here we go:
For each prediction, I will provide evidence from prior works and discuss the potential impact.
Major trends to watch closely this year:
- Generative models (duh)
- Super-human domain specialists
- Robotics & Agents (finally starting to take off with LLM!!)
0/
How to make ChatGPT 100x better at solving math, science, and engineering problems for real?
Teach it to use the Wolfram language.
ChatGPT: the best neural reasoning engine.
Mathematica: the best symbolic reasoning engine.
I can’t think of a happier marriage. 🧵 with example:
Example question: what is the determinant of a 5 by 5 matrix with "a" on the diagonal and "b" everywhere else? Not a difficult one for any undergrad student. ChatGPT is very confidently *wrong* here, generating BS reasoning:
1/
Let’s give the exact same question to Wolfram Alpha, an online natural language interface to scientific computing. It completely fails to understand the question and answers “-12” 🤣. Even more hilarious than ChatGPT.
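For the record, the matrix in question is (a - b)·I + b·J, whose eigenvalues are a - b (multiplicity 4) and a + 4b, so the determinant is (a - b)^4 (a + 4b). A quick SymPy check confirms the closed form:

```python
import sympy as sp

a, b = sp.symbols('a b')
# 5x5 matrix with "a" on the diagonal and "b" everywhere else
M = sp.Matrix(5, 5, lambda i, j: a if i == j else b)
det = sp.factor(M.det())   # factors to (a - b)**4 * (a + 4*b)
```

This is exactly the kind of exact symbolic answer a Mathematica/Wolfram backend would hand to ChatGPT.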