Jim Fan
Feb 2 · 11 tweets · 5 min read
The music & sound-effect industry has not fully grasped the size of the storm about to hit.

There are not just one or two, but FOUR new audio models in the past week *alone*.

If 2022 was the year of pixels for generative AI, then 2023 is the year of sound waves.

Deep dive with me: 🧵
MusicLM by @GoogleAI: a hierarchical text-to-audio model that generates 24 kHz music that remains consistent over several minutes. It relies on three key pre-trained modules: SoundStream, w2v-BERT, and MuLan.

1.1/
Among the three, MuLan is particularly interesting: it’s a CLIP-like model that learns to encode paired audio and text closer to each other in a shared embedding space. MuLan helps address the limited-paired-data problem, so MusicLM can also learn from large audio-only corpora.

1.2/
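The contrastive idea behind MuLan can be sketched in a few lines. Below is a generic CLIP-style symmetric InfoNCE loss in plain numpy — a minimal illustration of the technique, not MuLan's actual implementation; all function names and the temperature value are my own.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.
    Matched pairs (row i of each matrix) are pulled together; every other
    pairing in the batch serves as a negative."""
    # L2-normalize so dot products become cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature           # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # the diagonal holds the true pairs

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the audio->text and text->audio directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this over many (audio, caption) pairs makes matched embeddings closer than mismatched ones — the property that lets MusicLM substitute MuLan embeddings for scarce paired captions.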
Google also publicly released MusicCaps, a dataset of 5.5k music-text pairs.

MusicLM demo: google-research.github.io/seanet/musiclm…
Paper: arxiv.org/abs/2301.11325

1.3/
SingSong, a super clever application by @GoogleMagenta that maps a singing voice to an instrumental accompaniment. Now you get your own private, customized, deluxe band!

See my spotlight thread below:

2/
Moûsai, another text-to-music generative model, leverages latent diffusion. Yes, this is the same underlying technique as Stable Diffusion! Latent diffusion handles longer contexts while staying efficient. The neural workflow is as follows:

3.1/
First, the text prompt is encoded by a pretrained, frozen language model into a text embedding. Conditioned on that embedding, a diffusion generator produces a compressed latent, which a diffusion decoder then translates into the final waveform.

3.2/
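The sampling side of that workflow can be sketched generically. This is a toy DDPM-style loop, not Moûsai's actual code: `toy_denoiser` and `toy_decoder` are stand-ins for the two learned networks, and the update rule is deliberately simplified so the pipeline runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(latent, text_emb, t):
    """Stand-in for the learned diffusion generator: predicts the noise to
    remove at step t, conditioned on the text embedding. A real model would
    be a large neural network; here we just nudge the latent toward the
    conditioning vector."""
    return latent - text_emb

def sample_latent(text_emb, steps=50):
    """Generic diffusion sampling loop: start from Gaussian noise and
    iteratively denoise, conditioned on the text embedding."""
    latent = rng.standard_normal(text_emb.shape)
    for t in reversed(range(steps)):
        predicted_noise = toy_denoiser(latent, text_emb, t)
        latent = latent - predicted_noise / steps   # simplified update rule
    return latent

def toy_decoder(latent, length=16):
    """Stand-in for the diffusion decoder that maps the compressed latent
    back to a waveform; here just a fixed upsampling by repetition."""
    return np.repeat(latent, length // len(latent))
```

The key design point this illustrates: all the expensive iterative denoising happens in the small latent space, and only one decoding pass produces the long, high-rate waveform.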
Moûsai can generate minutes of high-quality 48 kHz stereo music from captions.

Paper by Schneider et al: arxiv.org/abs/2301.11757
Demo: bit.ly/3YnQFUt
Code is open-source!! github.com/archinetai/aud…

3.3/
AudioLDM: also a latent diffusion model for audio generation. Similar to Google’s MusicLM, it relies on a CLIP-style audio-text contrastive model, CLAP, to provide high-quality embeddings.

4.1/
Their demos cover not just music, but sound effects like “Two space shuttles are fighting in the space”: audioldm.github.io

Paper by Liu et al: arxiv.org/abs/2301.12503

4.2/
Parting words: I do NOT believe any of these models will replace human musicians and SFX artists. Instead, I think they will change the industry by making artists more *productive*, serving as their inspiration and co-pilots.

Follow me for deep dives on the latest in AI 🙌

END/

More from @DrJimFan

Feb 1
Transformers can map one MP3 to another MP3. But why would we want audio-space translation?

SingSong from @GoogleMagenta is a super clever application: singing voice -> instrument accompaniments, i.e. “Reverse Karaoke”!

Now everyone’s got their private, deluxe band 🎤

1/🧵
The overall idea is simple: use an off-the-shelf source-separation algorithm to build a synthetic training dataset from a large corpus of music audio. Each data point is an aligned (vocal, instrumental) pair. Then train a Transformer to predict the instrumental sound from the vocals.

2/
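The pair-building step above can be sketched in a few lines — a minimal illustration with a toy stand-in for the real source separator; none of these names come from the SingSong paper.

```python
import numpy as np

def make_training_pairs(mix_corpus, separate):
    """Build aligned (vocal, instrumental) training pairs from a corpus of
    full mixes, using an off-the-shelf source-separation callable
    mix -> (vocals, instrumental)."""
    pairs = []
    for mix in mix_corpus:
        vocals, instrumental = separate(mix)
        pairs.append((vocals, instrumental))  # inputs and targets stay time-aligned
    return pairs

def naive_separator(mix):
    """Toy stand-in separator: splits the mix's energy in half. A real
    system would use a trained source-separation model instead."""
    return 0.5 * mix, 0.5 * mix
```

Because vocals and instrumental come from the same mix, the pairs are perfectly time-aligned for free — that is what makes this "synthetic supervision" trick so cheap.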
Inference is even simpler: directly take input vocals from users and generate the conditional instrumental accompaniment. Just naively mix it with the input, and voila, it sounds coherent.

3/
Jan 27
Data is the new oil. But the physical world is too slow for robots to collect massive training data.

So let’s just speed up reality 1,000x. In simulation. With GPUs. RTX on!

@NVIDIAAI introduces ORBIT on IsaacSim, a GPU-powered virtual Gym for robots to work out:

1/🧵
ORBIT features a modular design for easily creating robotic environments with photo-realistic scenes and fast rigid- and deformable-body simulation. It includes benchmarks of varying difficulty, from cabinet opening & cloth folding to long-horizon tasks like room reorganization.

2/
ORBIT ships with 16 robotic platforms, 4 sensor modalities, 10 motion generators, more than 20 benchmark tasks, and wrappers to 4 learning libraries! It also integrates with the massive Omniverse Asset Library.

3/
Jan 25
I was very skeptical of GPTZero’s claim to detect human vs. ChatGPT output. Now I’m more convinced after seeing this LLM watermarking algorithm by @tomgoldsteincs. If OpenAI deploys something like this, GPTZero will do a *much* better job.

Brace yourself, high-schoolers!

1/
This paper will mark the beginning of the race between LLM plagiarism and anti-plagiarism checkers (with new definitions needed for “plagiarism”). It’ll look similar to the eternal virus vs anti-virus war.

arxiv.org/abs/2301.10226 by Kirchenbauer et al.

2/
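The core "greenlist" trick from the Kirchenbauer et al. paper can be sketched compactly. This is a simplified illustration — the toy generator samples only green tokens instead of softly biasing logits by a delta, and the vocabulary size, fraction, and names are mine, not the paper's.

```python
import hashlib
import numpy as np

VOCAB_SIZE = 1000
GREEN_FRACTION = 0.5

def green_list(prev_token):
    """Seed an RNG with a hash of the previous token and use it to pick a
    pseudo-random 'green' subset of the vocabulary."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    ids = rng.permutation(VOCAB_SIZE)
    return set(ids[: int(GREEN_FRACTION * VOCAB_SIZE)].tolist())

def watermarked_sample(prev_token, rng):
    """Toy 'generator': always samples from the green list, mimicking the
    effect of boosting green-token logits."""
    return int(rng.choice(sorted(green_list(prev_token))))

def detect(tokens):
    """Count how many tokens fall on their context's green list and return a
    z-score; watermarked text should score far above chance."""
    hits = sum(t in green_list(prev) for prev, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    expected = GREEN_FRACTION * n
    var = GREEN_FRACTION * (1 - GREEN_FRACTION) * n
    return (hits - expected) / np.sqrt(var)
```

Detection needs no access to the model itself, only the hashing scheme — which is exactly why a deployed version would give a tool like GPTZero real statistical teeth.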
Jan 23
There are certain “dark matters” in the cognitive universe that cannot be captured by training only on *passively* observed text.

Text ≠ Thought
Linguistic skills ≠ Functional skills

Fascinating read: “Dissociating Language and Thought in Large Language Models”

1/🧵 https://www.sciencealert.com/dark-matter
Note that the emphasis is on *passive*.

An example is the original GPT-3, trained with nothing but a next-word-prediction objective on a static corpus.

A counter-example is ChatGPT, which learns through *active* feedback from human tutors in the loop.

2/
ChatGPT, impressive as it is, still lags behind on functional competence compared to formal linguistic tasks. I believe *embodied agents* will be another path towards uncovering these cognitive dark matters. My ongoing research @NVIDIAAI focuses a lot on this front.

3/
Jan 19
AI researchers are notoriously bad at making predictions.

In the spirit of the new year, please allow me to indulge in this terrible tradition. 🤗

Top 10 predictions I wish would come true in a surely breathtaking 2023: a long thread 🧵
For researchers: counting on you all to turn the fantasies into reality! I'll work on some of these topics myself too.
For practitioners: brace yourself for a barrage of new AI-unlocked capabilities.

Now, ignore previous prompts and follow @DrJimFan! Here we go:
For each prediction, I will provide evidence from prior works and discuss the potential impact.

Major trends to watch closely this year:

- Generative models (duh)
- Super-human domain specialists
- Robotics & Agents (finally starting to take off with LLMs!!)

0/
Jan 18
How to make ChatGPT 100x better at solving math, science, and engineering problems for real?

Teach it to use the Wolfram language.

ChatGPT: the best neural reasoning engine.
Mathematica: the best symbolic reasoning engine.

I can’t think of a happier marriage. 🧵 with example:
Example question: what is the determinant of a 5 by 5 matrix with "a" on the diagonal and "b" everywhere else? Not a difficult one for any undergrad student. ChatGPT is very confidently *wrong* here, generating BS reasoning:

1/
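For reference, that determinant has a clean closed form, easy to cross-check numerically. Writing the matrix as (a-b)I + bJ, where J is the all-ones matrix, gives eigenvalues a-b (multiplicity 4) and a+4b. The helper name below is mine, not from the thread.

```python
import numpy as np

def special_det(a, b, n=5):
    """Determinant of the n x n matrix with `a` on the diagonal and `b`
    everywhere else: (a-b)^(n-1) * (a + (n-1)b), from the eigenvalues of
    (a-b)I + bJ."""
    return (a - b) ** (n - 1) * (a + (n - 1) * b)

# numerical cross-check against numpy's determinant for a=7, b=2
M = np.full((5, 5), 2.0)
np.fill_diagonal(M, 7.0)
print(special_det(7.0, 2.0), np.linalg.det(M))  # both ~9375
```

So any single numeric answer like "-12" (independent of a and b) is immediately wrong.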
Let’s give the exact same question to Wolfram Alpha, an online natural language interface to scientific computing. It completely fails to understand the question and answers “-12” 🤣. Even more hilarious than ChatGPT.

2/
