Justin Alvey

@justLV

Feb 2 • 5 tweets • 4 min read

We are getting closer to “Her”, where conversation is the new interface.

Siri couldn’t do it, so I built an e-mail summarizing feature using #GPT3 and life-like #AI generated voice on iOS.

(🔈Audio on to be 🤯with voice realism!)

How did I do this? 👇

I used the Gmail API to feed recent unread e-mails into a prompt, then sent it to the @OpenAI #GPT3 Completion API. Calling out details such as not “just reading them out”, along with other prompt tweaks, gave good results.
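
A rough sketch of the prompt-building step, in Python. The e-mail fields, the instruction wording, and the `complete` callable are all assumptions made for illustration; the thread does not show the actual prompt or client code, and the Gmail fetch is left out entirely.

```python
from typing import Callable, Dict, List

# Hypothetical instruction text; the real prompt is not shown in the thread.
INSTRUCTIONS = (
    "You are a friendly voice assistant. Summarize the unread e-mails below "
    "conversationally. Do not just read them out."
)


def build_prompt(emails: List[Dict[str, str]]) -> str:
    """Turn unread e-mails (sender, subject, snippet) into one completion prompt."""
    lines = [INSTRUCTIONS, "", "Unread e-mails:"]
    for i, mail in enumerate(emails, start=1):
        lines.append(f"{i}. From: {mail['sender']} | Subject: {mail['subject']}")
        lines.append(f"   {mail['snippet']}")
    lines += ["", "Spoken summary:"]
    return "\n".join(lines)


def summarize(emails: List[Dict[str, str]], complete: Callable[[str], str]) -> str:
    """`complete` stands in for whatever client sends the prompt to a completion API."""
    if not emails:
        return "You have no unread e-mails."
    return complete(build_prompt(emails)).strip()
```

Passing the completion call in as a function keeps the prompt logic easy to try out with a stub before wiring in a real client.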

Here are the settings I used; you can see how #GPT3 does a great job of conversationally summarizing. (For the sake of privacy, I made up the e-mails shown in the demo.)

The audio model was fine-tuned on speech from the movie Her.

I got good results with TorToiSe, but have also experimented with VITS & YourTTS from @coqui_ai and, more recently, @ElevenLabs.

None are fast enough for a snappy response when combined with text-davinci-003 completions, so...

I have a script running on a server at home that checks for new e-mails, prepares the audio, and serves the latest generated clip on an endpoint. From there, it's as simple as setting up a Siri Shortcut in iOS to retrieve and decode the audio.
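
One way the serving side could look, as a minimal standard-library sketch. The thread only says the Shortcut "retrieves and decodes" the audio, so the base64-in-JSON format, the `/latest` route, the port, and the `LATEST_AUDIO` cache are all assumptions here; the e-mail check and text-to-speech job are assumed to run elsewhere and overwrite the cached bytes.

```python
import base64
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical cache: a background job would overwrite this with fresh audio bytes.
LATEST_AUDIO = {"bytes": b""}


def audio_payload(audio: bytes) -> bytes:
    """Wrap raw audio as base64 inside JSON so a client can fetch and decode it."""
    body = {"audio_b64": base64.b64encode(audio).decode("ascii")}
    return json.dumps(body).encode("utf-8")


class LatestAudioHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/latest":  # hypothetical route name
            self.send_error(404)
            return
        payload = audio_payload(LATEST_AUDIO["bytes"])
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Blocks forever, answering GET /latest with the most recent cached audio."""
    HTTPServer(("0.0.0.0", port), LatestAudioHandler).serve_forever()
```

Because the audio is generated ahead of time, the request itself only returns cached bytes, which is what hides the slow completion and speech synthesis steps from the user.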

• • •

More from @justLV

Justin Alvey

@justLV

Jan 3

I used AI to create a (comedic) guided meditation for the New Year!

(audio on, no meditation pose necessary!)

Used ChatGPT for an initial draft, and TorToiSe trained on only 30s of audio of Sam Harris

See 🧵 for implementation details

ChatGPT came up with some creative ideas, but the delivery was still fairly vanilla, so I iterated on it heavily and added a few Sam-isms from my experience with the @wakingup app (Jokes aside - highly recommended)

Diffusion models & autoregressive transformers are coming for audio!

Text-To-Speech was created using github.com/neonbjb/tortoi…

I also highly enjoyed reading the author's blog nonint.com

Justin Alvey

@justLV

Dec 20, 2022

https://twitter.com/karenxcheng/status/1605276493675995138

I used the #StableDiffusion 2 Depth Guided model to create architecture photos from dollhouse furniture.

By using a depth-map you can create images with incredible spatial consistency without using any of the original RGB image.

See 🧵

2/ This model is unique as it was fine-tuned from the Stable Diffusion 2 base with an extra channel for depth.

Using MiDaS (a model that predicts depth from a single image), it can create new images whose depth maps match that of your "init image"

3/ I set the denoising strength to 1.0 so that none of the original RGB image was used

Even with widely different prompts it was able to generate consistent objects

Using simple, recognizable shapes such as wooden doll-house furniture worked great for this
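
To show why a denoising strength of 1.0 discards the original RGB content, here is a simplified model of how img2img-style pipelines commonly map strength to the number of denoising steps. This is a sketch under that assumption, not the code of the tool used in the thread; the function names are made up for illustration.

```python
def steps_to_run(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps applied in a typical img2img schedule.

    Simplified: the init image is noised to a level proportional to `strength`
    and then denoised for that many steps. At 1.0 the full schedule runs,
    starting from (nearly) pure noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return min(int(num_inference_steps * strength), num_inference_steps)


def skipped_steps(num_inference_steps: int, strength: float) -> int:
    """Early steps skipped because the init image already supplies structure."""
    return num_inference_steps - steps_to_run(num_inference_steps, strength)


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(s, steps_to_run(50, s), skipped_steps(50, s))
```

With no steps skipped, nothing of the source pixels survives the noising, so the depth map is the only thing tying the output to the init image.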

Justin Alvey

@justLV

Nov 1, 2022

https://twitter.com/karenxcheng/status/1587510079770615809

1/ I created this with Stable Diffusion using image inpainting and “walking through the latent space”

No tweening is used: every frame is generated from an interpolated embedding with a variable denoising strength, so keeping continuity was tricky

See 🧵for process
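
A toy illustration of producing one interpolated embedding per frame. The thread does not say which interpolation was used, so plain linear interpolation over lists of floats is an assumption here (spherical interpolation is another common choice), and the per-frame denoising strength is not modeled.

```python
from typing import List


def lerp(a: List[float], b: List[float], t: float) -> List[float]:
    """Linear blend of two equal-length vectors; t=0 gives a, t=1 gives b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def walk(a: List[float], b: List[float], num_frames: int) -> List[List[float]]:
    """One embedding per frame, evenly spaced from a to b inclusive."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [lerp(a, b, i / (num_frames - 1)) for i in range(num_frames)]


if __name__ == "__main__":
    for frame in walk([0.0, 2.0], [4.0, 0.0], 5):
        print(frame)
```

Each frame's embedding would then condition its own generation, which is why small steps between frames matter for continuity.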

2/ First off, finding the right combination of prompt, seed and denoising strength for an #img2img in-painting is a roll of the dice

Luckily, it is easy to script large batches and cherry-pick the best results
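
Scripting such a batch can be as simple as enumerating every combination of prompt, seed, and denoising strength. The sketch below only builds the job list; the field names and file naming scheme are hypothetical, and the actual image generation call is left to whichever tool is in use.

```python
from itertools import product
from typing import Dict, List, Sequence


def batch_jobs(
    prompts: Sequence[str],
    seeds: Sequence[int],
    strengths: Sequence[float],
) -> List[Dict]:
    """One job per (prompt, seed, strength) combination, with a traceable filename."""
    jobs = []
    for (pi, prompt), seed, strength in product(enumerate(prompts), seeds, strengths):
        jobs.append(
            {
                "prompt": prompt,
                "seed": seed,
                "strength": strength,
                # Encode the settings in the name so a good image can be reproduced.
                "filename": f"p{pi:02d}_seed{seed}_str{strength:.2f}.png",
            }
        )
    return jobs


if __name__ == "__main__":
    for job in batch_jobs(["a wooden chair"], [1, 2], [0.6, 0.8]):
        print(job["filename"])
```

Keeping the seed and strength in the filename means any cherry-picked image points straight back to the settings that produced it.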