2/ First off, finding the right combination of prompt, seed and denoising strength for an #img2img in-painting is a roll of the dice
Luckily, it's easy to script large batches to cherry-pick from
3/ The first and last pairs were just regular #img2img, ramped through a range of denoising strengths from 0 to 0.8
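A minimal sketch of what that batch scripting can look like (assuming a recent 🧨Diffusers release where the img2img argument is `image`; the model ID, prompt and file name are placeholders, not the ones from the demo):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Placeholder prompt / init image for illustration
prompt = "a futuristic smartphone, studio product photo"
init_image = Image.open("phone.png").convert("RGB").resize((512, 512))

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Sweep seeds and denoising strengths, save everything, cherry-pick later.
# (Strengths near 0 barely change the init image, so the ramp starts just above it.)
for seed in range(8):
    for strength in (0.1, 0.25, 0.4, 0.55, 0.7, 0.8):
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(
            prompt=prompt,
            image=init_image,
            strength=strength,
            guidance_scale=7.5,
            generator=generator,
        ).images[0]
        image.save(f"out_seed{seed}_strength{strength:.2f}.png")
```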
4/ Transitions were done using a customized @huggingface 🧨Diffusers pipeline.
This lets me “slerp” (spherically interpolate) between both the noise latents AND the text embeddings, for each given seed & prompt respectively (rough sketch below)
(while keeping denoising strength at ~0.8)
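Roughly, the slerp part looks like this (a sketch only: the `encode_prompt` helper, the seeds and prompts, and passing pre-computed `latents` alongside `prompt_embeds` into an img2img call are assumptions; the stock pipeline doesn't accept the latents, which is exactly what the customized pipeline adds):

```python
import torch

def slerp(t, v0, v1, eps=1e-7):
    """Spherical interpolation between two tensors of the same shape."""
    v0_n, v1_n = v0 / (v0.norm() + eps), v1 / (v1.norm() + eps)
    dot = (v0_n * v1_n).sum().clamp(-1 + eps, 1 - eps)
    theta = torch.acos(dot)
    if theta.abs() < 1e-4:                      # nearly parallel: fall back to lerp
        return (1 - t) * v0 + t * v1
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

def encode_prompt(pipe, prompt):
    """Minimal prompt -> CLIP text embedding (ignores classifier-free guidance details)."""
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt")
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

seed_a, seed_b = 1234, 5678                        # placeholder seeds
prompt_a, prompt_b = "phone A ...", "phone B ..."  # placeholder prompts

gen_a = torch.Generator("cuda").manual_seed(seed_a)
gen_b = torch.Generator("cuda").manual_seed(seed_b)
latents_a = torch.randn((1, 4, 64, 64), generator=gen_a, device="cuda")
latents_b = torch.randn((1, 4, 64, 64), generator=gen_b, device="cuda")
emb_a, emb_b = encode_prompt(pipe, prompt_a), encode_prompt(pipe, prompt_b)

for i, t in enumerate(torch.linspace(0, 1, 60)):
    latents = slerp(t, latents_a, latents_b)
    embeds = slerp(t, emb_a, emb_b)
    # The customized pipeline accepts pre-computed latents alongside prompt_embeds;
    # denoising strength stays fixed at ~0.8 across the whole transition.
    frame = pipe(prompt_embeds=embeds, latents=latents, image=init_image,
                 strength=0.8, guidance_scale=7.5).images[0]
    frame.save(f"transition_{i:03d}.png")
```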
5/ Some tricks were needed with blending and adjusting the inpainting mask to smoothly switch between the init images of the two real phones (rough idea sketched below)
(example generations on the right)
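The mask/blending idea in sketch form (file names are placeholders, and both photos plus the hand-drawn mask are assumed to be the same size): feather the mask and cross-fade the two phones underneath it as the transition parameter t goes from 0 to 1.

```python
import numpy as np
from PIL import Image, ImageFilter

# Placeholder file names for the two real phones and the inpainting mask
phone_a = np.asarray(Image.open("phone_a.png").convert("RGB"), dtype=np.float32)
phone_b = np.asarray(Image.open("phone_b.png").convert("RGB"), dtype=np.float32)
mask = Image.open("phone_mask.png").convert("L").filter(ImageFilter.GaussianBlur(8))
mask_np = np.asarray(mask, dtype=np.float32)[..., None] / 255.0   # H x W x 1, in [0, 1]

def blended_init(t: float) -> Image.Image:
    """Cross-fade the two phones inside the feathered mask; keep the background fixed."""
    inside = (1 - t) * phone_a + t * phone_b
    blended = mask_np * inside + (1 - mask_np) * phone_a
    return Image.fromarray(blended.astype(np.uint8))

# blended_init(t) then becomes the init image fed to the pipeline at interpolation step t
```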
6/ Not every walk through latent space was a smooth path, but it's easy to script the search for pairs that work well (and let your GPU replace your central heating)
Having the ability to play with these models on this level is incredible.
More creative AI experiments to come!
I “jailbroke” a Google Nest Mini so that you can run your own LLMs, agents and voice models.
Here’s a demo using it to manage all my messages (with help from @onbeeper)
🔊 on, and wait for surprise guest!
I thought hard about how to best tackle this and why, see 🧵
After looking into jailbreaking options, I opted to completely replace the PCB.
This lets you use a cheap ($2) but powerful & developer-friendly WiFi chip with a highly capable audio framework.
This allows a paradigm of multiple cheap edge devices for audio & voice detection…
& offloading large models to a more powerful local device (whether your M2 Mac, PC server w/ GPU or even "tinybox"!)
In most cases this device is already trusted with your credentials and data, so you don’t have to hand these off to some cloud, and the data never needs to leave your home
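The thread doesn't include code for this, but the split could look roughly like the sketch below on the powerful local device: Flask, Whisper, the `/utterance` route and the `run_local_llm` hook are all illustrative assumptions, not the actual stack.

```python
import tempfile

import whisper                      # openai-whisper, for speech-to-text on the local box
from flask import Flask, request

app = Flask(__name__)
stt = whisper.load_model("base")

def run_local_llm(prompt: str) -> str:
    # Placeholder: swap in whatever local model/agent you run (llama.cpp, MLX, an agent, ...)
    return f"(LLM response to: {prompt!r})"

@app.route("/utterance", methods=["POST"])
def utterance():
    # The cheap edge device POSTs a short WAV clip after on-device wake-word detection
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        f.write(request.data)
        f.flush()
        text = stt.transcribe(f.name)["text"]
    reply = run_local_llm(text)
    return {"transcript": text, "reply": reply}   # edge device can speak `reply` back

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```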
I wanted to imagine how we’d better use #stablediffusion for video content / AR.
A major obstacle (and the reason most videos are so flickery) is the lack of temporal & viewing-angle consistency, so I experimented with an approach to fix this
See 🧵 for process & examples
Ideally you want to learn a single representation of an object across time or different viewing directions to perform a *single* #img2img generation on.
This learns an "atlas" to represent an object and its background across the video.
Regularization losses during training help preserve the original shape, with a result that resembles a usable, slightly “unwrapped” version of the object
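Conceptually, once the atlas is trained, applying the *single* #img2img edit back to every frame is just a per-frame UV lookup into the edited atlas. A toy sketch with made-up tensor shapes (the real pipeline also handles foreground/background layers and alpha):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: one edited atlas from a single #img2img pass,
# plus per-frame UV coordinates predicted by the trained atlas network
edited_atlas = torch.rand(1, 3, 1024, 1024)          # 1 x C x H_atlas x W_atlas
uv_per_frame = torch.rand(60, 512, 512, 2) * 2 - 1   # T x H x W x 2, in [-1, 1]

def atlas_to_frames(atlas: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
    """Sample the single edited atlas at every frame's UV coordinates, so the
    one img2img result stays consistent across time and viewing angle."""
    T = uv.shape[0]
    atlas_rep = atlas.expand(T, -1, -1, -1)           # reuse the same atlas for all frames
    return F.grid_sample(atlas_rep, uv, mode="bilinear", align_corners=True)

frames = atlas_to_frames(edited_atlas, uv_per_frame)  # T x 3 x 512 x 512
```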
We are getting closer to “Her” where conversation is the new interface.
Siri couldn’t do it, so I built an e-mail-summarizing feature using #GPT3 and a lifelike #AI-generated voice on iOS.
(🔈Audio on to be 🤯with voice realism!)
How did I do this? 👇
I used the Gmail API to feed recent unread e-mails into a prompt and send it to the @OpenAI #GPT3 Completion API. Calling out details such as not “just reading them out”, plus other prompt tweaks, gave good results
Here are the settings I used; you can see how #GPT3 does a great job of conversationally summarizing. (For the sake of privacy I made up the e-mails shown in the demo)
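A minimal sketch of the plumbing (assumes you already have Gmail OAuth credentials in `creds` and the legacy openai<1.0 client; the model name, prompt wording and settings are illustrative, not the exact ones from the demo):

```python
import openai
from googleapiclient.discovery import build

# `creds` = your Gmail API OAuth credentials; OPENAI_API_KEY set in the environment
service = build("gmail", "v1", credentials=creds)
unread = service.users().messages().list(userId="me", q="is:unread", maxResults=5).execute()

snippets = []
for m in unread.get("messages", []):
    msg = service.users().messages().get(userId="me", id=m["id"]).execute()
    snippets.append(msg.get("snippet", ""))

prompt = (
    "You are a friendly assistant. Conversationally summarize these unread e-mails. "
    "Don't just read them out; group them and call out anything urgent.\n\n"
    + "\n---\n".join(snippets)
)

completion = openai.Completion.create(
    model="text-davinci-003",     # GPT-3 Completion endpoint (legacy openai<1.0 client)
    prompt=prompt,
    temperature=0.7,
    max_tokens=300,
)
print(completion.choices[0].text.strip())
```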
I used AI to create a (comedic) guided meditation for the New Year!
(audio on, no meditation pose necessary!)
Used ChatGPT for an initial draft, and TorToiSe trained on only 30s of audio of Sam Harris
See 🧵 for implementation details
ChatGPT came up with some creative ideas, but the delivery was still fairly vanilla, so I iterated on it heavily and added a few Sam-isms from my experience with the @wakingup app (jokes aside, highly recommended)
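For the voice side, the TorToiSe inference step looks roughly like this (a sketch: file names are placeholders, and the training on ~30s of Sam Harris audio mentioned above isn't shown, just the reference-clip conditioning at generation time):

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

# A few short reference clips of the target voice, resampled to 22.05 kHz (placeholder paths)
voice_samples = [load_audio(p, 22050) for p in ("sam_1.wav", "sam_2.wav", "sam_3.wav")]

script = "Welcome to this guided meditation for the New Year..."  # ChatGPT-drafted, then hand-edited
gen = tts.tts_with_preset(script, voice_samples=voice_samples, preset="high_quality")

# TorToiSe outputs 24 kHz audio
torchaudio.save("meditation.wav", gen.squeeze(0).cpu(), 24000)
```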
Diffusion models & autoregressive transformers are coming for audio!