You've probably seen these strangely beautiful AI-generated images on Twitter. Have you wondered how they are created?
In this thread, I'll tell you about a method for generating art with ML known as VQGAN+CLIP.
Let's jump in 👇
Short History 👇
In January 2021, @OpenAI publicly released CLIP, a model that can match images to text descriptions.
Just days after that, some people like @advadnoun, @RiversHaveWings, and @quasimondo started experimenting using CLIP to guide the output of a GAN using text.
👇
Together with CLIP, OpenAI also announced DALL-E, an image generation model, but without releasing the full code or the pre-trained weights.
The results from guiding StyleGAN2 or BigGAN with CLIP aren't as accurate as DALL-E, but they are weirdly artistic.
When CLIP was paired with a recent model released by the University of Heidelberg in December 2020 called VQGAN, the quality of the generated images improved dramatically.
VQGAN+CLIP is one of the most commonly used methods today.
To start, you need to specify what the final image should look like - the text prompt. You can use natural language for this, and you can define several prompts as well.
Let's try the following prompt:
"a car driving on a beautiful mountain road"
Not bad!
👇
However, let's say we want to see some sky on the top and we want the car to be nicer. We can add 2 more prompts describing the scene better.
"a car driving on a beautiful mountain road"
"sky on the top mountains on the sides"
"a beautiful sports car"
Much better!
👇
You can now apply some tricks to improve the visual style of the image. For example, you can use the so-called "unreal engine trick" found by @arankomatsuzaki. You just add another prompt saying "unreal engine" and you will get a more realistic-looking image!
Nice! 👇
The process of tuning the text description is called prompt engineering. CLIP is a very powerful model, but you need to ask it in the right way to get the best results. Try out different things!
Here is an example of the styles you can achieve with different prompts.
👇
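If you want to try this yourself, here is a hypothetical prompt configuration. The exact syntax depends on the notebook or script you use - many of the public VQGAN+CLIP notebooks accept several prompts in a single string separated by "|", sometimes with an optional ":weight" per prompt.

```python
# Hypothetical prompt configuration - exact syntax depends on the notebook you use.
# Many public VQGAN+CLIP notebooks take several prompts separated by "|".
prompts = (
    "a car driving on a beautiful mountain road | "
    "sky on the top mountains on the sides | "
    "a beautiful sports car | "
    "unreal engine"
)

# Some notebooks also accept per-prompt weights, e.g. "text:weight",
# to emphasize the subject over the style hint:
weighted_prompts = "a beautiful sports car:1.5 | unreal engine:0.5"
```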
Start image
By default, the generation starts from a random noise image. However, you can provide your own image to start the process. The final image will then resemble it much more closely, so this is another way to guide the model.
Example with the same text prompt
👇
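As a minimal sketch (my own simplification, not the exact notebook code), starting from your own image just means initializing the tensor that gets optimized from that image instead of from random noise:

```python
# Minimal sketch: load a start image as the tensor that will be optimized,
# instead of starting from random noise. The file name is just an example.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),           # float tensor in [0, 1], shape (3, H, W)
])

init = Image.open("mountain_photo.jpg").convert("RGB")   # hypothetical file
image = preprocess(init).unsqueeze(0)                    # shape (1, 3, 224, 224)
image.requires_grad_(True)                               # this tensor gets optimized
```

In the actual VQGAN+CLIP notebooks, the start image is first encoded by the VQGAN into its latent codes, and those codes are what gets optimized.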
Now, let's look a bit under the hood. What happens is the following:
1️⃣ VQGAN generates an image
2️⃣ CLIP evaluates how well the image fits the text prompt
3️⃣ The error is backpropagated to update the VQGAN's latent codes
4️⃣ Go back to 1️⃣ (see the sketch below)
👇
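Here is a minimal sketch of that loop, assuming the openai CLIP package is installed (pip install git+https://github.com/openai/CLIP). To keep it self-contained, it optimizes raw pixels directly; the real method optimizes the VQGAN latent codes and decodes them into an image in step 1️⃣.

```python
# Minimal sketch of the CLIP-guided optimization loop. For simplicity the
# "generator" is a raw pixel tensor; in the real VQGAN+CLIP setup the VQGAN
# latent codes are optimized and decoded into an image at every step.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so gradients flow cleanly

# Encode the text prompt once - it stays fixed during the optimization.
tokens = clip.tokenize(["a car driving on a beautiful mountain road"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Stand-in for step 1: a learnable 224x224 RGB image (could also be
# initialized from a start image instead of random noise, as shown earlier).
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

# CLIP's expected input normalization.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(300):
    optimizer.zero_grad()
    # Step 2: CLIP scores how well the current image matches the prompt.
    normalized = (image.clamp(0, 1) - mean) / std
    image_features = model.encode_image(normalized)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    loss = 1 - (image_features * text_features).sum()  # 1 - cosine similarity
    # Step 3: backpropagate and update the image (the VQGAN latents in the real method).
    loss.backward()
    optimizer.step()
```

Swapping the raw pixel tensor for VQGAN latent codes (decoded to an image each step) gives you the actual VQGAN+CLIP setup.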
CLIP
The really cool thing about CLIP is that it can take an image or some text and encode them into a shared intermediate (latent) space. Because this space is the same for both, you can directly compare how similar an image is to a sentence. This by itself is a 🤯 achievement.
👇
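To make this concrete, here is a small example using OpenAI's clip package (the image file name is just a placeholder): encode one image and a few sentences, then compare them with cosine similarity.

```python
# Encode one image and several captions with CLIP and compare them.
# Assumes the openai CLIP package (pip install git+https://github.com/openai/CLIP).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("car.jpg")).unsqueeze(0).to(device)   # hypothetical file
texts = clip.tokenize([
    "a car driving on a mountain road",
    "a bowl of fruit on a table",
]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

    # Normalize and compare - higher cosine similarity means a better match.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

print(similarity)  # the first caption should score higher for a car photo
```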
VQGANs
VQGAN is also an interesting method, because it allows powerful transformers to be used efficiently on high-resolution images for the first time, by decomposing the image into an ordered sequence of entries from a learned codebook.
👇
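A toy illustration of the codebook idea (not the real VQGAN code): every feature vector produced by the encoder is snapped to its nearest entry in a learned codebook, so the whole image becomes a short sequence of integer indices that a transformer can model.

```python
# Toy illustration of vector quantization with a codebook.
import torch

codebook = torch.randn(1024, 256)        # 1024 learned entries of dimension 256
features = torch.randn(16 * 16, 256)     # encoder output for a 16x16 latent grid

# Nearest-neighbour lookup: distance from each feature to every codebook entry.
distances = torch.cdist(features, codebook)      # shape (256, 1024)
indices = distances.argmin(dim=1)                # one codebook index per position

quantized = codebook[indices]            # the discretized representation
print(indices.shape)                     # 256 tokens describing the whole image
```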
So, we are essentially optimizing the VQGAN's latent codes to produce images that fit the text prompts (the pre-trained VQGAN and CLIP weights stay frozen). CLIP is used as a powerful judge that guides the generation in the right direction.
The expressive power comes from the vast knowledge CLIP contains and can use.
👇
Unfortunately, this also means that the whole process is rather slow and needs a powerful GPU with lots of RAM. The experiments above were done in a Google Colab Pro notebook with an Nvidia P100 GPU and each image (1000 iterations) takes about 15 minutes to create.
👇
Another interesting feature is that you can take all the intermediate images and create a cool-looking video!
Check out this video of my Halloween pumpkin using the following prompts:
"a scary orange jack-o-lantern"
"red fire background"
👇
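A small sketch for stitching the saved frames into a video, assuming you wrote one PNG per iteration into a steps/ folder (a hypothetical layout) and have imageio plus imageio-ffmpeg installed:

```python
# Stitch the intermediate frames into a video.
import glob
import imageio

frames = [imageio.imread(path) for path in sorted(glob.glob("steps/*.png"))]
imageio.mimsave("progress.mp4", frames, fps=30)
```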
So, that's it for now. In the next thread, I'll tell you how I used this method to create an NFT collection and earn more than $3000 in 2 weeks.
There is a problem with how value is distributed in online communities today. It seems we take the status quo for granted and don't discuss it much.
The people that create most of the value get none of the money! Only badges...
Thread 👇
Online communities
I'm talking about platforms like Twitter, Reddit, Stack Overflow etc. They're wonderful places, where you can discuss interesting topics, get help with a problem, or read the latest news.
However, the people that make them truly valuable receive nothing 👇
It usually looks like this:
▪️ Company creates a web 2.0 platform
▪️ Users create content and increase the value
▪️ Company aggregates the demand
▪️ Company monetizes with ads and subscriptions
▪️ Company gets lots of money
▪️ Creators get badges, karma and virtual gold
This is the formula for the Binary Cross Entropy Loss. This loss function is commonly used for binary classification problems.
It may look super confusing, but I promise you that it is actually quite simple!
Let's go step by step 👇
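For reference, the standard form of the Binary Cross-Entropy Loss over N samples is:

BCE = -(1/N) · Σᵢ [ Yᵢ · log(Ŷᵢ) + (1 - Yᵢ) · log(1 - Ŷᵢ) ]

where Yᵢ is the ground-truth label (0 or 1) and Ŷᵢ is the predicted probability.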
The Cross-Entropy Loss function is one of the most used losses for classification problems. It tells us how well a machine learning model classifies a dataset compared to the ground truth labels.
The Binary Cross-Entropy Loss is a special case when we have only 2 classes.
👇
The most important part to understand is the term inside the sum - it is the core of the whole formula!
Here, Y denotes the ground-truth label, while Ŷ is the predicted probability of the classifier.
Let's look at a simple example before we talk about the logarithm... 👇
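Here is a tiny worked example (my own illustration) of that core term, Y·log(Ŷ) + (1-Y)·log(1-Ŷ), and of the averaged loss:

```python
# A tiny worked example of the Binary Cross-Entropy Loss.
import math

y_true = [1, 0, 1, 1]          # ground-truth labels
y_pred = [0.9, 0.2, 0.7, 0.4]  # predicted probabilities from the classifier

losses = [
    -(y * math.log(p) + (1 - y) * math.log(1 - p))
    for y, p in zip(y_true, y_pred)
]
bce = sum(losses) / len(losses)

print([round(l, 3) for l in losses])  # confident, correct predictions -> small loss
print(round(bce, 3))                  # the Binary Cross-Entropy is the average
```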
ROC curves plot the True Positive Rate (also known as Recall or Sensitivity) against the False Positive Rate. So, if you have an imbalanced dataset, the ROC curve can still look good even if your classifier largely ignores the underrepresented class.
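A small demo of that point (my own illustration, using scikit-learn): with 1000 negatives and only 10 positives, the ROC-AUC can look excellent even though most of the flagged examples are false positives.

```python
# Why ROC-AUC can look great on an imbalanced dataset while the rare
# positive class is handled poorly.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score

rng = np.random.default_rng(0)

# 1000 negatives, only 10 positives.
y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])

# Scores: the positives rank high, but 50 negatives also get high scores.
neg_scores = np.concatenate([rng.uniform(0.0, 0.5, 950), rng.uniform(0.6, 0.95, 50)])
pos_scores = rng.uniform(0.6, 1.0, 10)
y_score = np.concatenate([neg_scores, pos_scores])

print("ROC-AUC:   ", roc_auc_score(y_true, y_score))                         # close to 1.0
print("Precision: ", precision_score(y_true, (y_score >= 0.6).astype(int)))  # far lower
```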