1/ At a high level, "textual inversion" is a technique for introducing a new "concept" to text2img diffusion models.
In this example, the diffusion model learns what this specific "<cat-toy>" is (1st img) and, when prompted with "<cat-toy> in NYC", produces a coherent result (2nd img).
2/ Technically, it is a process of:
I. add one new token, let's call it tkn99, to the model's vocab
II. freeze all weights except tkn99's embedding
III. run training on a few example imgs paired with prompts containing tkn99
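A toy sketch of steps I-III in plain Python, to make the mechanics concrete. This is not real training code: the "model" here is an assumed stand-in (prompt embeddings averaged, then passed through a fixed linear layer), the "example imgs" are random vectors, and the loss is plain MSE rather than a diffusion denoising loss. Names like `decoder`, `forward` and `targets` are made up for illustration.

```python
import copy
import random

random.seed(0)
DIM = 4  # toy embedding size

# Frozen toy "model": an embedding table plus a fixed linear layer (assumption).
vocab = {"a": 0, "photo": 1, "of": 2}
embeddings = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in vocab]
decoder = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]


def forward(token_ids):
    """Average the prompt's embeddings, then apply the frozen linear layer."""
    n = len(token_ids)
    mean = [sum(embeddings[t][d] for t in token_ids) / n for d in range(DIM)]
    return [sum(decoder[r][d] * mean[d] for d in range(DIM)) for r in range(DIM)]


def loss(token_ids, targets):
    """Squared error between the model output and each example, averaged."""
    out = forward(token_ids)
    total = sum(sum((o - t) ** 2 for o, t in zip(out, tgt)) for tgt in targets)
    return total / len(targets)


# I. add one new token, tkn99, to the vocab (new zero-initialised embedding row)
vocab["tkn99"] = len(embeddings)
embeddings.append([0.0] * DIM)
new_id = vocab["tkn99"]

# II. freeze everything else: the loop below only ever writes to row new_id;
#     the snapshot is kept just to verify that afterwards
frozen_snapshot = copy.deepcopy(embeddings[:new_id])

# III. train on a few examples (random vectors standing in for example imgs)
targets = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
prompt = [vocab[w] for w in "a photo of tkn99".split()]

initial_loss = loss(prompt, targets)
lr = 0.5
for _ in range(200):
    out = forward(prompt)
    # dLoss/dOut, averaged over the examples
    d_out = [sum(2 * (out[r] - tgt[r]) for tgt in targets) / len(targets)
             for r in range(DIM)]
    # chain rule through the linear layer and the prompt average, for tkn99 only
    grad = [sum(decoder[r][d] * d_out[r] for r in range(DIM)) / len(prompt)
            for d in range(DIM)]
    embeddings[new_id] = [e - lr * g for e, g in zip(embeddings[new_id], grad)]
final_loss = loss(prompt, targets)

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
print("other embeddings untouched:", embeddings[:new_id] == frozen_snapshot)
```

The only thing that changes during training is one row of the embedding table, which is why the learned concept ends up as a tiny artifact compared to the model itself.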