A brief look into image generation and AI.
Imagine looking at a picture of a duck on a screen that’s all fuzzy with static.
If you narrow your eyes, you might make out the shape of the duck.
You could even try painting over the static to make the duck clearer.
You might squint, draw some lines to capture the shape of the duck, squint at those lines, and eventually end up with a rough picture of a duck: not exactly the same as the real duck, but recognizable.
Now, picture looking at just random static on the screen, and somehow convincing yourself that there’s a duck hidden in there.
With a lot of effort, you might squint and work your way from a super fuzzy image of a duck to a less fuzzy one, until you can finally see a clear image of a duck.
This process is called “diffusion,” and it is the core idea behind models like Stable Diffusion.
How it works:
Researchers and engineers took millions of images and added a bit of noise to each one.
Then they encoded them into a special space (a latent space) and trained a model to clean them up, that is, to remove the noise.
After that, they took those slightly noisy images, added more noise, and trained the model to make them less noisy again.
They kept doing this, adding more noise each time, until the images became almost unrecognizable.
But step by step, the computer was able to imagine its way back to some sort of image — maybe not a great one or a useful one, but an image without all that noise.
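As a rough sketch of what that training looks like (illustrative PyTorch, not the actual code of any particular model; `model` stands in for a denoising network such as a U-Net): pick a random amount of noise, mix it into the image, and ask the network to predict the noise that was added.

```python
import torch
import torch.nn.functional as F

# Illustrative diffusion-style training step. Real systems (DDPM, Stable Diffusion)
# use carefully designed noise schedules; the linear schedule below is a stand-in.
def training_step(model, images, num_steps=1000):
    batch = images.shape[0]
    # Pick a random noise level per image: step 0 = nearly clean, step 999 = almost pure static.
    t = torch.randint(0, num_steps, (batch,), device=images.device)
    alpha = (1.0 - t.float() / num_steps).view(-1, 1, 1, 1)  # crude stand-in for a real schedule

    noise = torch.randn_like(images)                          # the "static" we mix in
    noisy = alpha.sqrt() * images + (1.0 - alpha).sqrt() * noise

    # The model's only job: look at the noisy image and guess the noise that was added.
    predicted_noise = model(noisy, t)
    return F.mse_loss(predicted_noise, noise)
```

Repeated over millions of images at every noise level, the network learns to take one small step from “more static” toward “less static,” which is exactly the squinting process described above.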
What’s also possible is that they can train the computer to link an image in one space (let’s call it image-latent-space) to certain words in another space (word-latent-space).
This helps the computer figure out how likely it is for a picture to have a specific caption. For example, a picture of a duck is more likely to have a caption like “cute baby duck” than a caption like “the engine from a 1999 Honda Sedan.”
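This caption-image matching is roughly what models like CLIP do: encode the picture and the caption into the same space and score how well they line up. Here is a toy sketch of that scoring step; `image_encoder` and `text_encoder` are hypothetical placeholders for trained networks, not a real API.

```python
import torch
import torch.nn.functional as F

def caption_score(image, caption, image_encoder, text_encoder):
    image_vec = F.normalize(image_encoder(image), dim=-1)   # point in image-latent-space
    text_vec = F.normalize(text_encoder(caption), dim=-1)   # point in word-latent-space
    return (image_vec * text_vec).sum(dim=-1)               # cosine similarity: higher = better match

# For a duck photo, we would expect:
#   caption_score(duck_photo, "cute baby duck", ...) > caption_score(duck_photo, "the engine from a 1999 Honda Sedan", ...)
```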
Important:
The key mechanism here is “cross attention,” which is roughly a way of letting one network pay attention to another network’s output so that different AI systems can work together.
Imagine gluing these systems together so they can do two things at the same time.
Specifically, they can ask the computer to not only remove noise from an image but also adjust that image to be more suitable for a specific caption.
That’s the whole idea behind generative AI. It’s about using these methods to create new content, like images, that fits certain criteria or styles.
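As a sketch of that gluing, here is a minimal cross-attention block in PyTorch. The layer sizes are illustrative, not taken from any specific model; the idea is simply that the image features query the caption features at every denoising step.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Lets the denoiser 'look at' the caption while it cleans up the image."""
    def __init__(self, image_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=image_dim, num_heads=heads,
            kdim=text_dim, vdim=text_dim, batch_first=True,
        )

    def forward(self, image_tokens, text_tokens):
        # Queries come from the image being denoised; keys and values come from the
        # caption embedding, so the caption steers what the image turns into.
        out, _ = self.attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return out

# image_tokens: (batch, image_patches, 320); text_tokens: (batch, caption_tokens, 768)
```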
So why do these models have trouble generating hands that are proportional and accurate representations of human hands?
Why does this process create images with extra fingers and distorted hands?
The wrong way to think about it is assuming that since the model is trained on images of correct hands, it should be able to draw them correctly.
The idea of the “latent space” is important here: it is a space that contains every possible image the model can create.
The denoising process encourages images to move away from noise and become clearer.
Additionally, there’s a push to make the image fit a particular text label. For example, if the label says “apple,” any image that looks like an apple could satisfy it.
When it comes to images with hands, most captions don’t focus on describing hands, especially not with specifics like the right number of fingers.
So, the system doesn’t feel a strong need to make the hands perfect. Even trying to use negative labels like “no deformed hands” doesn’t work well because there are very few images labeled as “deformed hands.”
In many generative systems, there are special parts to ensure faces look right. This is because the system might be satisfied with something just looking face-like, while we are very particular about correct face details, like eyes looking in the same direction.
Now, here’s another test of your understanding:
Why do images get better when you add the name of an artist, like Lucian Freud?
Most image captions usually describe the main subject of the picture, like a person or an object.
The background might not matter much; for instance, if the focus is on a teapot, it’s fine to label it as “a teapot.” The other parts of the image aren’t as important.
An artist’s style is like the overall feel or look of the artwork.
It’s not just in one part of the image but in the entire thing.
Every pixel is influenced by the style, so the outcome of the image becomes more coherent when the artist’s style is considered.
Using text to guide image generation is challenging because, while the space where text ideas are turned into image ideas is related to the image space, there are several steps involved: from the original prompt, to the text embedding space, and then to the function that creates the image.
This process can only handle a certain amount of text at a time (usually around 75 tokens, which is not quite the same as 75 words). If there’s more than that, you need to divide the prompt into separate chunks that are encoded separately and then combined to guide the image together.
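As a sketch of what that limit looks like in practice, here is the openly available CLIP tokenizer (the one Stable Diffusion 1.x conditions on) counting tokens and splitting an overly long prompt into 75-token chunks; the chunking strategy shown is just one simple approach, not the only one.

```python
from transformers import CLIPTokenizer

# Stable Diffusion 1.x uses a CLIP text encoder with a 77-token window,
# of which 75 are usable for the prompt (the rest are start/end markers).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a cute baby duck in a teapot, oil painting in the style of Lucian Freud, soft morning light"
token_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(len(token_ids))  # tokens, not words: punctuation and rare words often split into several tokens

# One simple (and lossy) way to handle longer prompts: break them into 75-token
# chunks, encode each chunk separately, and combine the resulting embeddings.
chunks = [token_ids[i:i + 75] for i in range(0, len(token_ids), 75)]
```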
Now, is there a better way to direct image generation and get specific results?
Sure. It’s much easier to guide image creation using images themselves, because an image can be translated accurately into the special space that’s designed for images.
Here are some methods people have come up with to guide image generation using images instead of words:
Create an image that could be simplified to the same basic outline as another image.
Create an image that could be simplified to the same depth map as another image.
Create an image with a similar pose to another image.
Create an image with the same style as another image; the content doesn’t matter.
Create an image that has the same intensity and color contrast as another image.
Generate a color palette that includes the complementary colors of another image.
All of these techniques provide strong guidelines that allow for precise control over the content and arrangement of a generated image.
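As one concrete example of the first technique (guiding with a basic outline), here is roughly how it looks with the Hugging Face diffusers library and a publicly released ControlNet edge model. The model IDs, thresholds, and file names are common defaults assumed for illustration, not something prescribed by this article.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Reduce a guiding image to its basic outline (a Canny edge map).
guide = np.array(Image.open("duck_photo.png").convert("RGB"))
edges = cv2.Canny(guide, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Attach a ControlNet trained on edge maps to a Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The prompt decides content and style; the edge map pins down the layout.
result = pipe("a cute baby duck, oil painting", image=edge_image).images[0]
result.save("duck_from_edges.png")
```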
The only challenge with using these techniques is figuring out where to get all these guiding images. Admittedly, if you already have a background in art, visual design, or illustration, this will likely be no challenge for you at all.
It took me years to master drawing hands, and I mean drawing for 6+ hours a day.
Perhaps if AI image generation models were trained on some of the rudimentary anatomy exercises we were taught in art school, they would be better at generating hands.
See the images below:
One aspect holds true:
If you haven’t studied art, or even paid attention to it as a spectator, you may have trouble generating images that aren’t just as homogeneous as most of the imagery flooding the internet.
Just as someone could, by imitation, manage to create what looks like an excellent product design portfolio, their lack of experience or knowledge in product design would quickly show, making it clear that their understanding went no deeper than their perception of the role.
It’s tempting to want to be an artist and attempt traditional methods to create the beautiful things you see inside your head. But just like life, rarely is anything exactly as we imagine it.