Fun with Diffusion Models

Part A.0: Setup & Text Prompts

In part A of the project, we experiment with diffusion models, sampling loops, and use them for tasks such as inpainting and creating optical illusions.

First, we obtain access to the DeepFloyd model and generate our own text prompts which we use to create images. I ran the model using a random seed value of 180 and tried out several different combinations for num_inference_steps using 20 and 200. At num_inference_steps = 20, the outputs were decent but not impressive. It was somewhat obvious that the outputs were AI-generated as they didn't look very realistic (especially due to the colors) and didn't have much texture / detail. Meanwhile, at a value of 200, the outputs had much more detail but the colors still looked somewhat exagerrated, especially for the pasta prompt. For the airport prompt, I think it could have done a better job making it clear that it was an airport in Japan specifically, and for the "man in love" prompt, I think it wasn't super obvious that the person was in love (except for maybe the first output with num_inference_steps = 20) -- it seemed like a normal glance and the expressions didn't look soft. I also tried a num_inference_step value of 400 and interestingly, there were visible small circles in the outputs.

Results for num_inference_steps = 20

Results for num_inference_steps = 200

Results for num_inference_steps = 20 (stage 1) & 200 (stage 2)

A.1.1: Implementing the Forward Process

Next, we move on to sampling loops by making use of the pretrained DeepFloyd denoisers. In this first subpart, we prepare ourselves for denoising by iteratively adding noise to a clean image. That is, given a clean image, we sample from a Gaussian (using mean and variance), apply scaling, and get a noisy image at a particular timestep. This process is stored in a "forward" function, and here are the results on an image of the campanile:

Campanile at various noise levels

A.1.2: Classical Denoising

Next, we are ready for denoising. In classical denoising, we use Gaussian blur filtering to try to remove noise. After applying forward on our images, we use torchvision.transforms.functional.gaussian_blur to apply the gaussian blur filter. Here I used a kernel size of 5, and as we can see, the image looks smoother but the essence of the campanile remains destroyed:

Campanile with Gaussian denoise applied

A.1.3: One-Step Denoising

Then, we use a pretrained diffusion model (stage_1.unet) to denoise. Aside from applying forward to our campanile image, we estimate the noise in the new noisy image by passing it through stage_1.unet. Using this, we can then remove noise from the noisy image to obtain an estimate of the original (clean) image. Importantly, we don't remove the noisy directly; we must scale the noise before removing since the forward process adds a *scaled* version of the noise. The results do no disappoint:

Original Campanile Photo

Campanile with one-step denoising

A.1.4: Iterative Denoising

Although 1.3 is great, we can see that the quality of the results deteriorate as the images become noisier. To address this, we try denoising in an iterative manner -- using timesteps. However, we do not want to go through all steps from 1 to n as that would be super slow, so we take our steps in strides of 30. This is done by creating a function that denoises our image at some starting timestep index (timestep[i_start]), applying a mathy formula to obtain an image at the next timestep (t' = timestep[i_start + 1]), and repeating the process until we produce a clean image. Here are some figures for comparison:

A.1.5: Diffusion Model Sampling

Aside from denoising images, we can also use the denoising model to generate images from scratch. This is done by setting our starting index / timestep as 0 and in our noisy image as random noise, which in effect denoises pure noise. To visualize the results, we take 5 samples of the prompt "a high quality photo".

Diffusion Sample Results

A.1.6: Classifier-Free Guidance (CFG)

The results in 1.5 were not great, but we can improve this with CFGs! In CFG, we compute both a conditional and unconditional noise estimate then scale their difference and sum them up to "condense" them to a sinlge noise. Our scale controls the strength of the CFG (how closely we want our generated images to follow our text prompt). These are the results on the same "high quality photo" prompt using a scale value of 7. For all subsequent subparts, we use CFGs.

CFG Results

A.1.7: Image-to-image Translation

Now that we have CFGs, we can get a little more creative! Here, we generate an image that is similar to the campanile by applying a series of edits to a random nose-generated image. We then take the original campanile image, add a little bit of noise using forward, and run iterative_denoise_cfg from 1.6. This produces a series of edits to our random image, each step getting closer and closer to original campanile. We also try the process on some other images, producing really cool results!

Campanile Edits

Reference image (grad photo)

Edits on grad image

Reference image (bear)

Edits on bear image

1.7.1: Editing Hand-Drawn and Web Images

We can also apply the process on illustrations! Here are the edits made on web and hand drawn images.

Original Sundae Kids Illustration (Web)

Sundae Kids Illustration Edit Result

Original Doodle #1 (Hand-Drawn)

Doodle #1 Edit Result

Original Doodle #2 (Hand-Drawn)

Doodle #2 Edit Result

1.7.2: Inpainting

Another cool application of this is image inpainting. Here are the results on a few images (campanile, rose, grad photo scene from Berkeley's cherry blossom trees).

1.7.3: Text-Conditional Image-to-image Translation

Aside from passing in "a high quality photo" as our prompt, we can give our model a specific prompt, and as we can see, it will make edits on an image that matches our prompt instead of a random image. By doing so we are essentially guiding our model with a text prompt. Here are some fun results:

Results on Campanile (from woman in black dress)

Results on rose (from plate of pasta)

Results on bear (from countryside sunset)

A.1.8: Visual Anagrams

Another cool application of CFGs is that it allows us to create optical illusions. Here we create anagrams by denoising an image normally using one prompt, giving us a noise estimate e1, then flipping the image upside down and denoising it with another prompt to get a second noise estimate e2. Then we flip e2 back and take the average of the two noises so that we have a noise e that captures the essence of both prompts. Below are anagram results for prompts "a beautiful butterfly" and "a painting of a fish" (top set) as well as "a painting of an airplane" and "an oil painting of a white boat on a blue sea" (bottom set).

Anagram of butterfly and fish

Anagram of airplane and boat

A.1.9: Hybrid Images

Lastly, we experiment with hybrid images by again creating composite noise estimates. This is done by combining the low frequencies of one noise estimate with the high frequencies of another. Below are hyrid results for prompts "a photo of city skyscrapers" and "a beautiful butterfly" (left) as well as "a painting of a fish" and "a tropical ocean with waves" (right).

Hybrid of skyscraper (far) and butterfly (close)

Hybrid of fish (far) and tropical ocean (close)

B.1.1: Implementing the UNet

In part B we try out alternative denoising and image sampling techniques by turning to UNets and flow matching. In this first part, we build a one-step denoiser with the following network architecture and specifications:

Conv2d(kernel_size, stride, padding) is nn.Conv2d()
BN is nn.BatchNorm2d()
GELU is nn.GELU()
ConvTranspose2d(kernel_size, stride, padding) is nn.ConvTranspose2d()
AvgPool(kernel_size) is nn.AvgPool2d()
D is the number of hidden channels and is a hyperparameter that we will set ourselves.

At a high level, the blocks do the following:

(1) Conv is a convolutional layer that doesn't change the image resolution, only the channel dimension.
(2) DownConv is a convolutional layer that downsamples the tensor by 2.
(3) UpConv is a convolutional layer that upsamples the tensor by 2.
(4) Flatten is an average pooling layer that flattens a 7x7 tensor into a 1x1 tensor. 7 is the resulting height and width after the downsampling operations.
(5) Unflatten is a convolutional layer that unflattens/upsamples a 1x1 tensor into a 7x7 tensor.
(6) Concat is a channel-wise concatenation between tensors with the same 2D shape. This is simply torch.cat()

(Unconditional) UNet Architecture

UNet Operations

B.1.2: Using the UNet to Train a Denoiser

After building our network, we can formulate a denoising objective: given some noisy image z, we'd like to train a denoiser D that can map z to a clean image x. Similar to part A, we start by adding noise to our clean images (MNIST digits) by generating training data pairs (z, x) with z = x + σe where e follows the Normal distribution from 0 to our image I. Here are the results across various σ values and digits:

1.2.1: Training

Afterwards, we can train a denoiser D. Here we train with σ=0.5, a batch size of 256, 5 epochs, UNet with hidden dimension of 128, and Adam optimizer with 1e-4 learning rate.

Sample results on test set

Training loss curve

1.2.2: Out-of-Distribution Testing

Aside from visualizing the results on our trained sigma value of 0.5, we can also examine the results on other sigma values that it wasn't trained on. As we can observe, it seems to perform worse on larger sigma values.

1.2.3: Denoising Pure Noise

We can also denoise pure noise by passing in a "blank canvas" of z = e. This is essentially the same process as 1.2.1 but instead of passing in digit images we use pure noise. As we can observe, the outputs look somewhat ghost-like and pretty much the same across all of our noise inputs. Intuitively, this is because we are computing the "average" of output digits 0-9 as there is no digit it is conditioned on.

Sample results on test set

Training loss curve for pure noise

B.2.1: Adding Time Conditioning to UNet

As we just saw in 1.2.3, denoising does not work well with pure noise (generative tasks). To address this, we can leverage flow matching which will denoise our digits iteratively. This difference with this is that we train a UNet model that essentially predicts the 'flow' from our noisy data to clean data. Intermediate noise samples are constructed using linear interpolation. To introduce conditioning, we use time (denoted as t below).

Time Conditional UNet

FCBlock (used for conditioning)

B.2.2: Training the UNet

Training is fairly straightforward here: after picking a random image from the training set and a random timestep, we can add noise to our random image to get xt. Then we train the denoiser to predict the flow at xt and repeat for different images and timesteps until the model converges. Now we use a batch size of 64, a hidden dimension of 64, Adam with 1e-2 learning rate, and a scheduler using torch.optim.lr_scheduler.ExponentialLR(...).

Training Loss Curve for Time-Conditioned UNet

B.2.3: Sampling from the UNet

As soon as training is done, we can visualize our model's outputs by sampling a few results:

Sampled Results for Epochs 1, 5, 10

B.2.4: Adding Class-Conditioning to UNet

As we can see in 2.3, the results were not great. To address this, we add class conditioning to our UNet, which is as simple as adding 2 more FCBlocks to our existing architecture along with a dropout rate of 10%.

B.2.5: Training the UNet

Training is fairly similar to time conditioned UNets, except we condition on our vector of digit classes and incorporate our dropout logic so that we perform unconditional generation from time to time.

Training Loss Curve for Class-Conditioned UNet

B.2.6: Sampling from the UNet

Simiar to before, we can sample the results after training. Here we use a gamma value of 5.0. In an attempt to avoid the need of a scheduler, we also repeat the training process on a loop that removes the exponential scheduler while decreasing the learning rate to 2e-2 and increasing the epoch count to 15. The results are comparable to our previous loop that uses a scheduler.

Sampled Results for Epochs 1, 5, 10

Sampled Results for Epochs 1, 5, 10 (No Scheduler)