For setup, I set up the DeepFloyd IF diffusion model on Hugging Face. DeepFloyd IF is a two-stage model trained by Stability AI. The first stage produces images of size 64×64, and the second stage takes the outputs of the first stage and generates images of size 256×256. I made a Hugging Face account, logged in, accepted the license on the model card, and generated a Hugging Face access token that I could use in my Colab notebook to run the diffusion model. Because the text encoder is very large, I used a set of precomputed text embeddings to run the diffusion model.
The random seed I am using is 180; I did not change the seed from its default value set in the Colab. I ran the DeepFloyd diffusion model on three text prompts, as DeepFloyd was trained as a text-to-image model, and the results are displayed. Below are reflections on the quality of the outputs and their relationships to the text prompts. To generate images, I passed the text embeddings through Stage 1 to create smaller images, and Stage 2 then upscaled those smaller images into the larger versions displayed below (each stage used the same value of num_inference_steps).
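Below is a minimal sketch of this two-stage setup using the diffusers library (the checkpoint names are the public DeepFloyd IF model cards; prompt_embeds and negative_embeds stand in for the precomputed text embeddings, and the exact arguments may differ slightly from my notebook):

```python
import torch
from diffusers import DiffusionPipeline

# Load both stages; text_encoder=None because we feed precomputed text embeddings.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16)

generator = torch.manual_seed(180)  # the seed used throughout this project

# Stage 1: text embeddings -> 64x64 image (kept as a tensor for Stage 2).
image_64 = stage_1(prompt_embeds=prompt_embeds,
                   negative_prompt_embeds=negative_embeds,
                   generator=generator,
                   num_inference_steps=20,
                   output_type="pt").images

# Stage 2: upscale the 64x64 output to 256x256.
image_256 = stage_2(image=image_64,
                    prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds,
                    generator=generator,
                    num_inference_steps=20).images[0]
```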
1) an oil painting of a snowy mountain village: The quality of the generated image is high, as there are many fine details such as a depiction of the Aurora Borealis in the sky – a phenomenon present in colder regions – and the placement of snow on flat surfaces like the branches of the trees and the rooftops. An oil-like texture is also evident, which reflects the original text prompt as well. The quality and relation to the prompt are both very high.
2) a man wearing a hat: This text prompt is vague, and the model does a great job extrapolating an image from it that satisfies all the details. It is a man wearing a hat, and the quality is high. I found it interesting how the model was able to extrapolate a type of hat and an expression for the man from just the vague prompt.
3) a rocket ship: The quality and the relation to the prompt are both high. This is a simple prompt, and a simple image results; the model does not over-extrapolate and generates a simple image of a rocket flying. Finer details like the reflection of light on the metallic rocket surface and the cloud created by the exhaust flames are also high quality.
I also experimented with different values of num_inference_steps, from the default value of 20 down to a significantly lower value of 10. With fewer inference steps, I observed subtle but noticeable changes in quality, color expression, and detailing in the generated images.
In this part of the project, I wrote my own sampling loops that use the pretrained DeepFloyd denoisers to produce high-quality images. I then modified these sampling loops to perform tasks like inpainting and creating optical illusions. A diffusion model reverses the forward (noising) process by denoising an image at each timestep, eventually arriving at t=0 with a prediction of the original, noise-free image.
I implemented the noisy_im = forward(im, t) function. A key part of diffusion is the forward process, which takes a clean image and adds noise to it, defined by $$q(x_t | x_0) = \mathcal{N}(x_t ; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)\mathbf{I}).$$ Given a coefficient $\bar{\alpha}_t$ for each timestep t (the alphas_cumprod variable) dictating how much noise augments the image, and a clean image $x_0$, we get the noisy image $x_t$ by sampling from a Gaussian with mean $\sqrt{\bar{\alpha}_t}\, x_0$ and variance $(1 - \bar{\alpha}_t)\mathbf{I}$. As can be seen in the equation, the forward process not only adds noise, it also scales the image. Below, the results for t = [250, 500, 750] are shown; at increasing timesteps, the image gets noisier.
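A minimal sketch of this forward function under the definitions above, assuming alphas_cumprod is a tensor of cumulative products of the alphas indexed by integer timestep t:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image and add Gaussian noise."""
    alpha_bar = alphas_cumprod[t]      # \bar{alpha}_t for this timestep
    noise = torch.randn_like(im)       # epsilon ~ N(0, I)
    return torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * noise
```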
Next, I tried to denoise the forward-process images I created in Part 1.1 using a classical method, Gaussian blur filtering, which convolves the image with a Gaussian kernel of a given standard deviation, smoothing out high-frequency elements of the image and, in theory, smoothing out the noise we added during the forward process. But as you can see in the results, it gets harder to denoise the image with this classical method as the timestep t in the forward process increases, and the results are far from ideal. This can be understood intuitively: the noise becomes part of the pixel values themselves, so as the Gaussian filter attempts to smooth the noise away, it has less and less real image data to pull from. I used the torchvision.transforms.functional.gaussian_blur function.
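A short sketch of this classical baseline (the kernel size and sigma here are illustrative, not the exact values I tuned):

```python
import torchvision.transforms.functional as TF

# Blur the noisy image with a Gaussian kernel; a larger sigma smooths more of the
# noise away but also destroys more of the underlying image detail.
blurred = TF.gaussian_blur(noisy_im, kernel_size=7, sigma=2.0)
```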
Now, instead of the classical Gaussian-blur denoiser, I used a pretrained diffusion model to denoise the image. This UNet has already been trained on a large dataset of (x_0, x_t) image pairs, so intuitively it already has information on how noise is placed into an image and how different patterns of noise correspond to the original image. This means I can use it to denoise the image at different timesteps t in the forward process. The UNet estimates the Gaussian noise in the image, and we can then use that noise estimate to recover an estimate of the original image. As the diffusion model was trained with text conditioning, we also need a text prompt embedding, so I use "a high quality photo". This is known as one-step denoising with a UNet, as we estimate the original image from timestep t in one step rather than gradually estimating the image at each step from t to t-1. The results are displayed below; compared to the classically denoised photos above, the UNet is able to recover a lot more detail and reduce the amount of noise significantly. However, it is still under par, and we can get better results, with finer detailing and edges, with iterative denoising.
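A sketch of one-step denoising under these definitions; it assumes the stage-1 UNet returns its prediction in a .sample attribute whose first three channels are the noise estimate (as in the diffusers UNet), which may differ slightly from the project scaffold:

```python
import torch

def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Estimate the clean image x_0 from the noisy image x_t in a single step."""
    with torch.no_grad():
        # Text-conditioned noise prediction at timestep t (drop any variance channels).
        noise_est = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    alpha_bar = alphas_cumprod[t]
    # Invert the forward process x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps.
    return (x_t - torch.sqrt(1 - alpha_bar) * noise_est) / torch.sqrt(alpha_bar)
```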
In Part 1.3, we saw that the denoising UNet does a much better job of projecting the image onto the natural image manifold and recovering details from the noised images produced by the forward process, in comparison to the classical denoising method of Gaussian blurring. However, the effectiveness of the one-step denoising method decreases as the timestep t in the forward process increases and more noise is added to the images. In this part, we use the diffusion model to denoise the image iteratively rather than in one step. Diffusion models are designed to denoise iteratively, but denoising at every individual timestep would be computationally intensive and expensive, so I instead denoised at strided timesteps, which skip timesteps in range(start, end) with a defined stride. On the i-th denoising step we are at strided_timesteps[i] and want to get to strided_timesteps[i+1] (from more noisy to less noisy, as the strided_timesteps array starts from the end and goes to t=0, a clean image). We can think of each update as a linear interpolation between the signal and the noise. As you can see in the results, the iteratively denoised result is better than the one-step and Gaussian results, as it has more detailing around the trees and structures.
The equation for each update that I used is below, where $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, and $v_\sigma$ is random noise (which in DeepFloyd's case is also predicted): $$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma$$
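A minimal sketch of the iterative denoiser built on this update (the UNet call and the noise inversion follow the earlier sketches; the $v_\sigma$ term is omitted for brevity):

```python
import torch

def iterative_denoise(unet, image, i_start, strided_timesteps, prompt_embeds, alphas_cumprod):
    """Denoise `image` from strided_timesteps[i_start] down to a clean estimate."""
    x_t = image
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]   # t is noisier than t'
        with torch.no_grad():
            noise_est = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]

        a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = a_bar_t / a_bar_tp
        beta_t = 1 - alpha_t

        # Current clean-image estimate (same inversion used in one-step denoising).
        x0_est = (x_t - torch.sqrt(1 - a_bar_t) * noise_est) / torch.sqrt(a_bar_t)

        # Interpolate between the clean estimate and the current noisy image.
        x_t = (torch.sqrt(a_bar_tp) * beta_t / (1 - a_bar_t)) * x0_est \
            + (torch.sqrt(alpha_t) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t
    return x_t
```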
In Part 1.4, I used the diffusion model to denoise an image iteratively. Another thing we can do with the iterative_denoise function that I wrote is to generate images from scratch. I did this by setting the i_start variable to 0 and passing in random noise as the "original image" to denoise, meaning there was no clear starting point from which the model denoised, resulting in the ability to recover a whole new image from the random noise! The prompt for the text embedding used is again "a high quality photo". Five results denoising pure noise into new images are displayed below. The quality isn't great, as we will see in the next section, but it still produces respectable, comprehensible output.
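A short usage sketch of sampling from scratch with the function above (the shape matches the 64×64 stage; device is whatever the pipeline is running on):

```python
# Start from pure Gaussian noise and denoise over the entire strided schedule.
x = torch.randn(1, 3, 64, 64, device=device)
sample = iterative_denoise(unet, x, i_start=0,
                           strided_timesteps=strided_timesteps,
                           prompt_embeds=prompt_embeds,
                           alphas_cumprod=alphas_cumprod)
```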
The generated images in the previous part, which iteratively denoised pure random noise created by torch.randn, were not very clear or realistic, not reflecting real-world scenes and at times being abstract. To improve image quality, at the cost of image diversity, I used a technique known as classifier-free guidance (CFG). In CFG, as an extra step on top of 1.5's iterative denoising function, we compute both an unconditional and a conditional noise estimate. We then let our new noise estimate be: new noise estimate = unconditional estimate + scale * (conditional estimate - unconditional estimate), and when scale > 1, we get much higher quality images. I used the null prompt "" for the unconditional noise estimate. As you can see, the results below are significantly more clear, coherent, and realistic.
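The extra CFG step is sketched below; uncond_embeds is assumed to hold the embedding of the null prompt "", and the scale value is illustrative:

```python
with torch.no_grad():
    noise_cond = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    noise_uncond = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]

cfg_scale = 7.0  # scale > 1 pushes the estimate further toward the conditional prompt
noise_est = noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```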
In Part 1.4, we took a real image, added noise to it, and then denoised it. Intuitively, because of the noise, our model does not exactly recover the original image; rather, it "guesses" or "estimates" the portions of the image that were noised, based on what it learned from its training set. This effectively allows us to make edits to existing images by adding noise to them and then iteratively denoising: the more noise we add, the larger the edit will be. The DeepFloyd diffusion model estimates new features and edits as it iteratively denoises the image under classifier-free guidance. In Part 1.7, I took the original test image (along with 2 images of my choosing representing other historical monuments), noised them using our forward process, and then denoised them using iterative_denoise_cfg at different starting indices [1, 3, 5, 7, 10, 20]. This gives a range of images which gradually look more like the original image, as we have more timesteps to iteratively denoise in the iterative_denoise_cfg function. The results are displayed below.
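A short sketch of this editing loop for the different starting indices (iterative_denoise_cfg stands for the CFG variant of the iterative denoiser sketched earlier; the other names carry over from the previous sketches):

```python
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    # Noise the real image to the level of strided_timesteps[i_start] ...
    x_noisy = forward(original_image, strided_timesteps[i_start], alphas_cumprod)
    # ... then project it back toward the natural image manifold with CFG denoising.
    edits.append(iterative_denoise_cfg(unet, x_noisy, i_start, strided_timesteps,
                                       prompt_embeds, uncond_embeds, alphas_cumprod))
```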
We can run the same procedure on nonrealistic images (cartoons, hand-drawn images) and project them onto the natural image manifold. We start by noising the images using the forward process, which samples random noise using torch.randn, and then iteratively denoise the images at different start indices using classifier-free guidance to effectively make edits to our images. Similar to the previous part, we generate ranges of images which gradually start to resemble the original image more, as there are more timesteps over which to iteratively denoise. I downloaded one image of a strawberry off the web, as I wanted to see how our algorithm handled the general oval shape of a strawberry, and I hand drew two rectangular images of a house and a TV to see how the model would handle rectangular shapes. I ran the algorithm for the same noise levels as above [1, 3, 5, 7, 10, 20] while also adding 25, as I wanted to experiment with a higher start index and see its effect (it led to images very similar to the originals). The hand-drawn images were center cropped, so there is a bit of warping in their dimensions.
We can use the same procedure as in the previous parts to implement inpainting, following the RePaint paper. We repaint just a single section of the image using a binary mask m, with 1s in the region we want to replace (inpaint) and 0s in the region we want to keep from the original image. At every denoising step, we keep the denoiser's output where m = 1 and force the pixels where m = 0 back to the (appropriately noised) original image, giving the update x_t <- x_t * m + (1 - m) * forward(x_0, t). We then continue denoising the resulting image, so the original content is preserved and we are intuitively just "repainting" one area. Below, I inpainted three images (the test image of the Campanile alongside two images of my choosing). I modified the size and position of the masks to inpaint the images in unique ways. The results are displayed below.
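A sketch of the inpainting loop under this convention, where m = 1 marks the region being repainted; cfg_denoise_step is a hypothetical helper standing in for a single step of the CFG iterative denoiser from Part 1.6:

```python
import torch

def inpaint(unet, x_0, mask, strided_timesteps, prompt_embeds, uncond_embeds, alphas_cumprod):
    """Inpaint the region where mask == 1 while keeping the rest of x_0 fixed."""
    x_t = torch.randn_like(x_0)  # the repainted region starts from pure noise
    for i in range(len(strided_timesteps) - 1):
        # One CFG denoising step (as in Part 1.6).
        x_t = cfg_denoise_step(unet, x_t, i, strided_timesteps,
                               prompt_embeds, uncond_embeds, alphas_cumprod)
        # Force the "keep" region back to the original image, noised to the next level.
        t_prime = strided_timesteps[i + 1]
        x_t = x_t * mask + (1 - mask) * forward(x_0, t_prime, alphas_cumprod)
    return x_t
```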
In this part, we run the same algorithm as SDEdit, but with an additional step of guiding the projection and the result of the iterative denoising function with a text prompt. As a result, the algorithm is no longer a "pure" projection to the natural image manifold, as it also adds control using language. The prompt I used was "a rocket ship", drawn from the precomputed text embeddings, as it is computationally intensive to compute embeddings ourselves given Google Colab's compute limits. In summary, I changed the prompt from "a high quality photo" to "a rocket ship" to generate a range of images which gradually transition from a rocket to the original image, run on 3 images.
In this part, we implement Visual Anagrams and create optical illusions with diffusion models. To do this, we denoise an image x_t at timestep t with prompt A to obtain noise estimate noise_A; at the same time, we flip the noisy image upside down and denoise it with prompt B to obtain noise_B. We then flip the noise_B estimate back and average the two noise estimates. The final step is to perform a reverse diffusion step with the averaged noise estimate, giving us a visual anagram: the image matches prompt A right-side up and prompt B upside down!
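One denoising step of this algorithm is sketched below (embeds_A and embeds_B are the two prompt embeddings; flipping along the height dimension turns the image upside down):

```python
import torch

with torch.no_grad():
    # Prompt A on the image as-is, prompt B on the upside-down image.
    noise_A = unet(x_t, t, encoder_hidden_states=embeds_A).sample[:, :3]
    noise_B = unet(torch.flip(x_t, dims=[-2]), t, encoder_hidden_states=embeds_B).sample[:, :3]

# Flip prompt B's estimate back upright and average; then take a normal reverse step.
noise_est = (noise_A + torch.flip(noise_B, dims=[-2])) / 2
```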
I tried 3 combinations of prompts to create visual anagrams, displayed below: 1) "an oil painting of people around a campfire" and "an oil painting of an old man", 2) "a lithograph of a skull" and "a lithograph of waterfalls", 3) "a lithograph of waterfalls" and "an oil painting of a snowy mountain village".
In this part, we implement Factorized Diffusion and create hybrid images similar to Project 2. To create hybrid images with a diffusion model, we use a technique similar to the one above, creating a composite noise estimate by estimating the noise with two different text prompts and then combining the low frequencies from one noise estimate with the high frequencies from the other. The low- and high-frequency blending is essentially the same algorithm we implemented to create hybrid images in Project 2!
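One step of the composite noise estimate is sketched below; the Gaussian blur stands in for the low-pass filter, and the kernel size / sigma are illustrative:

```python
import torchvision.transforms.functional as TF

with torch.no_grad():
    noise_far = unet(x_t, t, encoder_hidden_states=embeds_far).sample[:, :3]      # seen from afar
    noise_close = unet(x_t, t, encoder_hidden_states=embeds_close).sample[:, :3]  # seen up close

def lowpass(e):
    return TF.gaussian_blur(e, kernel_size=33, sigma=2.0)

# Low frequencies from one prompt's estimate, high frequencies from the other's.
noise_est = lowpass(noise_far) + (noise_close - lowpass(noise_close))
```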
Below are results of hybrid images from different text prompts: 1) skull far away, waterfall close up, 2) rocket far away, pencil close up, 3) man in hat far away, dog close up
The goal of part 1 is to build a simple one-step denoiser. Given a noisy image z, I trained a denoiser D such that it maps z to a clean image x. The denoiser optimizes an L2 loss, ||D(z) - x||^2, i.e., the squared norm of the difference between the denoised output and the original image from the training set. The denoising UNet we build here will later let us denoise images with a diffusion model built from scratch!
The denoiser will be implemented as a UNet, consisting of a few downsampling and upsampling blocks with skip connections (connections that carry feature maps from the downsampling path directly across to the corresponding stage of the upsampling path, rather than only passing data through consecutive blocks).
At a high level, the blocks do the following: (1) Conv is a convolutional layer that doesn't change the image resolution, just the channel dimension, (2) DownConv is a convolutional layer that downsamples the tensor by 2, (3) UpConv is a convolutional layer that upsamples the tensor by 2, (4) Flatten is an average pooling layer that flattens a 7x7 tensor into a 1x1 tensor (we chose these dimensions because 7 is the resulting height and width after the downsampling operations), (5) Unflatten is a convolutional layer that unflattens a 1x1 tensor into a 7x7 tensor, (6) Concat is a channel-wise concatenation between tensors with the same 2D shape. We can use torch.cat() from PyTorch to accomplish this.
I also defined composed operations that combine multiple operations at once to make the network deeper and its parameters easier to learn: (1) ConvBlock combines two Convs, (2) DownBlock combines a DownConv and a Conv, (3) UpBlock combines an UpConv and a Conv. The total architecture is visualized below.
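A minimal sketch of the simple and composed blocks in PyTorch, assuming a Conv → BatchNorm → GELU structure for each convolutional operation (UpBlock mirrors DownBlock with a transposed convolution):

```python
import torch.nn as nn

class Conv(nn.Module):
    """3x3 convolution that keeps spatial resolution, changing only the channel count."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class DownConv(nn.Module):
    """Strided convolution that halves the spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class ConvBlock(nn.Module):
    """Composed op (1): two Convs in a row at the same resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(Conv(in_ch, out_ch), Conv(out_ch, out_ch))
    def forward(self, x):
        return self.net(x)

class DownBlock(nn.Module):
    """Composed op (2): downsample by 2, then refine with a Conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(DownConv(in_ch, out_ch), Conv(out_ch, out_ch))
    def forward(self, x):
        return self.net(x)
```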
The problem we are trying to solve with our UNet is to train a denoiser D_theta such that it maps z to a clean image x, using the L2 loss between the predicted denoised image and the original image. To compute this loss, we generate training pairs (z, x), where z is a noised version of a clean MNIST digit image x. I generated z from x using the noising process z = x + sigma * noise, where noise is sampled from N(0, I) and sigma is a fixed parameter. Below is a plot that visualizes the noising process over different values of sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0], assuming the images are normalized to [0, 1].
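A sketch of the noising process used to create training pairs (x is assumed to be a batch of MNIST images normalized to [0, 1]):

```python
import torch

def noise_images(x, sigma):
    """Create noisy inputs z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)

# Visualize the noising process over the sigmas from the plot above.
for sigma in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    z = noise_images(x, sigma)
```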
To train the UNet model that we defined earlier in Part 1.1, we noise images from the MNIST dataset (taken from torchvision.datasets.MNIST, which provides training and test sets) with sigma = 0.5. I shuffled the data when splitting into train and test sets before creating the DataLoader, and trained only on the training set. The parameters used for training were batch size = 256, epochs = 5, hidden dimension D = 128, and the Adam optimizer with learning rate 1e-4. Intuitively, the training process works as follows: it first calculates the loss between the training pairs and the output of the model (starting from the default initialized parameters). It then calculates the gradient of the loss function with respect to the current parameters (which points in the direction of steepest ascent) and subtracts the gradient times the learning rate from the parameters, moving closer every batch/epoch to a minimum of the loss function on the training data, at which point the model is trained and ready to go!
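A sketch of this training loop with the hyperparameters above (UnconditionalUNet is a placeholder name for the UNet from Part 1.1, and noise_images is the helper sketched earlier):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UnconditionalUNet(in_channels=1, num_hiddens=128).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()

for epoch in range(5):
    for x, _ in loader:                      # labels are unused for plain denoising
        x = x.to(device)
        z = noise_images(x, sigma=0.5)       # noisy input
        loss = criterion(model(z), x)        # L2 between denoised output and clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```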
Below is a graph of the loss on the training data over each step of the training process. I also visualized denoised results on images from the test set using the model parameters after the 1st and 5th epochs (as expected, the 5th-epoch results are substantially clearer, as the loss has been further optimized by gradient descent).
Our denoiser was trained on MNIST digits noised with sigma = 0.5. We can visualize how the denoiser performs on other values of sigma that it wasn't trained on. Intuitively, sigmas below 0.5 lead to good results, but sigmas above 0.5 lead to lower-quality results, as there is more noise present in the image than what the UNet denoiser was trained for.
In Part 2, I trained a UNet model that can iteratively denoise an image. In Part 1, we solved the problem of optimizing the L2 loss between the predicted denoised images and the original images from the MNIST dataset. In Part 2, I introduced a small difference: training the UNet to predict the added noise instead of the clean image directly. These are mathematically equivalent, since x = z - sigma * noise. Therefore, the L2 loss we were previously optimizing becomes the L2 loss between the predicted noise and the actual noise that was added.
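In code, the only change from Part 1's loop is the target of the loss, following the write-up's z = x + sigma * noise framing (the timestep conditioning introduced next is omitted here):

```python
eps = torch.randn_like(x)
z = x + sigma * eps
# Part 1:  loss = criterion(model(z), x)     # predict the clean image
loss = criterion(model(z), eps)              # Part 2: predict the added noise instead
```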
One-step denoising does not give high quality results (as seen in Project 5A), so instead we iteratively denoise: we sample a pure-noise image and iteratively denoise it to create a realistic image. When t = 0, x_t is the clean image x_0, and when t = T, x_t is pure noise. Any x_t in between will be a linear combination of the two. Because in 5B we are working with simple MNIST digits, and not real-life images, we set the number of timesteps T in the iterative denoising process to 300 instead of the 1000 used in Part A. Since the variance of x_t varies with t, we condition a single UNet on the timestep t instead of training T separate UNets. Once this whole process is done, we will have an iteratively denoising diffusion UNet model!
To inject t as discussed in the previous section and condition our UNet model, we use a new operator called FCBlock (fully-connected block), which injects the conditioning signal into the UNet. FCBlock contains Linear(F_in, F_out) layers with F_in input and F_out output features, implemented using nn.Linear.
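A minimal sketch of FCBlock, assuming a Linear → GELU → Linear structure (the exact layers may differ from my implementation); its output is later broadcast into the UNet's intermediate feature maps:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Fully-connected block used to inject the (normalized) timestep into the UNet."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, out_features),
            nn.GELU(),
            nn.Linear(out_features, out_features),
        )
    def forward(self, t):
        return self.net(t)
```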
Training the time-conditioned UNet is an iterative process: I picked a random image from the MNIST training dataset and a random t, and trained the denoiser to predict the noise in x_t. I repeated this process for different images and t values until the model converged and gave high quality denoised results. The hidden dimension D was set to 65, and I used the Adam optimizer with learning rate 1e-3 and an exponential learning-rate decay scheduler. The batch size was 128 and the number of training epochs was 20 (as this is a more difficult task than Part 1). Below is a graph of the training losses for the UNet during each step of the training process.
The sampling process is very similar to Part A, except that we don't need to predict the variance like in the DeepFloyd model; instead, we just use our precomputed schedule of beta values. I visualized sampling many different images in a grid using the model parameters at 5 epochs and at 20 epochs. As expected, the 20-epoch results were better, reflecting the more optimized model and lower loss. I also included a training loss curve plot for the time-conditioned UNet over the whole training process.
To make the results better and give more control over image generation, we can also condition our UNet on the digit classes 0-9, adding two more FCBlocks (as previously defined). I did this because the results in the previous section, visualized in the grid, were suboptimal. To make the conditioning easier to work with, we make it a one-hot vector instead of a single scalar. In addition, because we still want our UNet to work without class conditioning, we implement dropout, where 10% of the time we drop the class-conditioning vector by setting it equal to 0. The training method for this modified UNet is the same as the time-only case, with the class conditioning and unconditional generation being the only differences.
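A sketch of how the class-conditioning vector is built, with the one-hot encoding and the 10% dropout to all-zeros described above:

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit class, dropping the conditioning vector ~10% of the
    time (set to all zeros) so the UNet also learns unconditional generation."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) >= p_uncond).float()
    return c * keep
```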
The sampling process is the same as in Part A, where we saw that the conditional results aren't good unless we use classifier-free guidance (which led to significant improvements). To improve the sampling results, I used CFG with a scale of 5.0. The results for this section are visualized in a similar grid format to 2.3!
Overall, I learned a lot; especially since GenAI is so relevant right now, learning about the intricacies of diffusion models and denoising convolutional neural networks was valuable and rewarding!