A BROAD AND SIMPLIFIED VIEW OF HOW SORA AI WORKS

The hype around OpenAI's new video generation model, Sora, is at its peak, and deservedly so for such a state-of-the-art model.

But do you know how it works? Well, this article will walk you through how Sora works at a high level. I will try to simplify the technical terminology as much as possible.

So basically, the user enters a prompt and the AI generates a minute-long, high-quality video based on that prompt’s instructions.

So first, it needs to understand the context of the prompt; then it has to generate a clip that contains everything the user mentioned in the prompt.

So, to process the prompt, or the user's instructions, something called a "transformer" is used.

For the video generation, "diffusion models" are typically used.

Then we will see how Sora combines the two in a "diffusion transformer".


So, let's start with the transformer first:

Imagine something like this:

To understand a foreign language, you break a sentence down word by word, convert each word into your native language, and keep in mind how the words are ordered (since the same words in a different order can completely change the meaning of the whole sentence). Once you understand the sentence, you follow the same process in reverse to give back a response.

That’s pretty much what a transformer is.

1. Tokenization: the prompt is broken down word by word, or into tokens. Embeddings are the numerical values assigned to each token, since a machine understands numbers better than words. Positional encoding is then added to keep track of the order of the words.

2. Transformer block: this is where the model understands the context.
    - Self-attention mechanism: identifies and weighs the importance of different parts of the input sequence.
    - Feedforward neural network: further processes each token's representation, helping the model predict the next words.

3. Softmax layer: converts the raw output into a meaningful probability distribution. (A small code sketch of all three steps follows right after this list.)
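
To make those three steps concrete, here is a tiny PyTorch sketch that walks a made-up prompt through tokenization, embeddings, positional encoding, a self-attention plus feedforward block, and a final softmax. The vocabulary, prompt, and layer sizes are all invented for illustration; this is not how Sora or any production model is actually built.

```python
import torch
import torch.nn as nn

# Made-up vocabulary and prompt, purely for illustration.
vocab = {"a": 0, "dog": 1, "chasing": 2, "ball": 3, "on": 4, "the": 5, "beach": 6}
prompt = "a dog chasing a ball on the beach"

# 1. Tokenization: split the prompt and map each token to an integer id.
token_ids = torch.tensor([[vocab[w] for w in prompt.split()]])   # shape: (1, 8)

d_model = 16  # tiny embedding size, just for the example

# Embeddings: each token id becomes a vector of numbers the model can work with.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
x = embedding(token_ids)                                          # shape: (1, 8, 16)

# Positional encoding (a learned position embedding here) keeps track of word order.
pos_embedding = nn.Embedding(num_embeddings=64, embedding_dim=d_model)
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)         # shape: (1, 8)
x = x + pos_embedding(positions)

# 2. Transformer block: self-attention followed by a feedforward network.
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)
attn_out, _ = attention(x, x, x)            # every word looks at every other word
ffn = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
h = ffn(attn_out)

# 3. Softmax layer: turn raw scores into a probability distribution over the vocabulary.
to_vocab = nn.Linear(d_model, len(vocab))
probs = torch.softmax(to_vocab(h[:, -1]), dim=-1)  # probabilities for the next token
print(probs)                                       # the values sum to 1
```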

Hope this gives you an idea of how the machine is able to understand and respond to user prompts.

Now let’s see what diffusion models are:

“In a diffusion model, noise is added to an image and then the model learns to denoise it to restore the original image”

Let’s take an example:

Imagine a person talking to you who intentionally mispronounces or modifies certain words. That “mispronunciation” or “modification” can be thought of as noise. Since the person is speaking a language you are familiar with, you can easily spot where they mispronounce or modify a word, and you can work out which word was intended because you already know how the correct word is actually pronounced.

Now, here is how a diffusion model works:

1. It is first trained on a set of images.

2. Noise is deliberately added to an image, which in simple terms means the image is distorted and made unclear by adding random variations or disturbances (the forward diffusion process).

3. The model is then trained to denoise the image by predicting the original clean image from the noisy version (the reverse diffusion process). A toy sketch of both steps follows right after this list.
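
Here is a toy PyTorch sketch of those two steps: noise is mixed into a fake image (forward diffusion), and a tiny stand-in network is trained to predict that noise so the clean image can be recovered (reverse diffusion). The image, the little conv net, and the single noise level are simplified assumptions; real diffusion models use many noise levels and far larger denoisers.

```python
import torch
import torch.nn as nn

image = torch.rand(1, 3, 32, 32)          # a fake 3-channel 32x32 "training image"

# Forward diffusion: distort the image by mixing in random noise.
noise = torch.randn_like(image)
noise_level = 0.5
noisy_image = (1 - noise_level) * image + noise_level * noise

# Reverse diffusion: train a model to predict the noise that was added,
# so the clean image can be recovered. A tiny conv net stands in for the real denoiser.
denoiser = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(100):                                      # a few training steps on this one image
    predicted_noise = denoiser(noisy_image)
    loss = nn.functional.mse_loss(predicted_noise, noise)    # learn to spot the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Denoising: subtract the predicted noise to approximate the original clean image.
restored = (noisy_image - noise_level * denoiser(noisy_image)) / (1 - noise_level)
```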

Diffusion transformer:

Training data: The diffusion transformer is pretrained on text data, video data, annotations and metadata, and aligned text-video pair data. By iteratively applying noise to the images in the dataset and then attempting to denoise them back to their original form, the model learns their underlying structure and patterns.

Attention Mechanisms: The model utilizes self-attention mechanisms to analyze relationships between different parts of the input text and learn the context of the prompt.
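
For the curious, this "weighing of importance" boils down to a small computation called scaled dot-product attention. The sketch below runs it on random vectors standing in for prompt tokens; the numbers are meaningless, it only shows the mechanics.

```python
import torch

# 7 prompt tokens represented as 16-dimensional vectors (random, for illustration).
q = k = v = torch.rand(7, 16)

scores = q @ k.T / (16 ** 0.5)            # how strongly each token relates to each other token
weights = torch.softmax(scores, dim=-1)   # importance weights; each row sums to 1
context = weights @ v                     # each token becomes a weighted mix of the others
print(weights[0])                         # how much token 0 "attends to" every token
```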

Input: Images and videos are divided into small patches, which act as the model's input units. Each patch is a small, localized region of a frame, and the compact numerical representation of a patch is often referred to as a patch embedding.
How are these “patch embeddings” obtained?
Answer - using convolutional neural networks (CNNs).

“For now, just know that CNNs have mechanisms that can extract features from images, capturing information about the shapes, textures, and patterns present in them.”

So the patch embeddings are higher-level representations of image patches.
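
Here is a minimal sketch of how patch embeddings can be produced: a strided convolution (the basic CNN building block) slices a fake image into non-overlapping patches and projects each one into an embedding vector. The image size, patch size, and embedding size are arbitrary choices for illustration, not Sora's actual configuration.

```python
import torch
import torch.nn as nn

image = torch.rand(1, 3, 64, 64)     # a fake RGB image, 64x64 pixels
patch_size = 16
d_model = 128

# A strided convolution both cuts the image into 16x16 patches and projects
# each patch into a 128-dimensional embedding in one step.
to_patches = nn.Conv2d(in_channels=3, out_channels=d_model,
                       kernel_size=patch_size, stride=patch_size)

patch_grid = to_patches(image)                            # shape: (1, 128, 4, 4), a 4x4 grid of patches
patch_embeddings = patch_grid.flatten(2).transpose(1, 2)  # shape: (1, 16, 128)
print(patch_embeddings.shape)                             # 16 patch embeddings of size 128 each
```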



Temporal context encoding: The patch embeddings from individual frames are processed to include the information conveyed by the sequence of frames over time, in other words, how things change and evolve in the video as time progresses.
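
One simple way to picture this step: give every patch embedding an extra "which frame am I from" signal, then treat all frames as one long sequence of spacetime tokens. The sketch below does exactly that with random tensors; it is an illustrative assumption, since Sora's exact encoding scheme has not been published in this detail.

```python
import torch
import torch.nn as nn

num_frames, patches_per_frame, d_model = 8, 16, 128
frame_patches = torch.rand(num_frames, patches_per_frame, d_model)  # fake per-frame patch embeddings

# A learned "which frame is this from" embedding lets the model reason about
# how things change from one frame to the next.
time_embedding = nn.Embedding(num_frames, d_model)
frame_ids = torch.arange(num_frames).unsqueeze(1)                   # shape: (8, 1)
tokens = frame_patches + time_embedding(frame_ids)                  # broadcast across patches

# Flatten all frames into one long sequence of spacetime tokens for the transformer.
tokens = tokens.reshape(num_frames * patches_per_frame, d_model)    # shape: (128, 128)
```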

Conditional generation: The transformer generates each frame of the video sequence conditioned on the information encoded in these embeddings. This ensures that the generated video is consistent with the input text prompt and the temporal context of the video.
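
A common way to condition generation on a prompt is cross-attention, where the video tokens "look at" the text embeddings. The sketch below shows the idea with random tensors; whether Sora conditions its generation in exactly this way is not public, so treat it as an illustrative assumption.

```python
import torch
import torch.nn as nn

video_tokens = torch.rand(1, 128, 256)   # fake (noisy) video patch embeddings
text_tokens = torch.rand(1, 12, 256)     # fake prompt embeddings from the text side

# Cross-attention: the video tokens attend to the prompt embeddings, so the
# generated frames stay consistent with what the user asked for.
cross_attention = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
conditioned, _ = cross_attention(query=video_tokens, key=text_tokens, value=text_tokens)
print(conditioned.shape)                 # (1, 128, 256): same tokens, now prompt-aware
```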

Iterative generation: The diffusion transformer generates the video iteratively, refining the whole set of patches over a series of denoising steps rather than producing the final result in one shot. At each step, the model predicts what to remove based on the conditioning signal (the text embeddings) and the context learned from the surrounding patches.

Sampling and refinement: The generated frames are refined over multiple denoising iterations to improve the quality of the video and make it look more realistic. This refinement benefits from compute: the more compute available, the better the quality of the generated image or video.
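
Putting the last two points together, the generation loop can be pictured like this: start from pure noise and repeatedly subtract the noise the model predicts, with more steps (and more compute) generally yielding a cleaner result. The predict_noise function here is a hypothetical stand-in for the diffusion transformer; the real sampling procedure is considerably more involved.

```python
import torch

def predict_noise(tokens, text_embeddings, step):
    # Hypothetical stand-in for the diffusion transformer's noise prediction.
    return 0.05 * torch.randn_like(tokens)

text_embeddings = torch.rand(1, 12, 256)  # fake prompt embeddings
tokens = torch.randn(1, 128, 256)         # start from pure noise
num_steps = 50                            # more steps (and more compute) -> cleaner, sharper result

for step in range(num_steps):
    noise = predict_noise(tokens, text_embeddings, step)
    tokens = tokens - noise               # each pass removes a little more noise

# After the loop, the refined tokens are decoded back into video frames.
```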


Conclusion:

 
By using the information from the prompt and the structure of images learned during training, the diffusion transformer can generate images and videos from scratch that align with the semantics specified in the prompt.