Introduction
Video generative models have made tremendous strides recently. Although advancements in language modeling are impressive, with their capacity to tackle increasingly intricate tasks, generating realistic videos poses a unique challenge. As humans, our brains have evolved over millions of years to instinctively notice even the slightest visual inconsistencies, making realistic video generation a remarkably complex task. In a previous article, we discussed HunyuanVideo, a leading open-source video generation model that has caught up to impressive closed-source models like Sora and Veo 2.
Beyond the typical use case of entertainment, areas of increased research interest for video generation models include predicting protein folding dynamics and modeling real-world environments for embodied intelligence (e.g., robotics, self-driving cars). Advances in video generation models could be instrumental for scientific research and our ability to model complex living systems.
Introducing Wan 2.1
On February 26th, 2025, Wan 2.1, a collection of open-source video foundation models, was released. The series consists of four distinct models divided into two categories: text-to-video and image-to-video. The text-to-video category includes T2V-14B and T2V-1.3B, while the image-to-video category features I2V-14B-720P and I2V-14B-480P. These models range in size from 1.3 billion to 14 billion parameters. The larger 14B model particularly excels in scenarios requiring substantial motion, producing videos at 720p resolution with realistic physics. Meanwhile, the smaller 1.3B model offers an excellent trade-off between quality and efficiency, allowing users to generate 480p videos on standard hardware in about 4 minutes.
On February 27th, 2025, Wan 2.1 was integrated into ComfyUI, an open-source node-based interface for creating images, videos, and audio with GenAI, and on March 3rd, 2025, Wan 2.1's T2V and I2V models were integrated into Diffusers, a popular Python library developed by Hugging Face that provides tools and implementations for diffusion models.
In this figure, we can see that with fewer parameters, Wan-VAE achieves higher efficiency (frames/latency) and a peak signal-to-noise ratio (PSNR) comparable to HunyuanVideo's.
Prerequisites
There are two parts to this tutorial: (1) an overview covering the model architecture and training methodology, and (2) an implementation where we run the model. Note that the overview section of this article may be updated once Wan 2.1's full technical report is released. For the first part of the tutorial, an understanding of deep learning fundamentals is essential for following along with the theory. Some exposure to concepts discussed in this tutorial may also be helpful (e.g., autoencoders, diffusion transformers, flow matching). To complete the second part of this tutorial, a GPU is required. If you don't have access to a GPU, consider signing up for a DigitalOcean account to use a GPU Droplet. Feel free to skip the overview section if you're only interested in implementing Wan 2.1.
Overview
A Refresher on Autoencoders
An autoencoder is a neural network designed to replicate its input as its output. For instance, an autoencoder can convert a handwritten digit image into a compact, lower-dimensional representation known as a latent representation, then reconstruct the original image. Through this process, it learns to compress information efficiently while minimizing reconstruction error. Variational Autoencoders (VAEs), on the other hand, encode data into a continuous, probabilistic latent space rather than the fixed, deterministic representation of traditional autoencoders. This allows for the generation of new, diverse data samples and smooth interpolation between them, which is critical for tasks like image and video generation.
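To make the distinction concrete, here is a minimal PyTorch sketch (illustrative only, not Wan 2.1 code): a plain autoencoder with a deterministic latent code next to a VAE-style encoder that predicts a mean and log-variance and samples from them. The layer sizes and names are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Plain autoencoder: compresses the input to a fixed latent code and reconstructs it."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # deterministic latent code
        return self.decoder(z)       # reconstruction

class VAEEncoder(nn.Module):
    """VAE encoder: predicts a distribution (mean, log-variance) and samples a latent from it."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing through mu/logvar.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```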
A Refresher on Causal Convolutions
Causal convolutions are a type of convolution specifically designed for temporal data, ensuring that the model's predictions at any given timestep t depend only on past timesteps (t-1, t-2, …) and not on any future timesteps (t+1, t+2, …).
(Source) Standard (left) vs. Causal (right) Convolutions
Causal convolutions are applied in different dimensions depending on the data type, as the sketch after the list below illustrates for the 1D case.
- 1D: Audio
- 2D: Image
- 3D: Video
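The usual trick for making a convolution causal is to pad only on the "past" side of the time axis. The following PyTorch sketch (an illustrative example, not Wan 2.1 code) shows this for the 1D case:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees the current and past timesteps."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation   # amount of past context the kernel needs
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))   # pad only on the left (past), never the right (future)
        return self.conv(x)

x = torch.randn(1, 8, 16)                  # 16 timesteps
y = CausalConv1d(8)(x)
print(y.shape)                             # torch.Size([1, 8, 16]) — same length, no future leakage
```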
Wan-VAE: A 3D Causal Variational Autoencoder
A 3D Causal Variational Autoencoder, as implemented in Wan 2.1, is an advanced type of VAE that incorporates 3D causal convolutions, allowing it to handle both spatial and temporal dimensions in video sequences.
This novel 3D causal VAE architecture, termed Wan-VAE, can efficiently encode and decode 1080P videos of unlimited length while preserving historical temporal information, making it suitable for video generation tasks.
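Wan's exact implementation isn't reproduced here, but the core idea of a 3D causal convolution can be sketched as follows: pad symmetrically in the spatial dimensions and only toward the past in the temporal dimension. The padding layout and channel counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal in time but ordinary in the spatial dimensions."""
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        # F.pad order for 5D input: (w_left, w_right, h_top, h_bottom, t_front, t_back).
        # Spatial padding is symmetric; temporal padding is applied only on the "past" side.
        self.pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):                    # x: (batch, channels, time, height, width)
        return self.conv(F.pad(x, self.pad))

video = torch.randn(1, 3, 17, 64, 64)        # 17 frames of 64x64 RGB
print(CausalConv3d(3, 16)(video).shape)      # torch.Size([1, 16, 17, 64, 64])
```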
Feature Cache and Chunking
Processing long videos in a single pass can lead to GPU memory overflow due to high-resolution frame data and temporal dependencies. Thus, the causal convolution module has a feature cache mechanism that provides historical information without retaining the full video in memory. Here, video sequence frames are structured in a "1 + T" input format (1 initial frame + T subsequent frames), dividing the video into "1 + T/4" chunks.
For example: A 17-frame video (T=16) becomes 1 + 16/4 = 5 chunks.
Each encoding/decoding operation processes a single video chunk at a time, which corresponds to a single latent representation. To reduce the risk of GPU memory overflow, the number of frames in each processing chunk is limited to a maximum of 4. This frame limit is determined by the temporal compression ratio, which measures how much the time dimension is compressed.
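As a concrete illustration of the "1 + T" chunking scheme described above, here is a simplified Python sketch. It only computes which frame indices fall into each chunk; the real feature cache additionally carries convolution state from one chunk to the next.

```python
def chunk_frame_indices(num_frames, chunk_size=4):
    """Split a video of 1 + T frames into 1 + T/4 chunks: the leading frame alone,
    then groups of at most `chunk_size` subsequent frames."""
    assert num_frames >= 1
    chunks = [[0]]                                    # the single leading frame
    rest = list(range(1, num_frames))                 # the T subsequent frames
    for i in range(0, len(rest), chunk_size):
        chunks.append(rest[i:i + chunk_size])
    return chunks

# A 17-frame video (T = 16) becomes 1 + 16/4 = 5 chunks.
print(chunk_frame_indices(17))
# [[0], [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
```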
Text-to-Video (T2V) Architecture
The T2V models generate videos from text prompts.
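If you'd rather run the T2V model in Python than in ComfyUI, the Diffusers integration mentioned earlier can be used roughly as follows. This is a minimal sketch based on the Diffusers documentation at the time of writing; the class names (`WanPipeline`, `AutoencoderKLWan`), the model ID, and the default generation arguments may differ depending on your Diffusers version, so treat it as a starting point rather than a definitive recipe.

```python
# Minimal text-to-video sketch using the Diffusers integration of Wan 2.1.
# Assumes a recent diffusers release with Wan support and a CUDA-capable GPU.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"   # the smaller 1.3B text-to-video checkpoint

# The VAE is loaded in float32 for numerical stability; the rest of the pipeline in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

output = pipe(
    prompt="A cat walking across a sunlit kitchen floor, realistic",
    negative_prompt="blurry, distorted, low quality",
    height=480,
    width=832,
    num_frames=81,        # number of video frames to generate
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=15)
```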
Image-to-Video (I2V) Architecture
The I2V models generate videos from images guided by text prompts. The table below summarizes the main components of the conditioning pipeline; a small sketch of how they are combined follows the table.
| Component | Description |
| --- | --- |
| Condition Image | Video synthesis is controlled with a condition image as the first frame. |
| Guidance Frames | Frames filled with zeros (guidance frames) are concatenated with the condition image along the temporal axis. |
| Condition Latent Representation | A 3D VAE is used to compress the guidance frames into a condition latent representation. |
| Binary Mask | A binary mask is added (1 for preserved frames, 0 for frames to generate). This mask, spatially aligned with the condition latent representation, extends temporally to match the target video's length. |
| Mask Rearrangement | The binary mask is then reshaped to align with the VAE's temporal stride, ensuring seamless integration with the latent representation. |
| DiT Model Input | The noise latent representation, the condition latent representation, and the rearranged binary mask are combined by concatenating them along the channel axis. This combined input is then fed into the DiT model. |
| Channel Projection | Due to the increased channel count compared to T2V models, a supplementary projection layer, initialized with zeros, is added to adapt the input for the I2V DiT model. |
| CLIP Image Encoder | A CLIP image encoder extracts feature representations from the condition image, capturing its visual essence. |
| Global Context MLP | These extracted features are projected by a three-layer MLP, generating a global context that encapsulates the image's overall information. |
| Decoupled Cross-Attention | This global context is then injected into the DiT model via decoupled cross-attention, allowing the model to leverage the condition image's features throughout the video generation process. |
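The "DiT Model Input" step above amounts to a channel-wise concatenation of three tensors. The shapes and channel counts in this sketch are illustrative assumptions rather than the model's real dimensions:

```python
import torch

# Illustrative latent shapes: (batch, channels, latent_time, latent_height, latent_width)
b, t, h, w = 1, 21, 60, 104
noise_latent     = torch.randn(b, 16, t, h, w)   # noisy latent to be denoised
condition_latent = torch.randn(b, 16, t, h, w)   # VAE-encoded guidance frames
mask             = torch.zeros(b, 4, t, h, w)    # binary mask rearranged to the VAE's temporal stride
mask[:, :, 0] = 1.0                              # 1 marks the preserved (condition) frame

dit_input = torch.cat([noise_latent, condition_latent, mask], dim=1)  # concatenate along channels
print(dit_input.shape)   # torch.Size([1, 36, 21, 60, 104]) — wider than the T2V input,
                         # hence the extra zero-initialized projection layer in the I2V DiT
```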
Implementation
Wan 2.1 offers flexible implementation options. In this tutorial, we'll use ComfyUI to showcase a seamless way to run the Wan 2.1 I2V model. Before following along, set up a GPU Droplet and pick an image you'd like to convert into a video.
For optimal performance, we recommend selecting an "AI/ML Ready" OS and using a single NVIDIA H100 GPU for this project.
Step 0: Install Python and Pip
apt install python3-pip

Step 1: Install ComfyUI
pip install comfy-cli
comfy install

Select nvidia when prompted "What GPU do you have?"
Step 2: Download the necessary models
cd comfy/ComfyUI/models
wget -P diffusion_models https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/diffusion_models/wan2.1_i2v_480p_14B_fp8_e4m3fn.safetensors
wget -P text_encoders https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors
wget -P clip_vision https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors
wget -P vae https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors

Step 3: Launch ComfyUI
comfy launch

You'll see a URL in the console output. You'll need this URL in a later step to access the GUI.
Step 4: Open VSCode
In VSCode, click "Connect to…" in the Start menu.
Choose “Connect to Host…”.
Step 5: Connect to your GPU Droplet
Click "Add New SSH Host…" and enter the SSH command to connect to your droplet. This command is usually in the format ssh root@[your_droplet_ip_address]. Press Enter to confirm, and a new VSCode window will open, connected to your droplet.
You can find your droplet's IP address on the GPU Droplet page.
Step 6: Access the ComfyUI GUI
In the new VSCode window connected to your droplet, type >sim and select "Simple Browser: Show".
Copy the ComfyUI GUI URL from your web console (from Step 3) and paste it into the Simple Browser.
Press Enter, and the ComfyUI GUI will open.
Step 7: Update the ComfyUI Manager
Click the Manager button in the top right corner. In the menu that pops up, click Update ComfyUI.
You'll be prompted to restart ComfyUI. Click "Restart" and refresh your browser if needed.
Step 8: Load a Workflow
Download the workflow of your choice in JSON format (here, we're using the I2V workflow).
Step 9: Install Missing Nodes
If you're working with a workflow that requires additional nodes, you might encounter a "Missing Node Types" error. Go to "Manager" > "Install missing custom nodes" and install the latest versions of the required nodes.
You'll be prompted to restart ComfyUI. Click "Restart" and refresh your browser if needed.
Step 10: Upload an Image
Step 11: Add Prompts
Positive Prompt vs. Negative Prompt: a "positive prompt" tells a model what to include in its generated output, essentially guiding it towards a specific desired element, while a "negative prompt" instructs the model what to exclude or avoid, acting as a filter that refines the content by removing unwanted aspects.
We will be using the following prompts to get our character to wave:
Positive Prompt: "A portrait of a seated man, his gaze engaging the viewer with a gentle smile. One hand rests on a wide-brimmed hat in his lap, while the other lifts in a gesture of greeting."
Negative Prompt: "No blurry face, no distorted hands, no extra limbs, no missing limbs, no floating hat"
Step 12: Run the Workflow
To run the workflow, select Queue. If you run into errors, ensure the correct files are passed into the nodes.
Would you look at that - we got our character to wave at us.
Feel free to play around with the different parameters to see how the output is affected.
Conclusion
Great work! In this tutorial, we explored the model architecture of Wan 2.1, a cutting-edge collection of video generative models. We also successfully implemented Wan 2.1's image-to-video model using ComfyUI. This accomplishment underscores the rapid advancement of open-source video generation models, foreshadowing a future where AI-generated video becomes an integral tool in various industries, including media production, scientific research, and digital prototyping.
References
https://www.wan-ai.org/about
https://github.com/Wan-Video/Wan2.1
https://stable-diffusion-art.com/wan-2-1/
Additional Resources
Flow Matching for Generative Modeling (Paper Explained)
[2210.02747] Flow Matching for Generative Modeling