HunyuanVideo on GPU Droplets

Jan 18, 2025

The advent of text-to-video models has been one of the many AI miracles of the past year. From SORA to VEO-2, we have seen some truly incredible models hit the closed-source market. These models are capable of generating videos of all kinds, including photorealism, animation, professional-looking effects, and much more. As with seemingly everything else in Deep Learning, the open-source development community has followed the success of these closed-source models closely, and open-source models are always striving to achieve the same video quality and prompt fidelity.

Recently, we have seen the release of two notable AI text-to-video models that are making waves like Stable Diffusion once did: the LTX and HunyuanVideo text-to-video models. LTX's low RAM requirements and Hunyuan's versatility and trainability have surged the popularity of text-to-video models to levels higher than ever.

In this series of articles, we will discuss how to use these incredible models on DigitalOcean's NVIDIA-powered GPU Droplets, starting with a deeper look at HunyuanVideo. Readers can expect to leave this first article with a firmer understanding of how HunyuanVideo and related next-generation text-to-video models work under the hood. After covering the underlying theory, we will provide a demo showing how to get started running the model.

Follow along to learn how to create your own incredible videos with HunyuanVideo and DigitalOcean.

Prerequisites

  • Python: this demo contains intermediate-level Python code. Anyone will be able to copy and paste the code to follow along, but understanding and manipulating the scripts will require Python experience.
  • Deep Learning: we will cover the underlying theory behind the model in the first section of this article, and the terminology used will require familiarity with Deep Learning concepts.
  • DigitalOcean account: we are going to create a GPU Droplet on DigitalOcean, which may require the user to create an account if they have not already.

HunyuanVideo

HunyuanVideo is, arguably, the first open-source model to rival competitive closed-source models for text-to-video generation. To achieve this success, HunyuanVideo's research team made several considerations with respect to its data curation and its pipeline architecture.

[Figure: HunyuanVideo data filtering pipeline]

The data itself was carefully curated and refined in order to use only the most informative training videos with highly dense text descriptions. First, the video data was aggregated from several sources. Then, this data was parsed using a series of hierarchical refinements at each resolution: 256p, 360p, 540p, and 720p. These filtering steps focused on removing any data from the original sources with undesirable traits, and finished with a final step of manual selection. After selecting the video data manually, the researchers developed a proprietary VLM to handle the task of creating descriptions for each video in each of the following categories: a short description, a dense description, and descriptions of the background, style, shot type, lighting, and atmosphere of each video. These structured captions provide the textual basis for training and inference.
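To make this concrete, here is a hypothetical Python sketch of what one such structured caption record could look like; the field names and values are illustrative assumptions, not the researchers' actual schema.

# Hypothetical structured caption for a single training video.
# Field names are illustrative; the paper describes these caption
# categories, but the exact schema is not public.
caption = {
    "short_description": "A surfer rides a large wave at sunset.",
    "dense_description": (
        "A lone surfer in a black wetsuit carves across the face of a "
        "large turquoise wave while the setting sun casts warm orange "
        "light over the ocean."
    ),
    "background": "open ocean with a distant shoreline",
    "style": "photorealistic",
    "shot_type": "wide tracking shot",
    "lighting": "golden-hour backlight",
    "atmosphere": "energetic, dramatic",
}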

[Figure: HunyuanVideo model architecture]

Let’s now look at the model architecture. HunyuanVideo is a powerful video generative model with over 13 billion parameters, one of the largest available to the open-source community. The model was trained in a spatio-temporally compressed latent space, which was compressed using a Causal 3D VAE. Text prompts were then encoded using a large language model and used as the condition. Taking the Gaussian noise and the condition as input, the model generates an output latent, which is decoded into images or videos through the 3D VAE decoder. (Source)
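The overall flow is easier to see as code. Below is a minimal, runnable sketch of that generation loop using stand-in modules; none of these classes, shapes, or the update rule reflect the real HunyuanVideo implementation.

import torch
import torch.nn as nn

# Toy sketch of the generation flow described above: encode the prompt,
# start from Gaussian noise in the compressed latent space, iteratively
# denoise with the transformer, then decode with the 3D VAE decoder.
# Every module here is a stand-in, not a real HunyuanVideo component.

class StubTextEncoder(nn.Module):
    def forward(self, prompt):
        return torch.zeros(1, 77, 64)       # fake text-condition tokens

class StubTransformer(nn.Module):
    def forward(self, latents, t, cond):
        return torch.zeros_like(latents)    # fake noise prediction

class StubVAEDecoder(nn.Module):
    def forward(self, latents):
        return latents                      # a real 3D VAE would upsample time and space

text_encoder, transformer, vae_decoder = StubTextEncoder(), StubTransformer(), StubVAEDecoder()

def generate(prompt, steps=10):
    cond = text_encoder(prompt)
    # (batch, channels, frames, height, width) in the compressed latent space
    latents = torch.randn(1, 16, 8, 32, 32)
    for t in reversed(range(steps)):        # simplified denoising loop
        pred = transformer(latents, t, cond)
        latents = latents - pred / steps    # placeholder update rule
    return vae_decoder(latents)

video_latents = generate("A corgi surfing a wave at sunset")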

[Figure: HunyuanVideo Transformer design]

Looking a bit deeper, we can see the Transformer design in HunyuanVideo above. It employs a unified Full Attention mechanism for superior performance compared to divided spatiotemporal attention, it supports unified generation for both images and videos, and it leverages existing LLM-related acceleration capabilities more effectively, enhancing both training and inference efficiency. (Source)

"To merge textual and ocular accusation effectively, they travel the strategy of a “Dual-stream to Single-stream” hybrid exemplary creation for video generation. In the dual-stream shape of this methodology, video and matter tokens are processed independently done aggregate Transformer blocks, enabling each modality to study its ain due modulation mechanisms without interference. In the single-stream phase, it concatenates the video and matter tokens and provender them into consequent Transformer blocks for effective multimodal accusation fusion. This creation captures analyzable interactions betwixt ocular and semantic information, enhancing wide exemplary performance.” (Source)

[Figure: HunyuanVideo text encoder]

For the text encoder, they “utilize a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only structure [], which has the following advantages: (i) Compared with T5, MLLM after visual instruction finetuning has better image-text alignment in the feature space, which alleviates the difficulty of instruction following in diffusion models; (ii) Compared with CLIP, MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information.” (Source)
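Point (iii) is simple to picture: the encoder sees a fixed system instruction ahead of whatever the user types. Here is a hypothetical sketch, with an instruction of our own invention rather than the one Hunyuan actually uses:

# The MLLM encoder receives a system instruction prepended to the user
# prompt; the instruction below is invented for illustration.
system_instruction = (
    "Describe the video in detail, focusing on the main subjects, "
    "their actions, and the style of the scene."
)
user_prompt = "A corgi surfing a wave at sunset, cinematic lighting"
encoder_input = f"{system_instruction}\n{user_prompt}"
# encoder_input is tokenized and passed through the decoder-only MLLM;
# its hidden states become the conditioning features for the transformer.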

Put together, we have a pipeline for creating new videos or images from just text inputs.

HunyuanVideo Code demo

GPU Selection

To run HunyuanVideo, we first recommend that users have sufficient computing power available: at least 40GB of VRAM, and ideally 80GB. For this, we like to use DigitalOcean's Cloud GPU Droplet offerings. For more details, check out this link to get started on a GPU Droplet.

Once you have chosen a GPU on a suitable cloud platform and started it up, we can move on to the next step.
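Once connected to the Droplet, one quick way to confirm the GPU and its total VRAM before installing anything is with nvidia-smi:

nvidia-smi --query-gpu=name,memory.total --format=csv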

Python code

First, we are going to demonstrate how to run HunyuanVideo with Python code and Gradio. To get started, paste the following into the terminal.

git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo/
pip install -r requirements.txt
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
python -m pip install xfuser==0.4.0
python -m pip install "huggingface_hub[cli]"
huggingface-cli login

You will then be prompted to log in to HuggingFace, which is required to access the models. To download them, after completing the HuggingFace login, paste the following into the terminal.

huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts

Once the downloads are complete, we can launch the web application with this final command:

python3 gradio_server.py --flow-reverse --share

This will create a publicly accessible, shareable link that we can open in our local machine's browser.
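If you prefer to skip the web UI, the repository also includes a sample_video.py script for command-line generation. The invocation below mirrors the example in the repo's README at the time of writing; run python3 sample_video.py --help to confirm the current flags.

python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --use-cpu-offload \
    --save-path ./results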

[Screenshot: HunyuanVideo Gradio web interface]

From here, we can take advantage of our powerful GPU to start generating videos. First, enter a descriptive and detailed prompt into the text input. Then, we suggest starting with a low resolution (540p) to generate the first video more quickly. Use the default settings with this change until you find a video you like. Then, using the advanced options, set a fixed seed so that you can recreate an upscaled version of the same video at a higher resolution. We can also increase the number of inference steps, which we found to have a greater effect on the final video quality than the output resolution.
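The same low-resolution-first workflow can also be reproduced from the command line. The sketch below assumes the sampling script exposes a --seed flag (present in the repo's argument parser at the time of writing; verify with --help): generate a quick draft at 540p, then re-render the identical seed and prompt at 720p with more steps.

# fast 540p draft
python3 sample_video.py --video-size 544 960 --video-length 129 \
    --infer-steps 30 --seed 42 --flow-reverse \
    --prompt "A cat walks on the grass, realistic style." --save-path ./results

# once happy, re-render the same seed at 720p with more steps
python3 sample_video.py --video-size 720 1280 --video-length 129 \
    --infer-steps 50 --seed 42 --flow-reverse \
    --prompt "A cat walks on the grass, realistic style." --save-path ./results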


The model is incredibly versatile and easy to use. In our testing, we found that it was capable of creating videos in a wide variety of styles, including realism, fantasy, moving artwork, animation in both 2D and 3D, and much more. We were particularly impressed by the realism the model could produce for human figures. We even found some success doing basic effects work with realistic characters. In particular, attention should be paid to how exceptional HunyuanVideo is at generating all aspects of the human body and face. It does seem to struggle with hands, but that is still the case for most diffusion-based image synthesis models and is to be expected. Additionally, it's worth noting that the model is highly detailed in the foreground while being somewhat lacking in detail in the background; a blur seems to cover much of the background even at higher step counts. Overall, we found the model to be very effective, and well worth the cost of running the GPU.

Here is a sample video we made by compositing five HunyuanVideo samples with a MusicGen sample audio track. As you can see, the possibilities are truly endless as more developments and fine-tunes come out for this awesome model.

Conclusion

HunyuanVideo is a really awesome first effort at closing the gap between open- and closed-source video generation models. While it does not seem to quite match the high visual quality touted by models like VEO-2 and SORA, HunyuanVideo does an admirable job of matching the diversity of subjects covered by these models during training. In the near future, we can expect to see more rapid steps forward for video models now that open-sourcing has hit this particular sector of development, especially from players like Tencent.

Look out for part 2 of this series, where we will cover Image-to-Video generation with LTX Video!
