HUMO AI: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
HUMO AI produces human-focused videos from text, image, and audio inputs. It preserves subject identity from reference images, follows text prompts faithfully, and synchronizes motion with sound. The system is trained with a progressive strategy and supports flexible guidance during inference for practical control over output quality and behavior.
What HUMO AI does
HUMO AI is a unified approach to human-centric video generation. It coordinates three inputs—text, image, and audio—so that each plays a clear role: text provides intent and scene direction, reference images anchor the person's identity and appearance, and audio informs motion timing and mouth dynamics. The model focuses on two core tasks and learns them in stages: keeping the subject consistent across frames and aligning movements with audio.
The framework was introduced to address two practical gaps: paired text-image-audio training data is scarce, and combining subject preservation with audio-visual synchronization is non-trivial. HUMO AI tackles these by building a paired dataset and by training progressively with task-specific strategies, retaining prompt following while adding audio focus where needed. During inference, a time-adaptive guidance strategy lets you adjust how strongly the model follows each input across denoising steps.

Key capabilities
Text-Image video generation
Combine a prompt with one or more reference images to keep appearance, clothing, and style consistent while following the prompt.
Text-Audio video generation
Use a prompt and audio track to guide motion timing and mouth movement without any image reference.
Text-Image-Audio video generation
Direct appearance with images, guide content with text, and align motion with audio for the highest level of control.
Subject preservation
Minimally invasive image injection maintains the base model’s prompt understanding while keeping the subject consistent.
Audio-visual synchronization
A focus-by-predicting strategy encourages the network to attend to the facial regions tied to the audio track.
Time-adaptive guidance
Dynamically adjust weights for text, image, and audio across denoising steps to balance fidelity and control.
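To make the idea concrete, the sketch below shows one simple time-adaptive schedule that linearly interpolates per-modality guidance scales across denoising steps. The scale values and the linear shape are illustrative assumptions, not the exact schedule HUMO AI uses.

# Illustrative time-adaptive guidance schedule (assumed values, not HUMO AI's exact schedule).
def guidance_schedule(step: int, total_steps: int,
                      text_range=(7.5, 5.0), audio_range=(1.0, 3.0)):
    """Linearly interpolate per-modality guidance scales across denoising steps."""
    progress = step / max(total_steps - 1, 1)   # 0.0 at the first step, 1.0 at the last
    scale_t = text_range[0] + progress * (text_range[1] - text_range[0])
    scale_a = audio_range[0] + progress * (audio_range[1] - audio_range[0])
    return scale_t, scale_a

# Example: scales at the start, middle, and end of a 50-step run.
for step in (0, 24, 49):
    print(step, guidance_schedule(step, 50))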
How HUMO AI works at a high level
Training proceeds in two stages. First, the model learns to preserve the subject while keeping the foundation model’s ability to follow text prompts. Second, it learns audio-visual synchronization using audio cross-attention and targeted supervision near facial areas. After the model has learned each skill separately, joint learning brings them together so the system can balance multiple inputs.
During inference, you can set the frame count, resolution, and guidance scales for audio and text. The model supports 480p and 720p, with stronger quality at 720p. It was trained at 25 FPS with 97 frames, so generating much longer videos may reduce quality unless a checkpoint trained for longer durations is used. Multi-GPU inference is available with FSDP and sequence parallelism for larger runs.
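For intuition on how those guidance scales act, one common way to combine multiple conditions at inference is nested classifier-free guidance: start from an unconditional prediction, add a text-guided correction, then an audio-guided correction. The sketch below shows that generic formulation with toy tensors; it is not necessarily HUMO AI's exact computation.

# Generic nested classifier-free guidance over text and audio conditions (toy example).
import torch

def combine_predictions(eps_uncond, eps_text, eps_text_audio, scale_t, scale_a):
    """eps_* are noise predictions under progressively richer conditioning sets."""
    return (eps_uncond
            + scale_t * (eps_text - eps_uncond)
            + scale_a * (eps_text_audio - eps_text))

# Random tensors stand in for the model's predictions.
shape = (1, 4, 8, 8)
guided = combine_predictions(torch.randn(shape), torch.randn(shape), torch.randn(shape),
                             scale_t=7.5, scale_a=2.0)
print(guided.shape)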
Use cases
Character-focused clips
Produce short human-centered clips that keep identity stable across frames and respond to clear text prompts.
Audio-guided performance
Create talking or singing segments where lip movement and body motion follow the timing of the audio.
Prompted reenactment with identity
Use reference images to keep the person’s look while the text prompt sets actions and the scene style.
Educational and demo content
Generate explanatory clips that follow narration timing and show a consistent presenter.
Getting started overview
- Prepare inputs: write a prompt, select reference images if needed, and pick an audio track.
- Set generation parameters: pick mode (TA or TIA), resolution, frame count (see the helper below), and guidance scales.
- Run generation: start inference and monitor results for identity, prompt adherence, and audio sync.
- Adjust: tune scales and steps to balance motion, detail, and timing, then re-run.
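As a quick aid for choosing the frame count, the helper below converts a target clip duration to a frame count at the model's 25 FPS training rate. The function name is only an illustration, not part of the project's code.

# Convert a target clip length to a frame count at the 25 FPS training rate.
TRAINING_FPS = 25

def frames_for_duration(seconds: float, fps: int = TRAINING_FPS) -> int:
    """Round a duration in seconds to the nearest whole number of frames."""
    return round(seconds * fps)

print(frames_for_duration(3.88))  # ~97 frames, the reference training length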
HUMO AI Videos in Action
Below are sample outputs organized by category.
Video generation from Text + Image
Video generation from Text + Audio
Video generation from Text + Image + Audio
Subject preservation
Audio-visual synchronization
Text control / Edit
Installation
This guide summarizes a simple way to prepare an environment and run HUMO AI for human-centric video generation. It reflects the common steps found in the project resources and the notes in our homepage overview.
1. Environment setup
conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
Use a CUDA build compatible with your system. FFmpeg is required for reading and writing video.
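As a quick sanity check after the installation steps, the snippet below confirms that PyTorch can see a CUDA device and that flash-attn imports cleanly. It is a convenience check only, not part of the HUMO AI tooling.

# Quick environment check: confirm CUDA is visible and flash-attn imports cleanly.
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as err:
    print("flash_attn not importable:", err)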
2. Model preparation
Download the required components into a local weights/ directory. The following examples show how to fetch the base text-to-video model, the HUMO checkpoints, an audio encoder, and an optional audio separator for cleaner speech.
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
- HUMO checkpoints: released in different sizes; pick the one that fits your GPU memory.
- Text encoder and VAE: required by the underlying video pipeline.
- Audio encoder: used for audio-visual synchronization in TA/TIA modes.
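If you prefer to download from Python instead of the CLI, huggingface_hub provides an equivalent snapshot_download call; the repository IDs and target folders below simply mirror the commands above.

# Python alternative to the huggingface-cli commands above.
from huggingface_hub import snapshot_download

repos = {
    "Wan-AI/Wan2.1-T2V-1.3B": "./weights/Wan2.1-T2V-1.3B",
    "bytedance-research/HuMo": "./weights/HuMo",
    "openai/whisper-large-v3": "./weights/whisper-large-v3",
    "huangjackson/Kim_Vocal_2": "./weights/audio_separator",
}

for repo_id, local_dir in repos.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)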
3. Configure generation
Edit your configuration file (for example, generate.yaml) to set clip length, resolution, and guidance strengths.
generation:
  frames: 97       # number of frames
  scale_a: 2.0     # audio guidance strength
  scale_t: 7.5     # text guidance strength
  mode: "TA"       # "TA" for text+audio; "TIA" for text+image+audio
  height: 720
  width: 1280
diffusion:
  timesteps:
    sampling:
      steps: 50    # denoising steps (30–50 is a common range)
720p typically yields stronger detail than 480p. The reference training covers 97 frames at 25 FPS; longer clips may need checkpoints trained for longer durations.
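Before launching a run, a small check like the one below can confirm the settings make sense. It assumes the YAML layout shown above, a config file named generate.yaml, and that PyYAML is available.

# Load the generation config and flag clips much longer than the training length.
import yaml

with open("generate.yaml") as f:
    cfg = yaml.safe_load(f)

frames = cfg["generation"]["frames"]
duration_s = frames / 25  # the model was trained at 25 FPS
print(f"{frames} frames -> {duration_s:.2f} s at 25 FPS")

if frames > 97:
    print("Note: longer than the 97-frame training length; quality may drop "
          "unless a checkpoint trained for longer durations is used.")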
4. Prepare inputs
- Text: a short prompt describing the scene and action.
- Images (optional): one or more reference images to fix appearance and identity.
- Audio (optional but recommended for TA/TIA): speech or music to guide timing and lip movement.
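Whisper-style audio encoders expect 16 kHz mono input, so a light preprocessing pass can help when your source audio differs. The sketch below uses torchaudio with placeholder file names; the project's own pipeline may handle this step internally.

# Resample the driving audio to 16 kHz mono, the rate Whisper-style encoders expect.
import torchaudio

waveform, sample_rate = torchaudio.load("input_audio.wav")  # placeholder path
waveform = waveform.mean(dim=0, keepdim=True)               # downmix to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
torchaudio.save("input_audio_16k.wav", waveform, 16000)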
5. Run generation
Text + Audio (TA)
bash infer_ta.sh
Text + Image + Audio (TIA)
bash infer_tia.sh
Check the outputs, then adjust guidance scales or steps to balance prompt following, identity preservation, and audio sync. Multi-GPU runs can be enabled with FSDP and sequence parallelism if provided by your setup.
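To compare a few guidance settings without editing the config by hand, a small wrapper like the sketch below can rewrite generate.yaml and re-run the provided script. It assumes the script reads generate.yaml and that PyYAML is installed; adapt the paths to your setup.

# Hypothetical sweep over audio guidance strengths, re-running the TA script each time.
import subprocess
import yaml

for scale_a in (1.5, 2.0, 3.0):
    with open("generate.yaml") as f:
        cfg = yaml.safe_load(f)
    cfg["generation"]["scale_a"] = scale_a
    with open("generate.yaml", "w") as f:
        yaml.safe_dump(cfg, f)
    subprocess.run(["bash", "infer_ta.sh"], check=True)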
6. Tips
- Use clean audio for better synchronization; optional separation can reduce background noise.
- Start at 50 denoising steps; reduce to 30–40 for faster trials.
- Keep prompts concise. Use reference images to lock appearance when needed.
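Because the clip length in seconds is the frame count divided by 25, it also helps to check that the driving audio roughly matches that length. The sketch below does a simple check and trim with torchaudio and placeholder file names.

# Check that the driving audio roughly matches the clip length (frames / 25 seconds).
import torchaudio

frames, fps = 97, 25
clip_seconds = frames / fps

waveform, sample_rate = torchaudio.load("input_audio_16k.wav")  # placeholder path
audio_seconds = waveform.shape[-1] / sample_rate
print(f"clip: {clip_seconds:.2f} s, audio: {audio_seconds:.2f} s")

if audio_seconds > clip_seconds:
    # Trim so the audio and the generated video end together.
    waveform = waveform[:, : int(clip_seconds * sample_rate)]
    torchaudio.save("input_audio_trimmed.wav", waveform, sample_rate)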
Practical notes from the project
- Text-Image mode supports appearance control for clothing, makeup, and props.
- Text-Audio mode removes the need for image references, focusing on audio-driven motion.
- For the most control, combine all three inputs in Text-Image-Audio mode.
- Stronger results at 720p; 480p is available for faster experiments.
- Trained at 97 frames and 25 FPS; longer outputs may need another checkpoint.
- Multi-GPU inference with FSDP and sequence parallelism is supported.