HUMO AI: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

HUMO AI produces human-focused videos from text, image, and audio inputs. It preserves subject identity from reference images, follows text prompts faithfully, and synchronizes motion with the audio. The system is trained with a progressive strategy and supports flexible guidance at inference time for practical control over output quality and behavior.

What HUMO AI does

HUMO AI is a unified approach to human-centric video generation. It coordinates three inputs—text, image, and audio—so that each plays a clear role: text provides intent and scene direction, reference images anchor the person's identity and appearance, and audio informs motion timing and mouth dynamics. The model focuses on two core tasks and learns them in stages: keeping the subject consistent across frames and aligning movements with audio.

The framework was introduced to address two practical gaps: paired text-image-audio (triplet) training data is scarce, and combining subject preservation with audio-visual synchronization is non-trivial. HUMO AI addresses both by building a paired dataset and by training progressively with task-specific strategies, retaining the base model's prompt following while adding audio-driven control where needed. During inference, a time-adaptive guidance strategy lets you adjust how strongly the model follows each input across denoising steps.

HUMO AI Technology Overview

Key capabilities

  • Text-Image video generation

    Combine a prompt with one or more reference images to keep appearance, clothing, and style consistent while following the prompt.

  • Text-Audio video generation

    Use a prompt and audio track to guide motion timing and mouth movement without any image reference.

  • Text-Image-Audio video generation

    Direct appearance with images, guide content with text, and align motion with audio for the highest level of control.

  • Subject preservation

    Minimally invasive image injection preserves the base model’s prompt understanding while keeping the subject consistent.

  • Audio-visual synchronization

    A focus-by-predicting strategy encourages the network to attend to the facial regions tied to the audio track.

  • Time-adaptive guidance

    Dynamically adjust weights for text, image, and audio across denoising steps to balance fidelity and control.

How HUMO AI works at a high level

Training proceeds in stages. First, the model learns to preserve the subject while keeping the foundation model’s ability to follow text prompts. Second, it learns audio-visual synchronization using audio cross-attention and targeted supervision near facial regions. Once each skill has been learned, joint training balances the two so the system can weigh multiple inputs together.

During inference, you can set the frame count, resolution, and guidance scales for audio and text. The model supports 480p and 720p, with stronger quality at 720p. It was trained at 25 FPS with 97 frames, and generating much longer videos may reduce quality unless a longer checkpoint is used. Multi-GPU inference is available with FSDP and sequence parallelism for larger runs.

Use cases

Character-focused clips

Produce short human-centered clips that keep identity stable across frames and respond to clear text prompts.

Audio-guided performance

Create talking or singing segments where lip movement and body motion follow the timing of the audio track.

Prompted reenactment with identity

Use reference images to keep the person’s look while the text prompt sets actions and the scene style.

Educational and demo content

Generate explanatory clips that follow narration timing and show a consistent presenter.

Getting started overview

  1. Prepare inputs: write a prompt, select reference images if needed, and pick an audio track.
  2. Set generation parameters: pick mode (TA or TIA), resolution, frame count, and guidance scales.
  3. Run generation: start inference and monitor results for identity, prompt adherence, and audio sync.
  4. Adjust: tune scales and steps to balance motion, detail, and timing, then re-run.

HUMO AI Videos in Action

Below are sample outputs organized by category.

Video generation from Text + Image

Video generation from Text + Audio

Video generation from Text + Image + Audio

Subject preservation

Audio-visual synchronization

Text control / Edit

Installation

This guide summarizes a simple way to prepare an environment and run HUMO AI for human-centric video generation. It reflects the common steps found in the project resources and the notes in our homepage overview.

1. Environment setup

conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg

Use a CUDA build compatible with your system. FFmpeg is required for reading and writing video.
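
Before downloading weights, a quick sanity check of the environment can save time later. This is a minimal sketch; the version strings printed will vary with your install:

# confirm the PyTorch build can see the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# confirm flash-attn imports cleanly
python -c "import flash_attn; print(flash_attn.__version__)"
# confirm ffmpeg is on PATH
ffmpeg -version | head -n 1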

2. Model preparation

Download the required components into a local weights/ directory structure. The following examples show how to fetch a base text-to-video component, the HUMO checkpoints, an audio encoder, and an optional audio separator for cleaner speech.

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator

  • HUMO checkpoints: released in different sizes; pick the one that fits your GPU memory.
  • Text encoder and VAE: required by the underlying video pipeline.
  • Audio encoder: used for audio-visual synchronization in TA/TIA modes.
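
After the downloads above, the local layout should resemble the following. Directory names follow the --local-dir arguments; the exact files inside each folder depend on the upstream repositories:

weights/
├── Wan2.1-T2V-1.3B/     # base text-to-video model (includes text encoder and VAE)
├── HuMo/                # HUMO AI checkpoints
├── whisper-large-v3/    # audio encoder
└── audio_separator/     # optional vocal separator (Kim_Vocal_2)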

3. Configure generation

Edit your configuration file (for example, generate.yaml) to set length, resolution, and guidance strengths.

generation:
  frames: 97           # number of frames
  scale_a: 2.0         # audio guidance strength
  scale_t: 7.5         # text guidance strength
  mode: "TA"           # "TA" for text+audio; "TIA" for text+image+audio
  height: 720
  width: 1280

diffusion:
  timesteps:
    sampling:
      steps: 50        # denoising steps (30–50 is a common range)

720p typically yields stronger detail than 480p. The reference training covers 97 frames at 25 FPS; longer clips may need checkpoints trained for longer durations.
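
For quicker experiments, only a few keys need to change from the example above. The values below are illustrative: 832x480 is a common 480p size for this model family, but confirm the resolutions your checkpoint actually supports.

generation:
  height: 480
  width: 832           # common 480p size; verify against your checkpoint

diffusion:
  timesteps:
    sampling:
      steps: 30        # faster preview; return to ~50 for final renders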

4. Prepare inputs

  • Text: a short prompt describing the scene and action.
  • Images (optional): one or more reference images to fix appearance and identity.
  • Audio (optional but recommended for TA/TIA): speech or music to guide timing and lip movement.

5. Run generation

Text + Audio (TA)

bash infer_ta.sh

Text + Image + Audio (TIA)

bash infer_tia.sh

Check the outputs, then adjust guidance scales or steps to balance prompt following, identity preservation, and audio sync. Multi-GPU runs can be enabled with FSDP and sequence parallelism if provided by your setup.
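
For multi-GPU runs, the provided infer_*.sh scripts remain the supported entry points. If you need to launch manually, the pattern is usually a torchrun wrapper along these lines; the script name and config flag below are placeholders for illustration, not the project's actual CLI:

# hypothetical launch pattern; adapt to whatever infer_ta.sh / infer_tia.sh actually wrap
torchrun --nproc_per_node=4 generate.py --config generate.yaml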

6. Tips

  • Use clean audio for better synchronization; optional separation can reduce background noise (see the ffmpeg sketch after this list).
  • Start at 50 denoising steps; reduce to 30–40 for faster trials.
  • Keep prompts concise. Use reference images to lock appearance when needed.
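
Building on the first tip: speech encoders such as Whisper operate on 16 kHz mono audio, so normalizing the track up front is a cheap way to avoid sync problems. A minimal sketch; the file names are placeholders:

# convert the input track to 16 kHz mono WAV before running inference
ffmpeg -i narration.mp3 -ac 1 -ar 16000 narration_16k.wav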

Practical notes from the project

  • Text-Image mode supports appearance control for clothing, makeup, and props.
  • Text-Audio mode removes the need for image references, focusing on audio-driven motion.
  • For the most control, combine all three inputs in Text-Image-Audio mode.
  • Stronger results at 720p; 480p is available for faster experiments.
  • Trained at 97 frames and 25 FPS; longer outputs may need another checkpoint.
  • Multi-GPU inference with FSDP and sequence parallelism is supported.

FAQs