About HUMO AI

HUMO AI focuses on generating human-centered videos from multi-modal inputs. It brings together text for intent, images for identity, and audio for timing cues. The model is trained progressively: it first learns to keep a subject consistent across frames, then to align motion with sound, and finally to balance the two skills together.
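
Read as pseudocode, that curriculum amounts to swapping the training objective between stages. The sketch below is a minimal illustration with placeholder losses; the stage names, loss functions, and step counts are assumptions, not HUMO AI's actual training code.

    # A sketch of the three-stage curriculum; losses are placeholders.
    def identity_loss(batch):
        return 0.0  # placeholder: penalize drift from the reference subject

    def sync_loss(batch):
        return 0.0  # placeholder: penalize audio-visual timing mismatch

    STAGES = [
        ("subject", lambda b: identity_loss(b)),                 # consistency first
        ("audio",   lambda b: sync_loss(b)),                     # then synchronization
        ("joint",   lambda b: identity_loss(b) + sync_loss(b)),  # finally both
    ]

    def train(batches, steps_per_stage=10_000):
        for stage_name, loss_fn in STAGES:
            for _, batch in zip(range(steps_per_stage), batches):
                loss = loss_fn(batch)  # an optimizer step would follow here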

Why HUMO AI

Many systems can follow prompts or preserve identity, but doing both while also responding to audio is difficult. HUMO AI addresses this with minimal-invasive image injection for subject preservation and an audio focus strategy for synchronization. The result is a model that can be directed with simple settings while keeping the subject recognizable and its motion aligned with the audio.
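
As a rough picture of what minimal-invasive injection can look like, the sketch below appends a reference-image latent as one extra frame so the pretrained backbone is reused unchanged; the tensor shapes and the concatenation choice are illustrative assumptions, not HUMO AI's documented internals.

    import torch

    # Hypothetical latent shapes: (batch, channels, frames, height, width).
    noisy_video = torch.randn(1, 16, 24, 60, 104)  # latents being denoised
    ref_latent  = torch.randn(1, 16, 1, 60, 104)   # reference image as one frame

    # Append the reference along the frame axis rather than modifying the
    # backbone, so the denoiser treats it as ordinary temporal context.
    conditioned = torch.cat([noisy_video, ref_latent], dim=2)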

Core capabilities

  • Text-Image generation for appearance control and prompt following.
  • Text-Audio generation for speech and motion guided by sound.
  • Text-Image-Audio generation for fine-grained control of look and timing.
  • Time-adaptive guidance to adjust control strengths during inference, as sketched below.
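
One way to picture time-adaptive guidance is as guidance weights that follow the denoising schedule: lean on audio early, when coarse motion is forming, and on the image later, when appearance sharpens. The schedule shapes and the two-condition combination below are illustrative assumptions, not HUMO AI's exact formulation.

    # Hypothetical guidance schedule over diffusion timesteps.
    def guidance_scales(t, t_max=1000):
        progress = t / t_max                        # 1.0 at the noisiest step
        audio_scale = 1.0 + 4.0 * progress          # strong early, weak late
        image_scale = 1.0 + 4.0 * (1.0 - progress)  # weak early, strong late
        return audio_scale, image_scale

    def guided_noise(eps_uncond, eps_audio, eps_image, t):
        # Classifier-free-guidance-style mix of the two conditions.
        s_a, s_i = guidance_scales(t)
        return (eps_uncond
                + s_a * (eps_audio - eps_uncond)
                + s_i * (eps_image - eps_uncond))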

Model notes

The reference model supports 480p and 720p output; quality improves at 720p. It was trained on 97-frame clips at 25 FPS. Multi-GPU inference is supported via FSDP and sequence parallelism.
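
A quick sanity check of the clip length those training numbers imply:

    FRAMES = 97   # frames per training clip
    FPS = 25      # playback rate
    print(f"{FRAMES / FPS:.2f} s per clip")  # 3.88 s, just under four seconds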