S2DM: Sector-Shaped Diffusion Models for Video Generation

Abstract

Diffusion models have achieved remarkable success in image generation. Extending them to video generation, however, introduces significant challenges, particularly in maintaining consistency and continuity across video frames. Existing approaches primarily address these challenges by incorporating spatiotemporal attention modules or additional temporal conditions, but they often overlook the impact of non-shared noise between frames in the diffusion process, which can disrupt both semantic coherence and consistent stochastic details in the video. To tackle this problem, we introduce the Sector-Shaped Diffusion Model (S2DM), which employs a sector-shaped diffusion process with shared noise across frames under specific conditions. S2DM ensures that video frames maintain consistent semantic features and stochastic details, while preserving continuous temporal characteristics through guided conditions. We evaluate S2DM on various conditional video generation tasks, using optical flow or posture information as temporal conditions, and descriptive text or reference images as semantic conditions. Experimental results demonstrate that S2DM outperforms existing methods in generating videos with thematic coherence and smooth narrative progression. For text-to-video generation, where temporal conditions are not explicitly provided, we propose a three-step generation strategy that decouples the generation of temporal characteristics from semantic features. Our results can be viewed at https://s2dm.github.io/S2DM/.
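To make the shared-noise idea concrete, here is a minimal sketch that contrasts shared and non-shared noise when noising a stack of frames. It assumes a standard DDPM-style forward step with a linear beta schedule, which is an illustrative choice and not necessarily the exact S2DM formulation; the names `linear_alpha_bar` and `add_noise`, the tensor layout, and all parameter values are hypothetical.

```python
import numpy as np


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(frames, t, alpha_bar, rng, shared=True):
    """Noise a (F, C, H, W) stack of frames at step t.

    shared=True draws one noise map and reuses it for every frame;
    shared=False draws an independent noise map per frame.
    """
    noise_shape = (1,) + frames.shape[1:] if shared else frames.shape
    eps = rng.standard_normal(noise_shape)
    eps = np.broadcast_to(eps, frames.shape)  # no-op when already per-frame
    a = alpha_bar[t]
    noisy = np.sqrt(a) * frames + np.sqrt(1.0 - a) * eps
    return noisy, eps


# Toy data standing in for 8 video frames of size 3x16x16 (hypothetical).
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 3, 16, 16))
alpha_bar = linear_alpha_bar(1000)

noisy_shared, eps_shared = add_noise(frames, 500, alpha_bar, rng, shared=True)
noisy_indep, eps_indep = add_noise(frames, 500, alpha_bar, rng, shared=False)
```

Under this sketch, shared noise cancels out of the difference between any two noised frames, so only the scaled frame content distinguishes them; with independent noise, each frame also carries its own stochastic detail.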

Pipeline

Performance Comparison (TikTok)

Panels: Reference; Ours; Disco [Wang et al.]; Magic Animate [Xu et al.]

Performance Comparison (MHAD)

Panels: Ours (Flow Conditioned); Ours (Text Conditioned); LFDM [Ni et al.]

Performance Comparison (MUG)

Panels: Ours (Flow Conditioned); Ours (Text Conditioned); LFDM [Ni et al.]

Performance Comparison (OpenVid)

Panels: Reference Image; Ours (Flow Conditioned); SVD [Blattmann et al.]

High-resolution Demo (MHAD, 256×256)

Ablation

Training Shared Noise, Sampling Shared Noise (Ours)

Training Shared Noise, Sampling Non-shared Noise

Training Non-shared Noise, Sampling Non-shared Noise

Training Non-shared Noise, Sampling Shared Noise

A man with glasses, wearing a black coat and blue jeans is tennis forehand swing

A man with glasses, wearing a blue shirt and black jeans is walking

Person 020 is making happiness expression

Person 076 is making happiness expression

BibTeX

TBA