MotionStruct4D: Discovering Motion Structure of
Gaussian Splatting for Video-to-4D Generation

University of Oxford
Under review
MotionStruct4D Teaser

Given a single-view monocular video, MotionStruct4D decomposes motion into rigid and non-rigid transformations, yielding high-quality 4D generation results viewable across arbitrary viewpoints and timestamps.

Abstract

Monocular Video-to-4D generation faces the fundamental challenge of inferring plausible 3D geometry and motion from limited single-view inputs. We present MotionStruct4D, a novel approach that discovers and exploits the underlying motion structure of 3D Gaussian Splatting for high-quality video-to-4D generation.

Our key insight is that real-world motion can be effectively decomposed into coarse rigid transformations that capture principal movements, complemented by non-rigid deformations that account for fine-grained detail. MotionStruct4D introduces:

(1) Self-supervised Motion Structure Discovery - Identifies quasi-rigid parts by preserving spatiotemporal relationships without explicit 3D supervision.
(2) Weighted Dense-to-Sparse Optimization - Transitions from dense per-Gaussian deformation to sparse control points, integrating rigid and non-rigid components through adaptive weighted fusion.
(3) Comprehensive Benchmark Dataset - Curated to feature substantial object displacement and diverse articulated motion patterns for rigorous evaluation.
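Contribution (1) rests on a simple observation: Gaussians belonging to the same quasi-rigid part move together, so their trajectories cluster. As a minimal illustrative sketch (not the authors' method; the function name and k-means grouping are our own assumptions), one could group Gaussian centers by clustering their tracked positions over time:

```python
# Illustrative sketch (not the authors' code): group Gaussians into
# quasi-rigid parts by clustering their trajectories -- points that move
# together over time tend to land in the same cluster.
import numpy as np

def discover_parts(trajectories, num_parts, iters=50, seed=0):
    """trajectories: (N, T, 3) center positions of N Gaussians over T frames.
    Returns an (N,) array of part labels via k-means on flattened tracks."""
    rng = np.random.default_rng(seed)
    feats = trajectories.reshape(len(trajectories), -1)   # (N, T*3)
    # Initialize cluster centers from randomly chosen trajectories.
    centers = feats[rng.choice(len(feats), num_parts, replace=False)]
    for _ in range(iters):
        # Assign each trajectory to its nearest cluster center.
        dists = np.linalg.norm(feats[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each non-empty cluster's center.
        for k in range(num_parts):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels
```

The paper's discovery step is self-supervised and preserves spatiotemporal relationships rather than running plain k-means; this sketch only conveys the intuition that co-moving Gaussians form a part.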

Method Overview

Method Architecture

Our pipeline first identifies quasi-rigid parts in a self-supervised manner. We then decompose motion into coarse rigid and fine non-rigid transformations and optimize the representation in a weighted dense-to-sparse fashion.
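The rigid/non-rigid decomposition can be pictured as follows: each Gaussian first follows its part's rigid transform, then a per-Gaussian residual refines the result, and an adaptive weight blends the two. The sketch below is our own assumption of one plausible fusion rule (the function name, `residuals`, and `weights` are hypothetical), not the paper's implementation:

```python
# Illustrative sketch (not the authors' code): fuse a coarse per-part rigid
# transform with a fine non-rigid residual for each Gaussian center.
import numpy as np

def fuse_rigid_nonrigid(centers, part_ids, rotations, translations,
                        residuals, weights):
    """Deform Gaussian centers by their part's rigid transform, then blend
    in a per-Gaussian non-rigid offset with an adaptive weight.

    centers      : (N, 3) Gaussian centers
    part_ids     : (N,)   quasi-rigid part index of each Gaussian
    rotations    : (P, 3, 3) per-part rotation matrices
    translations : (P, 3)    per-part translations
    residuals    : (N, 3) learned non-rigid offsets (hypothetical)
    weights      : (N,)   fusion weight in [0, 1]; 1.0 = fully rigid
    """
    R = rotations[part_ids]             # (N, 3, 3) gather per-Gaussian rotation
    t = translations[part_ids]          # (N, 3)
    rigid = np.einsum('nij,nj->ni', R, centers) + t
    nonrigid = rigid + residuals        # residual refines the rigid estimate
    w = weights[:, None]
    return w * rigid + (1.0 - w) * nonrigid
```

With weight 1 a Gaussian moves purely rigidly with its part; lowering the weight lets the non-rigid residual take over, which is the spirit of the adaptive weighted fusion described above.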

4D Generation Results

MotionStruct4D successfully disentangles global rigid movements from local non-rigid deformations, achieving both high rendering fidelity and consistent multi-view motion. Below we show rendered novel views and their corresponding quasi-rigid part segmentations across multiple viewing angles.

Aurorus: renders and part segmentations at 15°, -75°, 105°, and 195°.

Blooming Rose: renders and part segmentations at 15°, -75°, 105°, and 195°.

Trump: renders and part segmentations at 15°, -75°, 105°, and 195°.

Walking White-Faced Egret: renders and part segmentations at 15°, -75°, 105°, and 195°.

Evaluation Benchmark

Existing video-to-4D datasets often suffer from limited object displacement and insufficient motion diversity. To address these limitations, we curated a challenging benchmark.

The benchmark covers 9 animal species and 25 motion sequences, evaluated at 512 px resolution.

Citation

If you find our work useful, please consider citing:

@article{zhong2025motionstruct4d,
  title={MotionStruct4D: Discovering Motion Structure of Gaussian Splatting for Video-to-4D Generation},
  author={Zhong, Jia-Xing and Lu, Kai and Ye, Jiaojiao and Trigoni, Niki and Markham, Andrew},
  journal={Under review},
  year={2025}
}