The key challenge is that NeRFs typically require images from many viewpoints to reconstruct a scene in 3D, whereas a video provides only a single view over time. That multi-view requirement means capturing a lot of data just to build one NeRF.
What if there were a way to create 3D animated models of humans from monocular video footage using NeRFs?
A new paper addresses this with a novel approach.
- First, they fit a parametric model (SMPL) to align with the subject in each frame of the video. This provides an initial estimate of the 3D shape.
- Second, they transform the NeRF's coordinate system based on the surface of the SMPL model: input points are projected onto the model's surface and their distances to the surface are computed (see the sketch after this list).
- Third, they condition on the SMPL model's joint rotations recovered from the video, which lets the reconstructed model be animated in a variety of poses and adds important pose-dependent shape cues.
- Finally, they use a neural network module to further refine the coordinate transform, correcting any inaccuracies in the SMPL fit to ensure spatial alignments are accurate for rendering.
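To make steps 2 and 3 concrete, here is a minimal sketch of how a 3D query point might be expressed relative to the fitted SMPL surface and conditioned on the joint rotations. This is not the paper's code: the function name, the feature layout, and the nearest-vertex approximation of true point-to-surface projection are all assumptions for illustration.

```python
# Minimal sketch (not the paper's implementation) of a surface-relative
# coordinate transform: each query point is mapped to its nearest point on the
# fitted SMPL mesh, given a signed distance to that surface, and concatenated
# with the pose (joint rotations) before being fed to the NeRF MLP.
import numpy as np

def surface_relative_features(query_pts, smpl_verts, smpl_normals, joint_rots):
    """
    query_pts:    (N, 3) sample points along camera rays
    smpl_verts:   (V, 3) vertices of the SMPL mesh fitted to the current frame
    smpl_normals: (V, 3) per-vertex unit normals of that mesh
    joint_rots:   (J, 3, 3) per-joint rotation matrices from the SMPL fit
    returns:      (N, 3 + 1 + 9*J) features: nearest surface point, signed
                  distance to the surface, and flattened pose rotations
    """
    # Nearest-vertex lookup as a cheap stand-in for true surface projection.
    d2 = ((query_pts[:, None, :] - smpl_verts[None, :, :]) ** 2).sum(-1)  # (N, V)
    nearest = d2.argmin(axis=1)                                           # (N,)
    surf_pts = smpl_verts[nearest]                                        # (N, 3)
    normals = smpl_normals[nearest]                                       # (N, 3)

    # Signed distance along the surface normal: positive outside, negative inside.
    offset = query_pts - surf_pts
    signed_dist = (offset * normals).sum(-1, keepdims=True)               # (N, 1)

    # Pose conditioning: broadcast the flattened joint rotations to every point.
    pose = np.tile(joint_rots.reshape(1, -1), (query_pts.shape[0], 1))    # (N, 9*J)

    return np.concatenate([surf_pts, signed_dist, pose], axis=1)

if __name__ == "__main__":
    # Toy data just to exercise the function (SMPL has 6890 vertices, 24 joints).
    rng = np.random.default_rng(0)
    verts = rng.normal(size=(6890, 3))
    normals = verts / np.linalg.norm(verts, axis=1, keepdims=True)
    rots = np.repeat(np.eye(3)[None], 24, axis=0)
    pts = rng.normal(size=(4, 3))
    feats = surface_relative_features(pts, verts, normals, rots)
    print(feats.shape)  # (4, 220)
```

The point of features like these is that the radiance field sees body-relative coordinates plus pose instead of raw world coordinates, which is what ties it to the SMPL surface and, together with the learned refinement in the last step, makes the model animatable.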
In experiments, they show that the method generates high-quality renderings of subjects from novel views and in poses not seen in the original footage. The results capture nuanced, pose-dependent deformations of clothing and hair. There are example images in the article that really show this off.
Limitations remain in handling extremely complex motions and in recovering detailed face/hand geometry from low-resolution video. But overall, the technique significantly advances the state of the art in reconstructing animatable human models from monocular video.
TLDR: They introduce a NeRF-based technique that turns monocular video into controllable, animatable 3D human models.
Full paper summary here. Paper is here.