By Neko Team (Denisa Gal, Ioana Lazar)
Initially, the shape and appearance of the object are represented in a time-invariant
rest pose, which is then warped by time-dependent deformations to match each video frame.
To achieve this, each 3D point in space is associated with three properties: colour,
density, and a canonical embedding, an encoding that maps the point to a feature
descriptor stored by the network.
This embedding is what allows pixels to be matched across different viewpoints and
at different times.
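The canonical model described above can be pictured as a small network queried at a 3D point. The sketch below is only illustrative: the layer sizes, random weights, and 16-dimensional embedding are placeholder assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for a toy canonical model (random, not trained).
W1 = rng.standard_normal((3, 64))
W2 = rng.standard_normal((64, 3 + 1 + 16))  # colour (3) + density (1) + embedding (16)

def canonical_model(point_xyz):
    """Query colour, density, and a 16-D canonical embedding at a 3D rest-pose point."""
    h = np.tanh(point_xyz @ W1)              # hidden features
    out = h @ W2
    colour = 1 / (1 + np.exp(-out[:3]))      # sigmoid keeps colour in [0, 1]
    density = np.log1p(np.exp(out[3]))       # softplus keeps density non-negative
    embedding = out[4:]
    embedding = embedding / np.linalg.norm(embedding)  # unit-norm descriptor for matching
    return colour, density, embedding

colour, density, emb = canonical_model(np.array([0.1, -0.2, 0.3]))
```

Because the embedding lives in the time-invariant rest pose, two pixels from different frames that map to the same canonical point receive the same descriptor, which is what makes cross-view and cross-time matching possible.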
The method is optimised by minimising three types of losses:
reconstruction losses (colour, silhouette, optical flow), feature-registration
losses (which enforce 3D point predictions via the canonical embeddings), and a
3D cycle-consistency regularisation loss.
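The three loss families above are typically combined into one training objective as a weighted sum. The sketch below assumes hypothetical weights chosen for illustration; the paper's actual balancing terms may differ.

```python
def total_loss(colour_l, silhouette_l, flow_l, feature_match_l, cycle_l,
               w_sil=0.1, w_flow=0.5, w_match=1.0, w_cycle=1.0):
    """Combine BANMo-style loss terms into a single scalar objective.

    The weights here are illustrative placeholders, not the paper's values.
    """
    # Reconstruction losses compare rendered outputs with the observed video.
    reconstruction = colour_l + w_sil * silhouette_l + w_flow * flow_l
    # Feature registration ties 2D pixels to 3D canonical points;
    # cycle consistency regularises the forward/backward deformations.
    return reconstruction + w_match * feature_match_l + w_cycle * cycle_l

loss = total_loss(1.0, 1.0, 1.0, 1.0, 1.0)
```

In practice each term would be averaged over sampled rays and frames before being combined; the scalar sum is what the optimiser actually minimises.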
BANMo manages to tackle two major challenges: the high volume of input data,
and the free movement of both the subject and the camera, which it handles
without restrictive assumptions. Moreover, its reconstructions improve as
more data is provided.
Given these strengths, it has proven itself better than previous methods, offering
both more detailed geometry (which ViSER lacks) and better reconstruction
of motion (which Nerfies fails to render).
Bibliography:
https://paperswithcode.com/paper/banmo-building-animatable-3d-neural-model
https://arxiv.org/pdf/2112.12761v2.pdf