In this article, we present FILM, a frame interpolation algorithm developed by Google researchers that synthesizes
multiple intermediate frames from two input images with large in-between motion.
Recent methods use multiple networks to estimate optical flow or depth, plus
a separate network dedicated to frame synthesis. This is often complex and requires scarce optical-flow or depth ground truth. The architecture of FILM has three main stages.
The scale-agnostic feature extraction stage produces feature pyramids whose features carry a similar meaning across scales. It first builds an image pyramid from each input, then runs the same shallow feature extractor at every pyramid level. Features that end up at the same spatial resolution are then concatenated channel-wise, producing the final scale-agnostic feature pyramid.
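The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual FILM network: `downsample` and `shallow_features` are stand-ins for the real strided convolutions, and all names are ours. The point is only how features from different pyramid levels get merged by shared resolution.

```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 average pooling (stand-in for pyramid downsampling)."""
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def shallow_features(img, depth=3):
    """Stand-in for the small shared CNN: a list of `depth` feature maps,
    each at half the resolution of the previous one."""
    feats, cur = [], img
    for _ in range(depth):
        cur = downsample(cur)  # pretend "conv + stride 2"
        feats.append(cur)
    return feats

def scale_agnostic_pyramid(img, levels=3, depth=3):
    """Build an image pyramid, run the SAME shallow extractor at every level,
    then concatenate (here: stack) all features that share a spatial resolution."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    sub = [shallow_features(p, depth) for p in pyramid]
    merged = {}
    for i, feats in enumerate(sub):
        for d, f in enumerate(feats):
            # a map from pyramid level i at extractor depth d lives at scale i + d + 1
            merged.setdefault(i + d + 1, []).append(f)
    # channel-wise concatenation -> one multi-channel map per output scale
    return {lvl: np.stack(fs) for lvl, fs in merged.items()}

pyr = scale_agnostic_pyramid(np.random.rand(64, 64))
```

Because the extractor weights are shared across pyramid levels, a coarse feature at a fine pyramid level and a fine feature at a coarse pyramid level land in the same merged map, which is what makes the pyramid "scale-agnostic".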
The flow estimation stage uses a residual pyramid approach: estimation starts at the coarsest level, and at each finer level the network computes a residual correction to the estimate obtained by upsampling the coarser prediction. Weights are shared across all but the few finest levels of the pyramid.
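The coarse-to-fine loop can be sketched as follows. Again this is an assumed, simplified structure: `predict_residual` is a placeholder that returns a zero correction where FILM would run its (mostly weight-shared) flow network, and nearest-neighbour upsampling stands in for the real operator.

```python
import numpy as np

def upsample_flow(flow):
    """Nearest-neighbour 2x upsampling; flow values are doubled because
    pixel displacements grow with resolution."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def predict_residual(feats_a, feats_b, upsampled_flow):
    """Placeholder for the residual flow network (zero correction here)."""
    return np.zeros_like(upsampled_flow)

def coarse_to_fine_flow(pyr_a, pyr_b):
    """pyr_a / pyr_b: feature pyramids ordered fine -> coarse."""
    h, w = pyr_a[-1].shape
    flow = np.zeros((h, w, 2))  # direct estimate at the coarsest level
    for level in range(len(pyr_a) - 2, -1, -1):
        up = upsample_flow(flow)
        # each finer level only has to predict a residual on top of `up`
        flow = up + predict_residual(pyr_a[level], pyr_b[level], up)
    return flow

flow = coarse_to_fine_flow(
    [np.zeros((32, 32)), np.zeros((16, 16)), np.zeros((8, 8))],
    [np.zeros((32, 32)), np.zeros((16, 16)), np.zeros((8, 8))],
)
```

Predicting only residuals keeps each level's task small even when the total displacement is large, which is why the scheme copes well with big motion.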
Fusion is the final stage of FILM. At each pyramid level, it concatenates the feature maps (and flows) aligned to time t and forwards them to a U-Net-like decoder that produces the final mid-frame at time t.
The model is trained with a Gram matrix (style) loss combined with L1 and VGG losses. To compute the Gram matrix loss, features are first extracted from the predicted and the ground-truth images with a pretrained network, and the Gram matrix is computed at each feature level. The per-level loss is the L2 norm of the difference between the two Gram matrices; summing these norms over all levels gives the final Gram matrix (style) loss.
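The computation above is compact enough to write out directly. A minimal NumPy sketch, assuming channels-last `(H, W, C)` feature maps; the pretrained feature extractor (e.g. VGG) is outside the snippet:

```python
import numpy as np

def gram_matrix(feats):
    """feats: (H, W, C) feature map from a pretrained network.
    Returns the C x C matrix of channel-wise feature correlations."""
    h, w, c = feats.shape
    f = feats.reshape(h * w, c)
    return f.T @ f / (h * w)

def gram_loss(pred_feats, gt_feats):
    """Sum over feature levels of the L2 (Frobenius) distance between the
    Gram matrices of predicted and ground-truth features."""
    return sum(
        np.linalg.norm(gram_matrix(p) - gram_matrix(g))
        for p, g in zip(pred_feats, gt_feats)
    )

f = np.random.rand(16, 16, 8)  # one toy feature level
```

Because the Gram matrix averages out spatial positions and keeps only channel correlations, matching it encourages the right textures and sharpness rather than pixel-exact alignment.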
Short Introduction
Frame interpolation is the synthesis of intermediate images between a pair of input frames. Digital photography, especially with the advent of smartphones, has made it effortless to take several pictures within a few seconds, and people naturally do so often, in their quest for just the right photo that captures the moment. These “near duplicates” create an exciting opportunity: interpolating between them can lead to surprisingly engaging videos that reveal scene (and some camera) motion, often delivering an even more pleasing sense of the moment than any one of the original photos.
A major challenge in frame interpolation, for frame-rate up-sampling and especially for near-duplicate photo interpolation (for which the temporal spacing can be a second or more), is handling large scene motion effectively. In this work, the authors present a unified, single-stage network for frame interpolation for large motion (FILM), trained from frames alone.
Qualitative Comparisons
To evaluate the effectiveness of the Gram matrix-based loss function in preserving image sharpness, we visually compare the results against images rendered with other methods.
The figure below compares frame interpolation methods on sharpness, with the inputs overlaid on the left (individual frames are available in the supplementary materials). SoftSplat shows artifacts on the fingers and ABME shows blurriness on the face, while FILM synthesizes visually superior results, with crisp detail on the face and well-preserved articulating fingers.
In frame interpolation, most disoccluded pixels should be visible in at least one of the input frames. A fraction of the pixels, depending on the complexity or magnitude of the motion, may be unavailable from either input.
Thus, to inpaint these pixels effectively, models must learn appropriate motions or hallucinate novel pixels. The picture below shows qualitative results of disocclusion inpainting. FILM inpaints disocclusions well and, thanks to the Gram-matrix-based loss, produces sharp image details, while SoftSplat and ABME produce blurry in-paintings or unnatural deformations. Compared to the other approaches, FILM correctly paints the missing pixels while maintaining sharpness, and it preserves the structure of objects, e.g. the red toy car, where SoftSplat shows deformation and ABME creates a blurry in-painting.
Large motion is one of the most challenging aspects of frame interpolation. To account for the expanded motion search range, models often resort to multi-scale approaches or dense feature maps to increase the model’s neural capacity. Other approaches specialize models by training on large-motion datasets. The picture below compares frame interpolation under large motion, with inputs with 100-pixel disparity overlaid on the left. Although both SoftSplat and ABME capture the motion on the dog’s nose, their results appear blurry and create a large artifact on the ground.
FILM’s strength shows here: it captures the motion well while keeping the background details sharp.
The picture below compares loss functions on FILM: L1 loss (left), L1 plus VGG loss (middle), and the proposed style loss (right), which shows significant sharpness improvements (green box).
We also tried it on our selfies:
Conclusions:
We presented a robust algorithm for large motion frame interpolation. FILM is trainable from frame triplets alone and does not depend on additional optical-flow or depth priors.
To the best of our knowledge, this is the first work that utilizes shared feature extraction and a Gram matrix loss for frame interpolation. FILM outperforms competing methods and handles large motion well.
By the RAC team (Popescu Cristina, Alin Voia, Robert Spinean)