Sunday, 27 February 2022

FILM: Frame Interpolation for Large Motion


    In this article we present FILM, a frame interpolation algorithm developed by Google researchers that synthesizes multiple intermediate frames from two input images with large in-between motion.
    Recent methods use multiple networks to estimate optical flow or depth, plus a separate network dedicated to frame synthesis. This is often complex and requires scarce optical-flow or depth ground truth. The architecture of FILM has three main stages.
    The scale-agnostic feature extraction stage produces feature pyramids whose features have similar meaning across scales. Extraction starts with a smaller pyramid of features built from each level of the input image pyramid; these features are then concatenated horizontally to produce the final scale-agnostic feature pyramid.
    The flow estimation stage uses a residual pyramid approach: estimation starts at the coarsest level, and at each finer level the network computes a residual correction to the estimate obtained by upsampling the coarser prediction. Weights are shared across all but the few finest levels of the pyramid.
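The coarse-to-fine residual refinement can be sketched as follows. This is a minimal NumPy illustration, not the actual FILM network: it assumes each pyramid level doubles the spatial resolution, and the residuals (which in FILM come from convolutional modules) are simply given as inputs.

```python
import numpy as np

def residual_flow_pyramid(residuals):
    """Coarse-to-fine flow from per-level residuals (coarsest first).

    residuals: list of arrays of shape (2, H_l, W_l), one per pyramid level,
    where each level doubles the spatial resolution of the previous one.
    """
    flow = residuals[0]                           # start from the coarsest estimate
    for r in residuals[1:]:
        # upsample by nearest-neighbour repetition and rescale the flow
        # magnitudes, since pixel displacements double at the finer level
        up = flow.repeat(2, axis=1).repeat(2, axis=2) * 2.0
        flow = up + r                             # add the residual correction
    return flow
```

In FILM the per-level residuals are predicted by a module with weights shared across all but the finest levels; here the function only shows how the corrections accumulate from coarse to fine.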
    Fusion is the final stage of the FILM model. At each pyramid level it concatenates the feature maps at time t, which are then fed to a U-Net-like decoder that produces the final mid-frame at time t.
    The model is trained with a Gram matrix (style) loss in combination with others such as L1 and VGG losses. To compute the Gram matrix loss, features are first extracted from the predicted and ground-truth images using a pretrained model, and a Gram matrix is computed at each level. The per-level loss is the L2 norm of the difference between the two Gram matrices; summing these norms over all levels gives the Gram matrix (style) loss.
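The loss computation just described can be written compactly. The following NumPy sketch is illustrative only (the paper extracts features with a pretrained VGG and trains in a deep learning framework); the function names and feature shapes are assumptions:

```python
import numpy as np

def gram_matrix(feats):
    # feats: (C, H, W) feature map from one level of a pretrained network
    c, h, w = feats.shape
    f = feats.reshape(c, h * w)
    return f @ f.T / (c * h * w)      # normalised channel-correlation matrix

def gram_style_loss(pred_feats, gt_feats):
    # pred_feats, gt_feats: lists of (C, H, W) arrays, one per feature level
    loss = 0.0
    for p, g in zip(pred_feats, gt_feats):
        # per-level loss: L2 (Frobenius) norm of the Gram matrix difference
        loss += np.linalg.norm(gram_matrix(p) - gram_matrix(g))
    return loss
```

Matching Gram matrices encourages the prediction to reproduce the feature statistics (textures, fine detail) of the ground truth, which is why this loss sharpens the interpolated frames.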

Short Introduction


    Frame interpolation is the synthesis of intermediate images between a pair of input frames. Digital photography, especially with the advent of smartphones, has made it effortless to take several pictures within a few seconds, and people naturally do so often in their quest for just the right photo that captures the moment. These “near duplicates” create an exciting opportunity: interpolating between them can lead to surprisingly engaging videos that reveal scene (and some camera) motion, often delivering an even more pleasing sense of the moment than any one of the original photos.
    A major challenge in frame interpolation, for frame-rate up-sampling and especially for near-duplicate photo interpolation (for which the temporal spacing can be a second or more), is handling large scene motion effectively. In this work, the authors present a unified, single-stage network for frame interpolation in large motion (FILM), trained from frames alone.


Qualitative Comparisons

    
    To evaluate the effectiveness of the Gram matrix-based loss function in preserving image sharpness, we visually compare the results against images rendered with other methods.
    The picture below compares frame interpolation methods on sharpness, with the inputs overlaid (left); individual frames are available in the supplementary materials. SoftSplat shows artifacts (the fingers) and ABME shows blurriness (the face).
    FILM synthesizes visually superior results, with crisp image details on the face, and preserves the articulating fingers.

    In frame interpolation, most of the occluded pixels should be visible in the input frames, but a fraction of the pixels, depending on the complexity or magnitude of the motion, may be unavailable from the inputs.
    Thus, to effectively inpaint these pixels, models must learn appropriate motions or hallucinate novel pixels. The picture below shows qualitative results on disocclusion inpainting. FILM inpaints disocclusions well and, thanks to the Gram matrix-based loss, creates sharp image details, while SoftSplat and ABME produce blurry inpaintings or unnatural deformations.
    FILM correctly paints the pixels while maintaining sharpness. It also preserves the structure of objects, e.g. the red toy car, while SoftSplat shows deformation and ABME creates blurry inpainting.

    Large motion is one of the most challenging aspects of frame interpolation. To account for the expanded motion search range, models often resort to multi-scale approaches or dense feature maps to increase the model’s neural capacity; other approaches specialize models by training on large-motion datasets. The picture below compares frame interpolation on a large motion, with inputs of 100-pixel disparity overlaid (left). Although both SoftSplat and ABME capture the motion on the dog’s nose, they appear blurry and create a large artifact on the ground.
    FILM’s strength is seen in capturing the motion well while keeping the background details.


    The picture below shows a loss-function comparison on FILM: L1 loss (left), L1 plus VGG loss (middle), and the proposed style loss (right), with significant sharpness improvements (green box).




We also tried it on our selfies:


Conclusions:

    We presented a robust algorithm for large motion frame interpolation. FILM is trainable from frame triplets alone and does not depend on additional optical-flow or depth priors.
    To the best of our knowledge, this is the first work that utilizes shared feature extraction and a Gram matrix loss for frame interpolation. FILM outperforms prior methods and handles large motion well.


Resources:

By RAC Team (Popescu Cristina, Alin Voia, Robert Spinean)

BANMo: Building Animatable 3D Neural Models from Many Casual Videos

    By Neko Team (Denisa Gal, Ioana Lazar)

BANMo is a method for reconstructing high-fidelity animatable 3D models of 
a subject, from a collection of casual videos. A major improvement from prior 
work is that it does not require a pre-defined shape template or pre-registered 
cameras.
 
 
BANMo builds the model for appearance, shape and even articulations, by 
consolidating cues from thousands of input images. So how exactly does 
it work?

Initially, the shape and appearance of the object are represented in a time-invariant rest pose, which is then mapped to time-dependent deformations.

 

To achieve this, each 3D point in space is associated with three properties: colour, density, and a canonical embedding, an encoding which maps a point to a feature descriptor kept by the network. This is what helps match pixels from different viewpoints and at different time instances, and create correspondences across frames and videos.
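To illustrate how such embeddings create correspondence, pixels can be matched by the similarity of their canonical descriptors. This toy NumPy sketch is not BANMo's implementation (the names and shapes are assumptions, and real descriptors come from the optimised network); it only shows the matching idea:

```python
import numpy as np

def match_by_embedding(emb_a, emb_b):
    # emb_a: (N, D) canonical embeddings of pixels from frame A
    # emb_b: (M, D) canonical embeddings of pixels from frame B
    # returns, for each pixel in A, the index of its most similar pixel in B
    sims = emb_a @ emb_b.T    # dot-product similarity (cosine if rows are unit-norm)
    return sims.argmax(axis=1)
```

Because the embedding is tied to the time-invariant rest pose, two pixels that observe the same surface point in different frames map to nearby descriptors and are matched, regardless of viewpoint or articulation.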

The method’s optimisation comes from minimising three types of losses: reconstruction losses (colour, silhouette, optical flow), feature registration losses (enforcing 3D point prediction via canonical embeddings), and a 3D cycle-consistency regularisation loss.

BANMo manages to tackle two major challenges: the high volume of input data, and handling the free movement of the subject and camera without any assumptions. Moreover, it is capable of improving the reconstruction given more data.

 

Given these, it has proven itself better than previous methods, both through enhanced detailed geometry (which ViSER lacks) and better reconstruction of motion (which Nerfies fails to render):

 

 

The results are high-fidelity animatable representations of the subject, which demonstrate the ability of this method to recover large articulations, reconstruct fine geometry, and render realistic images from novel viewpoints and poses.



Bibliography:

https://paperswithcode.com/paper/banmo-building-animatable-3d-neural-model

https://arxiv.org/pdf/2112.12761v2.pdf

https://banmo-www.github.io/

 





DeepDrug: A general graph-based deep learning framework for drug relation prediction

    By Powerpuff Girls (Carmina Dinulescu, Diana Groza, Cosmina Sas)

    The search for biomedical relations between chemical compounds (drugs, molecules) and protein targets is an important part of drug discovery.

    Through the validation and prediction of drug-drug interactions (DDIs), researchers can improve the therapeutic efficacy of various drugs. Unfortunately, these interactions can also lead to the development of potentially harmful side effects. Early detection of potential issues with drug-drug interactions can help prevent the development of potentially harmful drugs.

    Although in vitro experiments are widely used for the prediction of biochemical interactions, their reliability and cost-effectiveness are still limited by their complexity and time-consuming nature. In silico approaches have received more attention due to their increasing accuracy and cost-effectiveness.

    Because drug structures can be naturally represented as graphs (with nodes and edges denoting chemical atoms and bonds, respectively), and protein structures can also be represented as graphs (with nodes and edges denoting amino acids and biochemical interactions, respectively), we can use graph-based models to predict drug outcomes.

    “In this work, we propose DeepDrug, a novel end-to-end deep learning framework for DDI and DTI predictions. DeepDrug takes in both drug SMILES strings and protein PDB (Protein Data Bank) inputs to characterize biochemical entities into graphical representations and utilizes GCNs to learn latent feature representations that give superior level of accuracy for predictive modeling. The competitive edge of graph-based architecture allows DeepDrug to incorporate both DDI and DTI predictions into a general framework. It also empowers DeepDrug to be applied to novel entities whose graphical representations can be extracted. Overall, through extensive experiments on existing DDI and DTI datasets and detailed comparison with other published methods, we demonstrate the promising performance of DeepDrug in drug-related interaction prediction tasks.”
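The core building block mentioned in the quote, a graph convolution over a molecular graph, can be sketched as follows. This is a generic GCN layer in NumPy, not DeepDrug's exact architecture; the names, shapes, and ReLU activation are assumptions for illustration:

```python
import numpy as np

def gcn_layer(adj, x, w):
    # adj: (N, N) adjacency matrix of the molecular graph (no self-loops),
    #      e.g. nodes are atoms and edges are chemical bonds
    # x:   (N, F_in) node features; w: (F_in, F_out) learnable weights
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalisation
    return np.maximum(a_norm @ x @ w, 0.0)        # aggregate, project, ReLU
```

Stacking several such layers and pooling the node features yields a fixed-size embedding of the whole graph, which a downstream classifier can then use for DDI or DTI prediction.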

 

    The results of this research suggest that DeepDrug can be used not just to predict drug relationships, but also to uncover drug interaction processes. DeepDrug has demonstrated its effectiveness in a variety of DDI and DTI prediction tasks, although there is still potential for development.


Bibliography 

https://www.biorxiv.org/content/biorxiv/early/2020/11/10/2020.11.09.375626.full.pdf?fbclid=IwAR2VdEeLmcRraTeqKFEaK-Jp-fl-oMwhNTPImkvMekKA1PkhnbMS_vks7iw

https://www.youtube.com/watch?v=QrcO2i0dEHE


Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer

1. Introduction           Artistic portraits are popular in our daily lives and especially in industries related to comics, animations, post...