Sunday, March 6, 2022

Multimodal Conditional Image Synthesis with Product-of-Experts GANs

Previous papers synthesized images from a single input. This paper presents a model that combines multimodal user inputs to create an image. The supported input types are text, segmentation, sketch, and style reference, each adding a constraint that the synthesized image must satisfy.

Product of experts

The model captures the image distribution conditioned on an arbitrary subset of the possible input modalities:

p(x | Y), Y ⊆ {y1, y2, ..., yM}

where x is an image paired with the input modalities and Y is the subset of modalities. The generator must model all 2^M conditional distributions, including the unconditional image distribution when Y is the empty set. The input modalities experimented with are text, semantic segmentation, sketch, and style reference, but more can be easily incorporated.
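The 2^M conditioning subsets can be made concrete with a short enumeration. This is an illustrative sketch (the modality names match the paper; the enumeration itself is just standard `itertools`), counting every subset the generator must handle, including the empty set for unconditional generation:

```python
from itertools import combinations

modalities = ["text", "segmentation", "sketch", "style"]

# all 2**M conditioning subsets, from the empty set (unconditional
# generation) up to the full set of modalities
subsets = [c for r in range(len(modalities) + 1)
           for c in combinations(modalities, r)]

print(len(subsets))  # 2**4 = 16
```

With M = 4 modalities there are 16 distributions, which is why the paper trains one generator over all subsets rather than one model per combination.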


Architecture

PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator.

Each modality is encoded into a feature vector, and the features are aggregated in the Global PoE-Net. The segmentation and sketch maps are encoded using a convolutional network with input skip connections, the style image is encoded using a residual network, and CLIP is used for encoding text. The decoder is composed of a stack of residual blocks.
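The aggregation in the Global PoE-Net can be sketched as a product of Gaussian experts. Below is a minimal NumPy illustration of that fusion rule, assuming each modality encoder outputs the mean and standard deviation of a diagonal Gaussian over a shared latent (the latent size and encoder outputs here are toy placeholders; in the paper they are learned):

```python
import numpy as np

LATENT_DIM = 8  # illustrative latent size


def poe_fuse(experts, dim=LATENT_DIM):
    """Multiply Gaussian experts N(mu_i, diag(sigma_i^2)) together with an
    N(0, I) prior. For Gaussians, precisions add, and the fused mean is the
    precision-weighted average of the expert means."""
    prec = np.ones(dim)       # the prior contributes precision 1
    weighted = np.zeros(dim)  # the prior mean is 0
    for mu, sigma in experts:
        p = 1.0 / sigma**2
        prec += p
        weighted += p * mu
    return weighted / prec, np.sqrt(1.0 / prec)


# with no modality present, the fusion falls back to the prior N(0, I)
mu, sigma = poe_fuse([])

# two equally confident experts pull the fused mean between their means
mu2, _ = poe_fuse([(np.zeros(LATENT_DIM), np.ones(LATENT_DIM)),
                   (np.full(LATENT_DIM, 3.0), np.ones(LATENT_DIM))])
```

This is what lets a single generator accept any subset of inputs: absent modalities simply contribute no expert, and the fused distribution smoothly reduces toward the prior.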



The architecture of the generator with the Global PoE-Net and the decoder


A limitation: the model does not work well when conditioned on contradictory multimodal inputs; in that case one of the input modalities (usually text) is ignored.


Conclusions

Image synthesis networks give people a way to express themselves and create digital content. At the same time, this technology can be used to create fake images and spread visual misinformation.


PoE-GAN learns to synthesize images with high quality and diversity from multiple inputs, but it is also state of the art at generating from a single input: used with a single input modality, it outperforms earlier unimodal networks.


Our experiments:

We chose the segmentation, sketch, and text below as inputs, and we show the output for all combinations of modalities:


Text: river with rocks on the bank under grass fields with a few trees and a clear sky
Sketch: (input image)
Segmentation: (input map)


Text:

Sketch:


Segmentation:


Text + Sketch:


Text + Segmentation:


Sketch + Segmentation:


Try it out at: http://gaugan.org/gaugan2/

Bibliography

  1. X. Huang, A. Mallya, T.-C. Wang, and M.-Y. Liu, ‘Multimodal Conditional Image Synthesis with Product-of-Experts GANs’, arXiv:2112.05130 [cs], Dec. 2021.

  2. ‘Weights & Biases’, W&B.


