Sunday, March 6, 2022

Multimodal Conditional Image Synthesis with Product-of-Experts GANs

Previous papers synthesized images from a single input. This paper presents a model that combines multimodal user inputs to create an image. The supported input types are text, segmentation, sketch, and style reference, each adding a constraint that the synthesized image must satisfy.

Product of experts

The model captures the image distribution conditioned on an arbitrary subset of the possible input modalities:

p(x | Y), Y ⊆ {y1, y2, ..., yM}

where x is an image paired with the input modalities and Y is the subset of modalities. The generator must model all 2^M conditional distributions, including the unconditional image distribution when Y is the empty set. The input modalities experimented with are text, semantic segmentation, sketch, and style reference, but more can be easily incorporated.
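The 2^M conditioning subsets can be made concrete with a short enumeration. This is an illustrative sketch (the modality names match the paper; the enumeration itself is just standard `itertools`), counting every subset the generator must handle, including the empty set for unconditional generation:

```python
from itertools import combinations

modalities = ["text", "segmentation", "sketch", "style"]

# all 2**M conditioning subsets, from the empty set (unconditional
# generation) up to the full set of modalities
subsets = [c for r in range(len(modalities) + 1)
           for c in combinations(modalities, r)]

print(len(subsets))  # 2**4 = 16
```

With M = 4 modalities there are 16 distributions, which is why the paper trains one generator over all subsets rather than one model per combination.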


Architecture

PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator.

Each modality is encoded into a feature vector, and the features are aggregated in the Global PoE-Net. The segmentation and sketch maps are encoded using a convolutional network with input skip connections, the style image is encoded using a residual network, and CLIP is used for encoding text. The decoder is composed of a stack of residual blocks.
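The aggregation in the Global PoE-Net can be sketched as a product of Gaussian experts. Below is a minimal NumPy illustration of that fusion rule, assuming each modality encoder outputs the mean and standard deviation of a diagonal Gaussian over a shared latent (the latent size and encoder outputs here are toy placeholders; in the paper they are learned):

```python
import numpy as np

LATENT_DIM = 8  # illustrative latent size


def poe_fuse(experts, dim=LATENT_DIM):
    """Multiply Gaussian experts N(mu_i, diag(sigma_i^2)) together with an
    N(0, I) prior. For Gaussians, precisions add, and the fused mean is the
    precision-weighted average of the expert means."""
    prec = np.ones(dim)       # the prior contributes precision 1
    weighted = np.zeros(dim)  # the prior mean is 0
    for mu, sigma in experts:
        p = 1.0 / sigma**2
        prec += p
        weighted += p * mu
    return weighted / prec, np.sqrt(1.0 / prec)


# with no modality present, the fusion falls back to the prior N(0, I)
mu, sigma = poe_fuse([])

# two equally confident experts pull the fused mean between their means
mu2, _ = poe_fuse([(np.zeros(LATENT_DIM), np.ones(LATENT_DIM)),
                   (np.full(LATENT_DIM, 3.0), np.ones(LATENT_DIM))])
```

This is what lets a single generator accept any subset of inputs: absent modalities simply contribute no expert, and the fused distribution smoothly reduces toward the prior.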



The architecture of the generator with the Global PoE-Net and the decoder


A limitation: the model does not work well when conditioned on contradictory multimodal inputs; in that case one of the input modalities (usually text) is ignored.


Conclusions

Image synthesis networks give people a way to express themselves and create digital content. At the same time, this technology can be used to create fake images and spread visual misinformation.


PoE-GAN learns to synthesize images with high quality and diversity from multiple inputs, but it is also state of the art at generating from a single input: used with a single input modality, it outperforms earlier unimodal networks.


Our experiments:

We chose the segmentation, sketch, and text below as inputs, and we show the output for all combinations of modalities:


Text: river with rocks on the bank under grass fields with a few trees and a clear sky
Sketch: (input image)
Segmentation: (input map)


Text:

Sketch:


Segmentation:


Text + Sketch:


Text + Segmentation:


Sketch + Segmentation:


Try it out at: http://gaugan.org/gaugan2/

Bibliography

  1. X. Huang, A. Mallya, T.-C. Wang, and M.-Y. Liu, ‘Multimodal Conditional Image Synthesis with Product-of-Experts GANs’, arXiv:2112.05130 [cs], Dec. 2021.

  2. ‘Weights & Biases’, W&B.


