Previous papers synthesized images from a single input modality. This paper presents a model that combines multiple user inputs to synthesize an image. The supported input types are text, segmentation, sketch, and style reference, each adding a constraint that the synthesized image must satisfy.
Product of experts
The model captures the image distribution conditioned on an arbitrary subset of the possible input modalities:
p(x | Y), Y ⊆ {y1, y2, ..., yM}
where x is an image and Y is a subset of the M available modalities. The generator must model all 2^M conditional distributions, including the unconditional image distribution obtained when Y is the empty set. The modalities experimented with are text, semantic segmentation, sketch, and style reference, but more can easily be incorporated.
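In a product of experts, each modality contributes a Gaussian expert over a shared latent code, and the product of Gaussians is again a Gaussian whose precision is the sum of the experts' precisions. A minimal NumPy sketch of that fusion rule (the function name and the toy expert parameters are illustrative, not the paper's code):

```python
import numpy as np

def product_of_gaussians(mus, logvars):
    """Fuse Gaussian experts: the product's precision is the sum of the
    experts' precisions, and its mean is the precision-weighted mean."""
    precisions = [np.exp(-lv) for lv in logvars]
    total_prec = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total_prec
    var = 1.0 / total_prec
    return mu, var

# A prior expert N(0, I) plus two toy modality experts over a 4-dim latent.
prior_mu, prior_logvar = np.zeros(4), np.zeros(4)
text_mu, text_logvar = np.ones(4), np.full(4, -1.0)
seg_mu, seg_logvar = -np.ones(4), np.full(4, 0.5)

mu, var = product_of_gaussians([prior_mu, text_mu, seg_mu],
                               [prior_logvar, text_logvar, seg_logvar])
```

Note that adding an expert can only increase the total precision, so each extra conditioning input narrows the fused distribution.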
Architecture
PoE-GAN consists of a product-of-experts generator and a multimodal multiscale
projection discriminator.
Each modality is encoded into a feature vector, and the features are aggregated by the Global PoE-Net. The segmentation and sketch maps are encoded with a convolutional network with input skip connections, the style image is encoded with a residual network, and the text is encoded with CLIP. The decoder is a stack of residual blocks.
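The generator flow can be sketched end to end: encode each modality that is present, fuse its Gaussian expert with the N(0, I) prior, sample a latent, and decode. The sketch below is a hypothetical NumPy stand-in (the encoders and the decoder are placeholders, not the paper's networks); its point is that omitting a modality simply removes its expert from the product, so Y = {} leaves only the prior:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 8

# Hypothetical stand-in encoders: each maps its input to the (mu, logvar)
# parameters of a Gaussian expert over the latent code.
def encode_text(tokens):
    return np.tanh(np.full(LATENT, len(tokens) * 0.1)), np.zeros(LATENT)

def encode_sketch(img):
    return np.full(LATENT, img.mean()), np.full(LATENT, -0.5)

def decode(z):
    # Placeholder decoder; the real one is a stack of residual blocks.
    return np.outer(z, z)

def generator(experts):
    """Encode-fuse-decode sketch: fuse the present modality experts with
    the N(0, I) prior, sample z from the product, then decode it."""
    prec, mu_acc = np.ones(LATENT), np.zeros(LATENT)  # prior expert
    for mu, logvar in experts:
        p = np.exp(-logvar)
        prec, mu_acc = prec + p, mu_acc + p * mu
    z = mu_acc / prec + rng.standard_normal(LATENT) / np.sqrt(prec)
    return decode(z)

# Conditioning on {text, sketch}; passing [] corresponds to Y = {}.
img = generator([encode_text("a snowy mountain".split()),
                 encode_sketch(rng.random((4, 4)))])
```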
Figure: the architecture of the generator, showing the Global PoE-Net and the decoder.
A limitation: the model fails when conditioned on contradictory multimodal inputs; one of the input modalities (usually text) is simply ignored.
Conclusions
Image synthesis networks give people a new way to express themselves and create digital content. At the same time, the technology can be used to create fake images and spread visual misinformation.
PoE-GAN learns to synthesize images with high quality and diversity from multiple inputs, and it is also state of the art at generating from a single input, outperforming earlier unimodal networks.
Our experiments:
We chose segmentation, sketch, and text inputs and show the output for every combination of modalities:
Try it out at: http://gaugan.org/gaugan2/
Bibliography
X. Huang, A. Mallya, T.-C. Wang, and M.-Y. Liu, ‘Multimodal Conditional Image Synthesis with Product-of-Experts GANs’, arXiv:2112.05130 [cs], Dec. 2021.
‘Weights & Biases’, W&B.