4.3. Experiments and Results

We evaluated two reconstruction tasks. First, an unsupervised perceptual reconstruction task, similar to (Johnson et al., 2016; Ledig et al., 2017), which, in contrast to the pixel-wise MSE-based reconstruction task, aims to preserve only high-level perceptual features. Second, a supervised semantic boundary reconstruction task, to evaluate the additional value of using labeled supervision to specify which information needs to remain preserved in the reconstructions.

The perceptual reconstruction task was formulated with the aim of minimizing higher-level perceptual differences between the reconstruction and the input image. These more abstract perceptual differences are defined in feature space, as opposed to the more explicit per-pixel differences used in the previous experiments. The feature loss is defined as

$$L_P = \frac{1}{NK} \sum_{n=1}^{N} \sum_{k=1}^{K} \left\| \varphi_k^d\left(x^{(n)}\right) - \varphi_k^d\left(\hat{x}^{(n)}\right) \right\|^2 \tag{4.4}$$

where $N$ is the number of training examples, $\varphi^d(x^{(n)})$ and $\varphi^d(\hat{x}^{(n)})$ are the $d$-th layer feature maps extracted from the input image and the reconstructed image using the VGG16 model pre-trained on the ImageNet dataset (Simonyan & Zisserman, 2015), and $K$ is the number of feature maps in that layer. For lower values of $d$, the perceptual loss reveals more explicit differences in low-level features such as intensities and edges, whereas for higher values of $d$ it focuses on more abstract textural or conceptual differences between the input and the reconstruction (Zeiler & Fergus, 2014). We chose $d = 3$ as the optimal depth for the feature loss, based on a comparison between different values (see Figure 4.5).

In the supervised semantic boundary reconstruction task, the objective was not to minimize the differences between the reconstruction and the input image. Instead, we aimed to provide labeled supervision that guides the model towards preserving information about semantically defined boundaries. Here, the objective was to minimize the differences between the output prediction of the decoder and processed semantic segmentation target labels from the ADE20K dataset. The semantic segmentation labels provided with the dataset were converted to a binary segmentation map representing the boundaries between semantic categories (i.e., boundary pixels have a value of 1 and non-boundary pixels a value of 0). The reconstruction loss was formalized as the weighted binary cross-entropy (BCE), defined as

$$L_B = -\frac{1}{NJ} \sum_{n=1}^{N} \sum_{j=1}^{J} \left( w\, z_j^{(n)} \log\left(\hat{x}_j^{(n)}\right) + (1 - w)\left(1 - z_j^{(n)}\right) \log\left(1 - \hat{x}_j^{(n)}\right) \right) \tag{4.5}$$

where $z_j^{(n)}$ is the ground-truth boundary segmentation label for pixel $j$ of example $n$, $J$ is the number of pixels per example, and $w$ is a constant introduced to counterbalance the unequal distribution of non-boundary compared to boundary pixels. In our experiments, $w$ was set to 0.925, matching the inverse ratio of boundary pixels to non-boundary pixels. For both the perceptual loss and the BCE loss, we included a sparsity loss as in experiment 2.

In addition to the three training conditions for our proposed end-to-end model, we included separate conditions in which the decoder was trained on SPV representations generated using conventional image processing, to facilitate a quantitative comparison. These reference SPV images were generated using Canny edge detection.
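As a minimal sketch of how the perceptual loss of Eq. (4.4) can be computed, the following PyTorch fragment extracts layer-$d$ feature maps from an ImageNet pre-trained VGG16 and averages the squared feature differences over examples and feature maps. The slice boundary standing in for $d = 3$ (up to relu3_3) and the function names are illustrative assumptions, not the original implementation.

```python
# Illustrative sketch of the perceptual loss of Eq. (4.4); the VGG16 slice
# chosen to represent layer depth d = 3 (up to relu3_3) is an assumption.
from torchvision.models import vgg16

# Frozen feature extractor: ImageNet pre-trained VGG16 up to relu3_3.
phi_d = vgg16(pretrained=True).features[:16].eval()
for p in phi_d.parameters():
    p.requires_grad = False

def perceptual_loss(x, x_hat):
    """Squared distance between layer-d feature maps of the input batch x
    and the reconstruction batch x_hat, averaged over the N examples and
    the K feature maps (the 1/(NK) normalization of Eq. 4.4)."""
    feats_x = phi_d(x)          # shape (N, K, H, W)
    feats_x_hat = phi_d(x_hat)  # gradients reach the decoder through x_hat
    sq_dist = (feats_x - feats_x_hat).pow(2).sum(dim=(2, 3))  # (N, K)
    return sq_dist.mean()
```

Freezing the extractor's parameters keeps the VGG16 weights fixed while still letting gradients flow back through the network into the reconstruction, which is what allows the decoder to be trained against this loss.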
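The boundary targets and the weighted BCE of Eq. (4.5) could be implemented along the following lines. The boundary-extraction rule used here (marking a pixel whenever a horizontal or vertical neighbour carries a different semantic label) is an assumed stand-in, since the text does not specify the exact conversion operator; the function names are hypothetical.

```python
import torch

def labels_to_boundaries(seg):
    """Convert integer segmentation labels (N, H, W) into a binary boundary
    map: 1 where adjacent pixels belong to different semantic categories,
    0 elsewhere. The neighbourhood rule is an assumption."""
    b = torch.zeros(seg.shape, dtype=torch.float32)
    b[:, 1:, :] += (seg[:, 1:, :] != seg[:, :-1, :]).float()  # vertical edges
    b[:, :, 1:] += (seg[:, :, 1:] != seg[:, :, :-1]).float()  # horizontal edges
    return (b > 0).float()

def weighted_bce(x_hat, z, w=0.925):
    """Weighted binary cross-entropy of Eq. (4.5); w = 0.925 counterbalances
    the scarcity of boundary pixels (z = 1) relative to non-boundary pixels."""
    eps = 1e-7  # clamp predictions away from 0 and 1 for a stable log
    x_hat = x_hat.clamp(eps, 1 - eps)
    loss = w * z * torch.log(x_hat) + (1 - w) * (1 - z) * torch.log(1 - x_hat)
    return -loss.mean()  # mean over all N * J pixels, i.e. the 1/(NJ) factor
```

With these pieces, the decoder output could be trained against `labels_to_boundaries(seg)` together with the sparsity term from experiment 2.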