Figure 5.7: Results of training the end-to-end pipeline on video sequences from the moving MNIST dataset (Srivastava et al., 2015). Columns indicate different frames. Top row: the input frames; middle row: the simulated phosphene representations; bottom row: the decoded reconstructions of the input.

stimulator hardware. For instance, rather than continuous stimulation intensities, it is likely that the stimulator will allow for stimulation with only a limited number of discrete amplitudes. To evaluate whether our end-to-end pipeline can be harnessed to optimize the encoding in a constrained context, we performed a second training run (the constrained condition) in which we reconfigured the encoder to output 10 discrete values between 0 and 128 µA. We used straight-through estimation with a smooth staircase function to estimate the gradients during backpropagation. To compensate for the relative sparsity of phosphenes in the SPV representation and to improve training stability, we took the regularization loss as the MSE between the pixel brightness at the phosphene centers and the corresponding pixel brightness in the input image. Furthermore, to encourage a close spatial correspondence with the input stimuli, we set the relative weights of the reconstruction loss and the regularization loss to 0.00001 and 0.99999, respectively. Note that the regularization loss only promotes similarity between the phosphene encoding and the input; the decoder is unaffected by it. The results of the safety-constrained training after six epochs are visualized in Figure 5.8. Note that, overall, the resulting phosphenes are less bright and smaller due to the lower stimulation amplitudes. Nevertheless, the decoder is able to accurately reconstruct the original input. One limitation is that we did not test the subjective interpretability for human observers.

As not all information in the scene is equally important, it may be informative to further constrain the phosphene representation to encode specific task-relevant features. In a third training run (the supervised boundary condition) we validated whether our simulator can be used in a supervised machine learning pipeline for the reconstruction of specific target features, such as object boundaries. Instead of using the input image as a reference, the MSE is now computed between the reconstruction and a ground-truth target image, and between the pixel brightness at the phosphene centers and the corresponding pixel brightness in the target image. The ground-truth semantic boundary targets were obtained by performing Canny edge detection and subsequent line thickening on the semantic segmentation labels provided with the dataset. The results after training for
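To illustrate the discrete-amplitude encoding described for the constrained condition, a minimal PyTorch sketch is given below. It quantizes encoder outputs to 10 amplitudes between 0 and 128 µA and uses the derivative of a smooth, sigmoid-based staircase as a straight-through-style surrogate gradient; the steepness constant K and the exact surrogate shape are illustrative assumptions, not the implementation used in the experiments.

```python
import torch

N_LEVELS = 10      # number of discrete stimulation amplitudes
MAX_AMP = 128.0    # maximum stimulation amplitude (microamperes)
STEP = MAX_AMP / (N_LEVELS - 1)
K = 10.0           # steepness of the smooth staircase surrogate (assumed value)


class DiscreteAmplitude(torch.autograd.Function):
    """Hard-quantize encoder outputs to N_LEVELS amplitudes in [0, MAX_AMP].
    The backward pass uses the derivative of a smooth, sigmoid-based
    staircase as a straight-through-style surrogate gradient."""

    @staticmethod
    def forward(ctx, x):
        x = x.clamp(0.0, MAX_AMP)
        ctx.save_for_backward(x)
        return torch.round(x / STEP) * STEP

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        z = (x / STEP) % 1.0                 # position within the current quantization step
        s = torch.sigmoid(K * (z - 0.5))
        surrogate = K * s * (1.0 - s)        # derivative of the sigmoid transition
        return grad_output * surrogate


# Example usage: stimulation = DiscreteAmplitude.apply(encoder_output)
```

In the forward pass the encoder output is mapped onto the hard amplitude levels; in the backward pass the surrogate lets gradients flow through the quantization step instead of being zeroed by the rounding operation.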
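The weighted objective of the constrained condition can likewise be sketched. Assuming a hypothetical boolean mask (center_mask) that marks the pixel at the center of each simulated phosphene, one possible formulation with the weights reported above is:

```python
import torch.nn.functional as F

def combined_loss(reconstruction, input_image, phosphene_image, center_mask,
                  w_recon=0.00001, w_reg=0.99999):
    """Weighted combination of the decoder reconstruction loss and the
    phosphene-center regularization loss.

    center_mask is a hypothetical boolean tensor (same shape as the images)
    that is True at the center pixel of every simulated phosphene.
    """
    # Reconstruction loss: decoder output versus the original input frame.
    recon_loss = F.mse_loss(reconstruction, input_image)
    # Regularization loss: phosphene brightness at each center should match
    # the brightness of the input image at the same location.
    reg_loss = F.mse_loss(phosphene_image[center_mask], input_image[center_mask])
    return w_recon * recon_loss + w_reg * reg_loss
```

Only the reconstruction term backpropagates through the decoder, which is consistent with the note above that the decoder is unaffected by the regularization loss.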
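Finally, the boundary-target generation for the supervised boundary condition could look roughly as follows, assuming OpenCV is used; the Canny thresholds and the dilation kernel size are illustrative choices rather than the exact parameters applied to the dataset.

```python
import cv2
import numpy as np

def boundary_target(segmentation_labels, thickness=3):
    """Build a semantic-boundary target image from a segmentation label map:
    Canny edge detection followed by dilation to thicken the boundary lines."""
    labels_u8 = segmentation_labels.astype(np.uint8)
    # With integer class labels, any change between neighbouring labels produces
    # a gradient, so low Canny thresholds mark all class boundaries as edges.
    edges = cv2.Canny(labels_u8, 1, 1)
    kernel = np.ones((thickness, thickness), np.uint8)
    return cv2.dilate(edges, kernel, iterations=1) / 255.0  # binary target in {0, 1}
```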
