4. End-to-end optimization of prosthetic vision

tested, as explained below. To quantify reconstruction performance, we report suitable image quality assessment measures. Unless stated otherwise, we report the mean squared error (MSE), the structural similarity index (SSIM) (Wang et al., 2004), and either the peak signal-to-noise ratio (PSNR) or the feature similarity index (FSIM) (Zhang et al., 2011) between the reconstruction and the input image, as evaluated using the Scikit-image library (version 0.16.2) for Python (van der Walt et al., 2014). Whereas MSE and PSNR are image quality assessment metrics that operate on pixel intensity, SSIM and FSIM are popular alternatives that better reflect perceptual quality (Preedanan et al., 2018). In addition to these performance metrics, we report the average percentage of activated electrodes as a measure of sparsity. Furthermore, to visualize the subjective quality of encodings and reconstructions, we display a subset of images with the corresponding SPV representations and image reconstructions.

4.3.1. Training Procedure

The end-to-end model was implemented in PyTorch version 1.3.1, using an NVIDIA GeForce GTX 1080 Ti graphics processing unit (GPU) with CUDA driver version 10.2. The trainable parameters of our model were updated using the Adam optimizer (Kingma & Ba, 2015). The end-to-end model was treated as a single system, i.e., all components of the model were trained simultaneously. To account for potential convergence of the model parameters towards local optima (i.e., to reduce the likelihood that a suitable parameter configuration of the network is missed due to a combination of a specific weight initialization and learning rate), a ‘random restarts’ approach was used. That is, each model was trained five times, each time starting from a different random weight initialization.
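In outline, this ‘random restarts’ procedure can be sketched in PyTorch as follows. The model constructor, data loaders, learning rate, and epoch count below are hypothetical placeholders, not the exact configuration used in the experiments; the sketch only illustrates training several randomly initialized copies and keeping the one with the lowest validation loss.

```python
import torch
import torch.nn as nn


def train_with_random_restarts(make_model, train_loader, val_loader,
                               n_restarts=5, n_epochs=40, lr=1e-3,
                               device="cpu"):
    """Train `n_restarts` models from scratch, each with a fresh random
    weight initialization, and return the one with the lowest
    validation loss (a pixel-wise MSE reconstruction loss)."""
    criterion = nn.MSELoss()
    best_model, best_val_loss = None, float("inf")

    for restart in range(n_restarts):
        model = make_model().to(device)  # fresh random initialization
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)

        for epoch in range(n_epochs):
            model.train()
            for (x,) in train_loader:
                x = x.to(device)
                optimizer.zero_grad()
                # reconstruction loss between input and model output
                loss = criterion(model(x), x)
                loss.backward()
                optimizer.step()

        # evaluate this restart on the validation set
        model.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for (x,) in val_loader:
                x = x.to(device)
                total += criterion(model(x), x).item() * x.size(0)
                n += x.size(0)
        val_loss = total / n

        if val_loss < best_val_loss:
            best_model, best_val_loss = model, val_loss

    return best_model, best_val_loss
```

Because every restart reinitializes the network before optimization, a poor initialization in one run does not prevent a better parameter configuration from being found in another.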
The results only show the best performing of these five models (i.e., the one with the lowest loss on the validation dataset), unless stated otherwise.

4.3.2. Experiment 1

The objective of the first experiment was to test the basic ability of the proposed end-to-end model to encode a stimulus into regular, binary phosphene representations, and to decode these into accurate reconstructions of the input image. For this purpose, we trained the model on a self-generated dataset containing basic black-and-white images with a randomly positioned lowercase alphabetic character. Each image in the dataset is 128 × 128 pixels and contains a randomly selected character, displayed in one of 47 fonts (38 fonts for the training dataset, 9 fonts for the validation dataset). The model was trained to minimize the (pixel-wise) mean squared error

\[
\mathcal{L}_I = \frac{1}{N} \sum_{n=1}^{N} \left\| x^{(n)} - \hat{x}^{(n)} \right\|^2 \tag{4.1}
\]

between the intensities of the input image $x^{(n)}$ and the output reconstruction $\hat{x}^{(n)}$ over all training examples $1 \leq n \leq N$. The results of experiment 1 for the best performing model out of the five random restarts are displayed in Figure 4.3. As can be observed, the reconstruction loss is successfully minimized until convergence after 39 epochs. The performance metrics on the validation dataset indicate that the model is capable of adequately reconstructing the input image from the generated SPV representation (MSE = 0.018, SSIM = 0.139, PSNR

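As a minimal sketch, the reported image quality metrics (MSE, SSIM, and PSNR) can be computed with scikit-image roughly as below; the two images here are random placeholders standing in for an input image and its reconstruction, not actual dataset samples.

```python
import numpy as np
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

rng = np.random.default_rng(0)

# Placeholder "input image" and a noisy "reconstruction" of it,
# both with intensities in [0, 1].
original = rng.random((128, 128))
reconstruction = np.clip(
    original + 0.05 * rng.standard_normal((128, 128)), 0.0, 1.0)

mse = mean_squared_error(original, reconstruction)
psnr = peak_signal_noise_ratio(original, reconstruction, data_range=1.0)
ssim = structural_similarity(original, reconstruction, data_range=1.0)
```

Passing `data_range=1.0` tells scikit-image the dynamic range of the float images explicitly, which PSNR and SSIM both depend on.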
RkJQdWJsaXNoZXIy MTk4NDMw