Prototyping Phosphene Vision
Simulation-based optimization of visual neuroprosthetics using deep learning
Colophon
The research presented in this thesis was carried out at Radboud University, Donders Institute for Brain, Cognition and Behaviour, with financial support from the following grants of the Dutch Organization for Scientific Research (NWO): 'NESTOR' (STW Grant Number P15-42) and 'INTENSE' (cross-over Grant Number 17619).
Printed by: Ridderprint, www.ridderprint.nl
Cover by: Elze Wolfs & Jaap de Ruyter van Steveninck
© 2024 Jaap de Ruyter van Steveninck
Prototyping Phosphene Vision
Simulation-based optimization of visual neuroprosthetics using deep learning

Doctoral thesis
to obtain the degree of doctor from Radboud University Nijmegen,
on the authority of the rector magnificus prof. dr. J.M. Sanders,
according to the decision of the Doctorate Board,
to be defended in public on Thursday, 27 June 2024, at 12:30 hours precisely,

by

Jaap de Ruyter van Steveninck
born on 17 August 1992 in Quezon City (the Philippines)
Supervisors:
Prof. dr. R.J.A. van Wezel
Prof. dr. M.A.J. van Gerven

Co-supervisor:
Dr. U. Güçlü

Manuscript committee:
Prof. dr. R.J. van Lier
Dr. X. Chen (University of Pittsburgh, United States)
Prof. dr. E. Fernández Jover (Universidad Miguel Hernández, Spain)
Contents

Preface
1 Introduction
  1.1 Background
    1.1.1 Core components and mechanism of action
    1.1.2 Phosphene perception
    1.1.3 Limitations of phosphene vision
    1.1.4 The relevance of scene simplification
    1.1.5 Deep neural networks for prosthetic vision
    1.1.6 Prototyping with simulated prosthetic vision
  1.2 Optimization through digital simulations
    1.2.1 Research aims
2 Real-world indoor mobility with simulated prosthetic vision
  2.1 Introduction
  2.2 Materials and methods
    2.2.1 Participants
    2.2.2 Experimental setup
    2.2.3 Image processing
    2.2.4 Phosphene simulation
    2.2.5 Experimental procedure
    2.2.6 Randomization
    2.2.7 Statistical analysis
  2.3 Results
    2.3.1 General results
    2.3.2 Phosphene resolution
    2.3.3 The effect of scene complexity
    2.3.4 The effect of image processing: SharpNet versus CED
    2.3.5 User experience
  2.4 Discussion
    2.4.1 Mobility with simulated cortical prosthetic vision
    2.4.2 The effect of visual complexity
    2.4.3 Feasibility of deep learning-based surface-boundary detection for scene simplification
    2.4.4 Limitations and future directions
  2.5 Conclusion
3 Towards a task-based computational evaluation benchmark
  3.1 Introduction
    3.1.1 Related work
  3.2 Methods
    3.2.1 The virtual mobility setup
    3.2.2 Optimization procedure
    3.2.3 Experiments
  3.3 Results
    3.3.1 Baseline results
    3.3.2 Edge detection thresholds
    3.3.3 Phosphene resolution
  3.4 Discussion
    3.4.1 Primary outcomes
    3.4.2 General considerations
    3.4.3 Conclusion
4 End-to-end optimization of prosthetic vision
  4.1 Introduction
  4.2 Methods
    4.2.1 Model description
  4.3 Experiments and Results
    4.3.1 Training Procedure
    4.3.2 Experiment 1
    4.3.3 Experiment 2
    4.3.4 Experiment 3
    4.3.5 Experiment 4
  4.4 Discussion
    4.4.1 Automated optimization
    4.4.2 Tailored optimization to sparsity constraints
    4.4.3 Task-specific optimization for naturalistic settings
    4.4.4 Tailored optimization to realistic phosphene mappings
    4.4.5 Limitations and future directions
  4.5 Conclusion
5 Towards biologically plausible phosphene simulation
  5.1 Introduction
    5.1.1 Background and related work
  5.2 Materials and methods
    5.2.1 Visuotopic mapping
    5.2.2 Phosphene size
    5.2.3 Phosphene brightness
    5.2.4 Stimulation threshold
    5.2.5 Temporal dynamics
    5.2.6 Parameter estimates
  5.3 Results
    5.3.1 Biological plausibility
    5.3.2 Performance
    5.3.3 Usability in a deep learning SPV pipeline
  5.4 Discussion
    5.4.1 Validation experiments
    5.4.2 End-to-end optimization
    5.4.3 General limitations and future directions
  5.5 Conclusion
6 Gaze-contingent processing improves mobility performance
  6.1 Introduction
  6.2 Materials and Methods
    6.2.1 Participants
    6.2.2 Materials
    6.2.3 Phosphene Simulation
    6.2.4 Experimental conditions
    6.2.5 Experiment 1: Obstacle Avoidance
    6.2.6 Experiment 2
    6.2.7 Study Outcomes
    6.2.8 Data Analysis
  6.3 Results
    6.3.1 Primary outcomes
    6.3.2 Secondary outcomes
  6.4 Discussion
    6.4.1 Benefits of gaze contingency with eye tracking
    6.4.2 Superior performance in gaze-ignored simulation
    6.4.3 Implications of neglecting gaze in simulations
    6.4.4 Learning effects
    6.4.5 Strategies
    6.4.6 Limitations and future directions
  6.5 Conclusion
7 Summary
8 General discussion
  8.1 Naturalistic prototyping of hardware & software
  8.2 Automated optimization with virtual implant users
  8.3 The biology of visual neurostimulation
  8.4 Further considerations
    8.4.1 Societal and ethical implications
  8.5 Limitations and future directions
  8.6 Conclusion
Bibliography
Code and Data
Acknowledgements
Nederlandse Samenvatting
Portfolio
  Curriculum Vitæ
  Courses, Conferences, Workshops
  Student Co-supervision
  Public Outreach
List of Publications
Donders Graduate School
Preface

These are lucky times for a curious scientist. Contemporary research is marked by fast-paced innovations in biotechnology, neuroscience and artificial intelligence (AI). Like ripples in a pond, each of these developments instigates further scientific results and new questions in an ever-expanding chain of events. Aside from possible unease or uncertainty, the recent scientific and technological developments in AI and neurotechnology bring incontestable opportunities for advancing healthcare, progressing scientific knowledge and improving human well-being.

Context: three intertwined research fields
As a neuroprosthetics researcher, my work was situated at the intersection of three research fields: biomedical engineering, artificial intelligence and fundamental neuroscience. These research fields are synergistically intertwined (see Figure 1), but it is relevant to distinguish their separate connections with the interdisciplinary work in this dissertation, to provide some broader context and background to this thesis.

Figure 1: Neuroprosthetics research is situated at the intersection of neuroscience, artificial intelligence and medical engineering. These fields are characterized by their own principal research questions, but profit from a synergistic collaboration.

From a biomedical engineering perspective, neuroprosthetic interfaces are an illustrative life-improving (hardware) technology. Many steps have been undertaken towards safe and durable electrode interfaces, opening the road for artificial communication with the brain (Panuccio et al., 2018). Some examples include brain reading interfaces that can help restore motor function in quadriplegia (Andersen et al., 2019), as well as brain stimulation interfaces for Parkinson's disease, epilepsy, psychiatric symptoms or sensory disorders (Kennedy et al., 2011; Kohl et al., 2014; Li & Cook, 2018; Limousin & Foltynie, 2019). In the sensory domain, especially the success of cochlear
implants for restoring speech perception (Zeng, 2022) can form an inspiration. Visual neuroprostheses are following similar steps in their development (Fernández et al., 2020) and are expected to become a clinical reality in the near future.

From the neuroscience perspective, neural interfaces can improve our knowledge about the brain. Age-old questions regarding the brain's machinery are becoming easier to address by opening the 'black box'. Canonical work by Hubel and Wiesel (1962) already studied the fundamental processing functions responsible for understanding our visual surroundings by recording neuronal responses in kitten brains with micro-electrodes. Likewise, more complex processing functions like motion detection can be explored using electrical perturbations (Britten & Van Wezel, 1998). With the recent progress in hardware technology and computational modeling (e.g., see Dado et al., 2022; Le et al., 2022), visual prosthetics and other neural interfaces can help us learn more about the neural representations in our brain than ever before.

From an artificial intelligence perspective, understanding our brain and natural intelligence is also an important goal. However, in this field a more model-based approach is commonly adopted, as summarized in a famous quote of physicist Richard Feynman: "What I cannot create, I do not understand". A canonical example is the work of Fukushima (1980), who created an algorithmic model of the visual system to understand how it can detect objects, forming an important basis for contemporary artificial neural networks. While, for most users, these algorithms serve a different purpose than brain modeling, they are still considered an invaluable resource in that context (Richards et al., 2019). Deep neural networks have a remarkable brain-like capacity to store visual information in hierarchically organized abstract representations (Güçlü & van Gerven, 2015). And to close the circle: it is this brain-like design and their remarkable performance in visual tasks that make deep learning software a useful resource for creating intelligent visual prosthetics.

One central theme: digital simulations
The projects in this dissertation are of a diverse nature, but nevertheless there is a clear central theme. This thesis explores how prosthetic researchers - just like the curious dolphins on the cover of this thesis - can adopt simulations for learning and optimization. Dolphins use play as a safe form of practice for complex behaviour (Kuczaj & Eskelinen, 2014). Researchers and engineers build models of reality (prototypes) to safely test hypotheses in an early stage of development. The term digital simulations is used in this dissertation as a purposely broad term that encompasses various prototyping tools. These include virtual reality technology (for ultra-realistic, controlled environments), simulated prosthetic vision images (to recreate the experienced percept of visual prosthesis users) or deep neural networks (for creating virtual patients to evaluate and optimize prosthetic design parameters). More on the use of digital simulations for prototyping and optimization of prosthetic designs is discussed in the next chapters of this dissertation.
By conveying the potential and the possible limitations of digital simulations, I want this dissertation to be a practical resource for scientists and prosthetic engineers working on visual prosthetics, contributing to our scientific understanding of the opportunities and challenges for developing effective prosthetic healthcare technology.
1 Introduction
The loss of eyesight can be among the most impactful events of a lifetime, and is associated with radical changes in lifestyle, impaired autonomy and an increased risk of depression and even suicide (De Leo et al., 1999). For these reasons, the restoration of vision to the blind has been a centuries-old aspiration in the field of medicine. Over the past decades, some promising progress has been made in the development of visual neuroprostheses that supply a functional form of visual perception through electrical stimulation of neurons in the visual pathway (Bak et al., 1990; Barry et al., 2023; Brindley & Lewin, 1968; Chen et al., 2020; Dobelle & Mladejovsky, 1974; Fernández et al., 2021; Humayun et al., 2003; Kelly et al., 2011; Keserü et al., 2012; Lowery et al., 2015; Menzel-Severing et al., 2012; Oswalt et al., 2021; Panetsos et al., 2011; Pezaris & Reid, 2007; Stingl et al., 2013). The artificially created prosthetic percept is relatively elementary compared to natural vision, but the expectations are that visual prosthetics will re-enable many visually-guided activities of daily living, supporting self-dependence and improving the users' quality of life (Beyeler & Sanchez-Garcia, 2022; Fernández et al., 2020).

There are many challenges ahead for the development of visual prostheses. Besides the technical hurdles for creating a safe and durable interface with the visual cortex (e.g., see Fernández and Botella, 2018), there are many remaining questions regarding the encoding of visual signals into the brain. For instance, what information should be conveyed to create an interpretable visual percept that can support daily-life visually-guided activities? In what manner is the functional quality of the prosthetic vision influenced by contextual parameters such as the visual environment or design features of the implant? How can the encoding of visual information be evaluated, optimized and tailored to deal with this multitude of contextual parameters? These questions form the basis of the research in this dissertation.

1.1. Background
1.1.1. Core components and mechanism of action
Although there are distinct variations among different prosthetic designs, they generally follow the same principles of operation. The functioning of visual prostheses can be understood from the constituent main components: a camera captures information from the surroundings; a mobile computer processes the camera frames and calculates an electrical stimulation protocol; the neural interface receives the stimulation protocol and accordingly activates populations of neurons in the visual pathway using electrical stimulation (see Box 1 for a visualization).

1.1.2. Phosphene perception
The electrical activation of visual neurons induces the visual experience of a localized light flash, called a phosphene - a percept similar to 'seeing stars' after a sneeze or after standing up too quickly. Importantly, visual prostheses make use of the topographic organization of the visual system: the location of electrical stimulation in the brain influences the visual location where the phosphene is perceived. By stimulating a subset of electrodes at different locations, one can activate a controlled subset of neural populations, resulting in a specific pattern of phosphenes. This basic pattern of phosphenes can be used to create an informative representation of the visual surroundings (see Figure 1.1).
Box 1: The components of a visual prosthesis
1. A camera captures the visual information from the surroundings. While alternative solutions such as intra-ocular sensors are being explored, most designs feature a head-mounted camera, worn on a pair of glasses. The implications of using head-steered visual input are further investigated in Chapter 6.
2. The mobile computer processes the camera frames and decides how the brain should be stimulated. The mobile computer makes use of intelligent scene processing algorithms that extract the most relevant visual features from a scene. The evaluation of efficient scene processing algorithms is a central theme in this dissertation, and plays a particularly prominent role in Chapters 2 to 4.
3. The neural interface consists of electrode arrays that stimulate neurons in the visual pathway. Our research focuses on intracortical prosthetic interfaces that have many electrodes implanted in the primary visual cortex (V1). The perceptual effects of electrical stimulation in the primary visual cortex are studied in Chapter 5.
Elements in this figure were created with generative AI (see Sauer et al., 2022). Prompts: "a person with glasses looking sideways" and "a busy city scene with a tram".
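To make the division of labor between these components concrete, the following minimal sketch traces one pass through the camera-computer-interface loop. It is an illustration only: the electrode count, the brightness-based activation rule and the stimulate() call are hypothetical placeholders, not the software developed in this thesis.

    import cv2
    import numpy as np

    N_ELECTRODES = 1024  # hypothetical count for an intracortical array

    def frame_to_stimulation(frame: np.ndarray) -> np.ndarray:
        """Mobile-computer stage: reduce a camera frame to one on/off value per electrode."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        side = int(np.sqrt(N_ELECTRODES))        # assume a square electrode grid
        coarse = cv2.resize(gray, (side, side))  # one image cell per electrode
        return (coarse > 128).astype(np.uint8)   # naive brightness-based activation

    cam = cv2.VideoCapture(0)                    # stand-in for the head-mounted camera
    ok, frame = cam.read()
    if ok:
        protocol = frame_to_stimulation(frame)
        # neural_interface.stimulate(protocol)   # placeholder: hardware-specific call
    cam.release()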
Figure 1.1: By selectively stimulating a subset of electrodes, a pattern of phosphenes (left) can be induced to shape a simplified visual representation of the surroundings (right).

1.1.3. Limitations of phosphene vision
There are many ways in which phosphene vision differs from natural vision, and the goal is not to recreate natural sight. Prosthetic percepts lack stereovision, and the color of the phosphenes cannot be controlled. Other factors, such as brightness and size, can be modulated to some extent, but this requires precise control of the electrical stimulation parameters. Possibly the most striking limitation of prosthetic vision compared to natural vision is the restricted resolution and field of view. Although these properties can to some extent be influenced by improving the implant design, the achieved resolution will not be comparable to that of natural vision. Note that some design characteristics, such as the location and number of implanted electrodes, may also depend on surgical restrictions. Altogether, the different nature of prosthetic vision and the ongoing developments in hardware design choices make it difficult to predict the functional outcomes, warranting further investigation.

1.1.4. The relevance of scene simplification
For achieving a functional form of vision, it is essential to optimize the (restricted) information transfer with scene processing software on a mobile computer. The goal is to summarize the most relevant visual features of the complex visual surroundings into a simplified, informative representation that is conveyed through phosphene vision. The choice of image processing algorithm is not trivial, and research is investigating a wide variety of solutions. At one end of the spectrum there are well-established basic image processing algorithms like thresholding, histogram equalization or edge detection (e.g., see Boyle, 2008). At the other end, ongoing research is exploring more intelligent software that can selectively extract task-relevant visual information, including depth estimation, saliency detection or semantic segmentation (e.g., see Han et al., 2021), among many other possibilities.
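As a minimal illustration of the basic end of this spectrum, each of the operations named above is a one-liner in OpenCV (the input file name and threshold values are placeholders):

    import cv2

    img = cv2.imread("indoor_scene.png", cv2.IMREAD_GRAYSCALE)  # placeholder input

    _, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)  # intensity thresholding
    equalized = cv2.equalizeHist(img)                            # histogram equalization
    edges = cv2.Canny(img, 100, 200)                             # edge detection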
1.1.5. Deep neural networks for prosthetic vision
In the recent literature, a particular interest is directed towards deep neural networks (DNNs). Partly owing to the increased availability of (mobile) computational resources and massive datasets, this class of artificial intelligence algorithms has gained a lot of ground in industry and research. Inspired by their biological counterpart, DNNs consist of basic neuron models that are connected through a network of weighted connections. By 'learning' a suitable combination of weights through the observation of thousands or millions of data examples, these highly non-linear networks can be trained to perform complex tasks. The success of DNNs in many visual tasks makes them an attractive choice for prosthetic engineers, and a large variety of pre-trained networks is readily available. What these different models have in common is their brain-like ability to condense intricate visual inputs into low-dimensional abstract representations - a convenient property for addressing the scene-simplification problem. Another notable feature of the deep learning framework is its ability to optimize a vast combinatorial space of millions of parameters during training. As explained in later chapters, this property is essential for the computational evaluation of the multitude of design factors that influence the visual quality of prosthetic vision.

1.1.6. Prototyping with simulated prosthetic vision
Eventually, both software and hardware features collectively determine the functional outcome of a visual prosthesis. Importantly, the evaluation of different design choices through an iterative process of medical testing can be a challenging, expensive, and time-consuming undertaking. Hence, the use of simulated prosthetic vision (SPV) paradigms has gained interest among engineers and researchers. In SPV, a visual rendering is created (an image or video frame) that mimics the expected phosphene percept experienced by a prosthesis user. The quality of this simulated phosphene percept can be assessed with the involvement of sighted study participants, offering a non-invasive and cost-effective evaluation method. Since simulations can model beyond the current clinical situation, they are able to assess potential future implant designs and can thereby guide further advancements in the development process.

1.2. Optimization through digital simulations
1.2.1. Research aims
While the use of simulation studies has by all means accelerated the cycle of hypothesis forming and experimental testing, there are many aspects that require further exploration. The work presented in the current dissertation embraces the potential of digital simulations in a broad sense - using virtual reality, simulated prosthetic vision and deep neural networks - to address several challenges in the scientific literature.

First of all, there is a consensus that SPV research should move towards more immersive simulations, testing complex real-life tasks. The research in the current dissertation aims to narrow the gap between testing visual function and functional vision: basic visual function (e.g., finding edges or shapes) can be tested in a relatively controlled setting, while more complex paradigms can help to investigate how functional prosthetic vision can support daily life activities to increase the autonomy of the user. The simulation studies presented in Chapters 2, 3 and 6 of this dissertation address the functional requirements for mobility and orientation, as an exemplary case of complex daily life activities.

Secondly, many of the design features, in particular the scene simplification software, are based on handcrafted strategies and intuitive assumptions.
The benefits of proposed approaches are difficult to compare due to their large variety, and it is likely that many of these strategies provide benefits in only a very limited context and subset of tasks. Finding a more unified and automated optimization framework for context-adaptive image processing is one of the crucial challenges in the field. The studies in Chapters 3 to 5 discuss proof-of-principle experiments with a more automated simulation framework with virtual patients for the optimization of prosthetic vision.

Thirdly, one of the inherent pitfalls of simulation frameworks is that - by definition - they are a stylized, conceptual model of reality. In the process of shifting the focus towards simulation frameworks, it is easy to overlook the clinical reality. Many of the biological insights from brain stimulation studies are not reflected in the existing simulated cortical prosthetic vision research. While most SPV studies in the current dissertation use idealized phosphene simulations, the studies presented in Chapters 5 and 6 address several biologically relevant aspects of cortical stimulation to improve the realism and translational value of simulation research.
2 Real-world indoor mobility with simulated prosthetic vision

Abstract
Neuroprosthetic implants are a promising technology for restoring some form of vision in people with visual impairments via electrical neuro-stimulation in the visual pathway. Although an artificially generated prosthetic percept is relatively limited compared to normal vision, it may provide some elemental perception of the surroundings, re-enabling daily living functionality. For mobility in particular, various studies have investigated the benefits of visual neuroprosthetics in a simulated prosthetic vision paradigm, with varying outcomes. Previous literature suggests that scene simplification via image processing, and particularly contour extraction, may potentially improve mobility performance in a virtual environment. In the current simulation study with sighted participants, we explore both the theoretically attainable benefits of strict scene simplification in an indoor environment, by controlling the environmental complexity, and the practically achieved improvement with a deep learning-based surface boundary detection implementation compared to traditional edge detection. A simulated electrode resolution of 26 × 26 was found to provide sufficient information for mobility in a simple environment. Our results suggest that for a lower number of implanted electrodes, the removal of background textures and within-surface gradients may be beneficial in theory. However, the deep learning-based implementation for surface boundary detection did not improve mobility performance in the current study. Furthermore, our findings indicate that for a higher number of electrodes, the removal of within-surface gradients and background textures may deteriorate, rather than improve, mobility. Therefore, finding a balanced amount of scene simplification requires a careful trade-off between informativity and interpretability that may depend on the number of implanted electrodes.

This chapter is published as de Ruyter van Steveninck, J., van Gestel, T., Koenders, P., van der Ham, G., Vereecken, F., van Gerven, M., Güçlütürk, Y., & van Wezel, R. (2022). Real-world indoor mobility with simulated prosthetic vision: The benefits and feasibility of contour-based scene simplification at different phosphene resolutions. Journal of Vision, 22(2), 1-14. https://doi.org/10.1167/JOV.22.2.1
2.1. Introduction
Blindness is a common disability that causes impaired daily living functionality and reduces quality of life (Kempen et al., 2012; Stevens et al., 2013). Amongst all daily life activities, mobility and obstacle avoidance are often reported to be the most problematic (van der Geest & Buimer, 2015). For many cases of blindness there currently exists no effective treatment. However, neuroprosthetic implants are a promising technology for restoring some form of vision via electrical neuro-stimulation in the visual pathway (Chen et al., 2020; Fernández et al., 2020; Lewis et al., 2015; 2016; Pezaris & Reid, 2007; Riazi-Esfahani et al., 2014; Roelfsema et al., 2018; Shepherd et al., 2013; Tehovnik & Slocum, 2013; Tehovnik et al., 2009). Using multiple electrodes, such implants can activate a specific arrangement of visual neurons, based on camera input. This neural stimulation elicits a perceived pattern of localized point-like flashes of light, referred to as phosphenes, which can be used to represent the surroundings. The larger the number of implanted electrodes, the more phosphenes can be elicited. In this study we focus on cortical implants, which, compared to other types of implants such as retinal implants, are expected to have a wider range of therapeutic applicability (Fernández et al., 2020), are less susceptible to electrical crosstalk (Davis et al., 2012; Wilke et al., 2011), and can accommodate a larger number of electrodes. For instance, recently, Chen et al. (2020) successfully implanted over a thousand cortical electrodes to achieve artificial visual perception in macaque monkeys.

Although the artificially generated prosthetic percept is relatively limited compared to normal vision, it may provide some elementary perception of the surroundings, re-enabling daily living functionality. For mobility in particular, various studies have investigated the benefits of visual neuro-prosthetics in a simulated prosthetic vision (SPV) paradigm with sighted participants. Early work by Cha et al. (1992b) used a perforated mask over a CRT monitor to create pixelized vision and demonstrated that 625 simulated phosphenes may provide sufficient information for visually guided mobility. More recent studies report that adequate mobility performance could be achieved with as few as 325 (Srivastava et al., 2009) or even just 60 (Dagnelie et al., 2007) phosphenes in a simple environment. Note that a conclusive interpretation of these results is complicated by differences in the used mobility task and the realism of the phosphene simulation.

Besides the number of implanted electrodes, another factor that highly influences the usability of prosthetic implants is the choice of image processing protocol that translates visual input into an appropriate electrode activation pattern. The translation of complex visual input into a phosphene percept (which by definition is limited) requires efficient reduction of information and selection of merely the essential visual features for a given task. This can be achieved with the use of traditional computer vision approaches, such as edge detection (Boyle et al., 2001; Dowling et al., 2004; Guo et al., 2018), but deep neural network models have also gained increasing interest from prosthetic engineers (e.g., Bollen et al., 2019a; Bollen et al., 2019b; de Ruyter van Steveninck et al., 2022a; Han et al., 2021; Lozano et al., 2018a; 2020; Sanchez-Garcia et al., 2020).
Various image processing approaches have been proposed for mobility in particular (Barnes et al., 2011; Dagnelie et al., 2007; Dowling et al., 2006; Dowling et al., 2004; Feng & McCarthy, 2013; McCarthy et al., 2013; 2015; Parikh et al., 2013; Srivastava et al., 2009; van Rheede et al., 2010; Vergnieux et al., 2014; 2017; Zapf et al., 2016). A main line of research
amongst these studies is focused on the extraction of geometric structure and object contours for scene simplification. McCarthy et al., for instance, proposed methods for extracting scene structure (McCarthy et al., 2013) and surface boundaries (McCarthy et al., 2011) from disparity data. Based on quantitative and qualitative image analysis, the authors suggest that these methods may improve the interpretability of prosthetic vision and could support obstacle avoidance. To behaviorally evaluate the benefits of such scene simplification approaches, Vergnieux et al. performed experiments with SPV in a virtual environment (Vergnieux et al., 2017). The study found that visual simplification reduces virtual wayfinding performance with normal vision, but improves performance with SPV. The highest performance with SPV was achieved when the scene was reduced to only the surface boundaries (i.e., a wireframe rendering).

The aforementioned literature provides solid evidence that scene simplification, and particularly contour extraction, can help to prevent 'overcrowding' (i.e., transmitting more visual features than can be clearly interpreted from the limited phosphene representation) and improves the interpretability of prosthetic vision in a mobility task. Nevertheless, few attempts have been undertaken to empirically test this in a real-world setup, and there are some remaining questions and challenges. Firstly, complex scenes may contain abundant textures and background gradients, which complicate contour extraction with conventional image processing applications. Although previous work has demonstrated that intelligent scene simplification methods may work in basic virtual environments (Vergnieux et al., 2017) or when evaluated as pre-converted images and videos (Han et al., 2021; Sanchez-Garcia et al., 2020), the implementation of a real-time, effective and practical image processing method in a real-world complex visual environment is a pressing issue that can bring research closer to the clinical situation. Secondly, it is unclear to what extent scene simplification contributes to improved mobility with SPV. Reducing visual information may, on the one hand, increase interpretability by preventing overcrowding, but on the other hand, excessive deprivation of visual information may also lead to impaired mobility, for example because texture is an important cue used in navigation (Gibson, 1950). Explicit investigation of this trade-off between interpretability and informativity for various phosphene resolutions may provide insight into the essential components for visually-guided mobility with prosthetic vision.

In the current study, we empirically evaluate contour extraction in a real-world indoor mobility task using a simulation of cortical prosthetic vision. We test two levels of contour-based scene simplification: an edge-based representation, which extracts visual gradients from all areas of the visual scene, versus a stricter surface-boundary representation, in which all within-surface information and background textures are removed. With this comparison in mind, our experiment is designed to address three study aims: i) To explore the restorable benefits for mobility with prosthetic vision and the required number of implanted electrodes. ii) To examine the theoretically attainable benefits of a stricter surface-boundary representation by removal of all within-surface gradients and background textures.
iii) To test the feasibility of software-based scene simplification using a pre-trained deep neural network architecture for real-time surface-boundary detection.

2.2. Materials and methods
2.2.1. Participants
We recruited 21 participants at the university campus (Radboud University, Nijmegen, the Netherlands) who had no prior experience with simulated phosphene vision. Inclusion
criteria were: absence of mobility impairments, low susceptibility to motion sickness, and normal or corrected-to-normal vision. One participant was unable to perform the experiments due to VR sickness and was therefore excluded from the analysis. The demographics of the remaining 20 participants are listed in Table 2.1. The conducted research was approved by the local ethical committee (REC, Radboud University, Faculty of Sciences) and all subjects gave written informed consent to participate.

  Age [years]:  21 (20.8-23.3)
  Height [m]:   1.84 (1.75-1.87)

Table 2.1: Summary of participant characteristics (n = 20). Median and interquartile range.

2.2.2. Experimental setup
The experiments were situated in a 3-m-wide corridor in the basement of the university building. Two 22-m-long mobility courses were prepared, containing 7 small (30 × 50 × 90 cm) and 6 large (30 × 75 × 180 cm) cardboard boxes that were placed along the corridor and acted as obstacles. In one of the two courses, which we will refer to as the "complex environment" (opposed to the "simple environment"), wallpaper and tape were used to provide supplemental visual gradients to the floor, the walls, and the obstacles (Figures 2.1 and 2.2). A combination of a laptop (Precision 7550, Dell Technologies, United States) and an attached-by-wire head-mounted VR device (Vive Pro Eye, HTC Corporation, Taiwan) was used for the simulation of prosthetic vision. To eliminate trip hazards, the participant was always accompanied by one of the researchers and connection cables were suspended in the air using a rod. Visual input was captured by the inbuilt frontal camera of the headset and was processed using Python (version 3.6.12), making use of the OpenCV (version 4.4.0) image pre-processing library (Bradski, 2000). During the experiments, a low-quality version of the video input and the displayed phosphene simulation was recorded and saved for post hoc inspection. Trial duration and collisions were registered manually. Furthermore, after each trial, participants were asked to provide a subjective rating on a 10-point Likert scale, indicating to what degree they agreed with the statement that in the current condition it was "easy to walk to the end of the hallway whilst avoiding the obstacles". In addition to these primary endpoints, which were measured for every trial, we also gave participants the opportunity to comment on their general experience in an exit survey. Relevant observations are discussed in the results section.

Figure 2.1: Photos of the complex (left) and plain (right) obstacle course. Both environments contained identical cardboard boxes. In the complex environment, additional visual gradients are created with wallpaper and tape.

Figure 2.2: Overview of the obstacle course setup. Yellow boxes indicate large obstacles, green boxes indicate small obstacles. Dashed lines indicate alternative box locations in other random route permutations. Out of all possible route layouts, a selection of seven routes of similar difficulty (based on the shortest path length around the obstacles) was used, as well as their mirrored versions.

2.2.3. Image processing
Input frames were obtained from the inbuilt frontal fisheye camera of the VR device. Each frame was processed separately. The frames were cropped and resized to 480 × 480 pixels and, depending on the experimental condition, either conventional edge detection was performed with the Canny edge detection algorithm (CED) (Canny, 1986), or surface boundary detection using SharpNet. We used the inbuilt OpenCV CED implementation together with prior smoothing using a two-dimensional Gaussian filter. In CED, gradient pixels are accepted as an edge if the gradient is higher than the upper threshold, or if the gradient is between the two thresholds and connected to a pixel that is above the upper threshold (Canny, 1986). Based on qualitative visual assessment and prior pilot experiments, we determined the optimal lower and upper thresholds for our environment to be 25 and 50 (out of 255), respectively, in combination with a sigma parameter of 3.0 for the Gaussian smoothing. For the surface boundary detection, we used the publicly available implementation of the SharpNet model as described in Ramamonjisoa and Lepetit (2019), which was pre-trained on the NYUv2 dataset (Silberman et al., 2012). On our laptop (graphical processing unit: NVIDIA Quadro RTX 4000) this model achieved a framerate of 18.3 Hz (standard deviation: 4.15 Hz) using the PyTorch framework (version 1.6.0) (Mazza & Pagani, 2017). In addition to the raw object boundary prediction, the SharpNet model provides an estimation of depth and surface normals. To achieve optimal results, we combined the contours on the surface normal estimation map with the thresholded boundary prediction, which yielded the best performance in pilot experiments. The optimal threshold for the boundary detection was determined at 94 (out of 255). A visualization of the image pre-processing for one example frame can be found in Figure 2.3.

Figure 2.3: Visualization of the image processing steps. A) Input image. B) Blurred image, using Gaussian smoothing. C) Edge mask, produced using the Canny algorithm. D) Simulated phosphene vision, based on the Canny edge mask. E) Surface normals prediction by the SharpNet deep learning model. F) Surface boundary prediction by the SharpNet deep learning model. G) Surface boundary mask produced using the SharpNet predictions. H) Simulated phosphene vision, based on the surface boundary mask.
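The CED preprocessing described above can be approximated in a few lines of OpenCV. This is a sketch, not the study's actual code; in particular, deriving the Gaussian kernel size from the sigma parameter is an assumption:

    import cv2

    def preprocess_frame(frame):
        """Center-crop, resize to 480 x 480, blur (sigma 3.0), and run Canny (25/50)."""
        h, w = frame.shape[:2]
        s = min(h, w)
        crop = frame[(h - s) // 2:(h + s) // 2, (w - s) // 2:(w + s) // 2]
        resized = cv2.resize(crop, (480, 480))
        gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
        blurred = cv2.GaussianBlur(gray, (0, 0), sigmaX=3.0)  # kernel size derived from sigma
        return cv2.Canny(blurred, 25, 50)  # lower/upper hysteresis thresholds (out of 255)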
2.2.4. Phosphene simulation
Previous literature reports phosphenes as punctate dots with a size of 0.2 to 2 degrees of visual arc (Bak et al., 1990; Schmidt et al., 1996). In our experiments, phosphenes were simulated as white, equally sized Gaussian blobs of roughly 0.3 degrees of visual arc (sigma: 2.0 pixels) on a rectangular grid in the center of the VR display (480 × 480 pixels; roughly 35 degrees of visual arc). The visual field of our phosphene simulation is kept constant throughout the experiment, as are the phosphene sizes. The number of phosphenes, however, is varied across study conditions, which means, by consequence, that the phosphene density also differs across study conditions. Note that wherever we refer to the effects of the phosphene resolution, this should be interpreted as the combined effect of the number of phosphenes and the phosphene density. Phosphenes could only take binary values ('on' or 'off'), as at this time, cortical visual prostheses do not allow for systematic control over phosphene brightness (Najarpour Foroushani et al., 2018; Troyk et al., 2003). To mimic biological irregularities in phosphene mapping, distortion was added to the grid locations and a minor (temporally constant) variation was applied to the brightness of individual phosphenes.
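A minimal sketch of such a simulator is given below. The jitter and brightness-variation magnitudes, and the rule that a phosphene switches on whenever its grid cell contains an edge pixel, are illustrative assumptions rather than the study's exact implementation:

    import numpy as np

    def simulate_phosphenes(edge_map, n=26, size=480, sigma=2.0, seed=0):
        """Render binary phosphenes as Gaussian blobs on a jittered n x n grid."""
        rng = np.random.default_rng(seed)
        ticks = np.linspace(size / (2 * n), size - size / (2 * n), n)
        xs_c, ys_c = np.meshgrid(ticks, ticks)
        centers = np.stack([xs_c.ravel(), ys_c.ravel()], axis=1)
        centers += rng.normal(0.0, 2.0, centers.shape)  # assumed spatial jitter (pixels)
        gains = 1.0 - 0.2 * rng.random(len(centers))    # assumed fixed brightness variation
        ys, xs = np.mgrid[0:size, 0:size]
        frame = np.zeros((size, size), np.float32)
        cell = size // n
        for (cx, cy), g in zip(centers, gains):
            x0, y0 = int(cx - cell / 2), int(cy - cell / 2)
            patch = edge_map[max(y0, 0):y0 + cell, max(x0, 0):x0 + cell]
            if patch.size and patch.max() > 0:  # binary activation: any edge in the cell
                frame += g * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        return np.clip(frame, 0.0, 1.0)

Chained after the preprocess_frame sketch above, this yields a minimal end-to-end SPV renderer.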
2.2.5. Experimental procedure
The experiment was partitioned into three sessions, starting with a training session (ca. 20 minutes) containing practice trials with the full experimental setup, to allow the participants to get acquainted with the simulated phosphene vision and the experimental task. The remaining two sessions started with two control trials, in which normal vision was simulated by directly displaying the camera input on the VR device, followed by 8 different phosphene conditions. The total duration of the experiment was 2.5-3 hours. The study conditions were designed to facilitate three types of comparisons, which correspond to our study aims: i) To obtain a general measure of restorability of mobility performance and an indication of the required number of implanted electrodes, we compared the mobility performance with SPV at six different phosphene resolutions to the performance in a control condition with normal camera vision (see Figure 2.4). ii) To examine the theoretically attainable benefits of a stricter surface-boundary representation where within-surface gradients and background are removed, we compared the performance with Canny edge detection in the complex versus the plain visual environment. iii) To assess the feasibility of obtaining such strict scene simplification with real-time deep learning-based surface boundary detection, we tested SharpNet at two different phosphene resolutions in both the complex and plain visual environment. Here, the SharpNet model was evaluated against CED as a control condition. Based on the aforementioned literature on overcrowding, we hypothesize that mobility in the complex environment with a low phosphene resolution (such as 26 × 26 phosphenes) can be improved with deep learning-based scene simplification compared to basic image processing with Canny edge detection. An overview of the study conditions can be found in Table 2.2.

  Camera vision        CED-based SPV        SharpNet-based SPV
  2 sessions ×         2 sessions ×         2 sessions ×
  2 environments ×     2 environments ×     2 environments ×
  1 simulation:        6 simulations:       2 simulations:
    no SPV               10×10                26×26
                         18×18                42×42
                         26×26
                         34×34
                         42×42
                         50×50
  Total: 4 trials      Total: 24 trials     Total: 8 trials

Table 2.2: Overview of study conditions and corresponding number of trials. CED: Canny edge detection; SPV: simulated prosthetic vision.

Note that a representative selection of these conditions was practiced in the training session (i.e., both low and high phosphene resolutions, both environmental complexities and both image processing methods). At the beginning of each trial, an auditory start cue was presented to the participants. To encourage the maximal performance achievable, as limited by the visual input, instructions were to walk as fast as possible whilst avoiding the obstacles. Between each trial, the obstacles were systematically shuffled to match one of seven pre-defined route layouts.

2.2.6. Randomization
In an effort to minimize systematic bias due to learning effects, or due to characteristics of the route layout, both the order of all phosphene simulations and the order of the route layouts were randomized. For corresponding phosphene simulation conditions, the route layouts were matched but mirrored across the two different visual complexity conditions. Similarly, to allow for a clean comparison between the two image pre-processing methods, the route layouts were matched between the SharpNet and corresponding CED conditions.
2.2.7. Statistical analysis
Statistical analysis was performed using the SciPy statistics toolbox (version 1.3.2) for Python (Virtanen et al., 2020). All three endpoint parameters were standardized within participants (i.e., the mean was subtracted and the results were divided by the standard deviation) to reduce variance caused by inter-individual differences in walking speed, avoidance strategy and subjective experience. The endpoint parameters were found to be non-normally distributed across participants, as assessed with the Shapiro-Wilk test. Statistical hypothesis testing was performed using the Wilcoxon signed-rank test. Alpha was set at 0.05 and adjusted with the Bonferroni method for multiple planned comparisons. Six tests were performed to assess the effect of scene complexity with CED SPV at each phosphene resolution. Four tests were performed to compare surface boundary detection with SharpNet against edge detection with CED in each sub-condition that was measured (i.e., two phosphene resolutions and two scene complexities).
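The analysis pipeline described above can be sketched as follows (an illustration with placeholder data, not the study's actual analysis code):

    import numpy as np
    from scipy import stats

    def standardize_within(values, participant_ids):
        """Z-score an endpoint within each participant to remove individual offsets."""
        values = np.asarray(values, dtype=float)
        ids = np.asarray(participant_ids)
        z = np.empty_like(values)
        for pid in np.unique(ids):
            m = ids == pid
            z[m] = (values[m] - values[m].mean()) / values[m].std()
        return z

    # Placeholder data: standardized trial durations for 20 participants in two
    # paired conditions (e.g., complex vs. plain environment at one resolution).
    rng = np.random.default_rng(0)
    z_complex = rng.normal(0.3, 1.0, 20)
    z_plain = rng.normal(-0.3, 1.0, 20)

    stat, p = stats.wilcoxon(z_complex, z_plain)  # paired, non-parametric test
    significant = p < 0.05 / 6  # Bonferroni-adjusted alpha for six planned comparisons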
2.3. Results 2 17 collision frequency did not reveal an effect of trade-off between the number of collisions (accuracy) and trial duration (speed). Overall Control condition Mean (st. dev.) Mean (st. dev.) Trial duration (s) 31.02 (13.79) 16.74 (4.438) No. of collisions 0.879 (1.526) 0 (0) Subjective rating 6.130 (2.331) 9.363 (0.660) Table 2.3: Descriptive statistics of the overall results and the control condition with camera view. Std. = Standard deviation 2.3.2. Phosphene resolution The results for the CED trials and the control trials with camera vision are visualized inFigure 2.5. Assuming an absolute minimal performance at a resolution of 10 × 10 in the complex environment and defining maximal performance as the result obtained with normal vision, more than half the performance is restored at a resolution of 26 × 26 phosphenes (59.2% for trial duration, 90.2% for number of collisions and 52.6% for subjective rating) in the simple visual environment. At the same resolution of 26 × 26 phosphenes the performance was lower in the complex visual environment (52.2%, 74.1% and 48.1%, respectively), which is effectively similar to the performance in the simple condition at a lower resolution of 18 × 18 (46.2%, 81.5% and 38.8%, respectively). Figure 2.5: Mobility performance with Canny edge detection-based simulated prosthetic vision. The simulated number of phosphenes is plotted against standardized trial duration (left), standardized number of collisions (middle) and standardized subjective rating (right). Scene complexity is controlled by comparing a simple environment with plain cardboard boxes against a complex environment with additional background and surface textures. The dashed line indicates the average result for the control condition without simulated prosthetic vision (i.e., normal camera vision). Asterisk (∗) indicates p < 0.0125, double asterisk (∗∗) indicates p <0.0025. 2.3.3. The effect of scene complexity The p-values for the Wilcoxon signed-rank test on the effect of scene complexity with canny edge detection are displayed inTable2.4. Overall, the complexity-related reduction in performance was found for all lower phosphene resolutions, as evidenced by significant larger trial durations (at resolutions 10 × 10 and 18 × 18), more obstacle collisions (at resolutions 10 × 10, 18 × 18 and 26 × 26) and lower ratings (at resolutions 10 × 10 and 18 ×