with the walls and obstacles. To formalize this in an RL framework, we have defined a set of actions $a$ that the agent can choose at each point in time $t$, which transform the state of the virtual environment $s_t$ into the next state $s_{t+1}$. Each transition from $s_t$ to $s_{t+1}$ is associated with a reward signal $r_t$. The agent's available actions are 'move forward', 'move to the left', and 'move to the right'. In our environment there are only five (uniquely rewarded) types of state transitions, which are listed in Figure 3.3. Although the virtual hallway can be extended to infinite length, the navigation task is broken down into episodes of 100 meters of forward progression (i.e., 200 forward steps). An episode is considered completed upon reaching this target distance (successful completion of the task) or whenever the agent collides with an obstacle (unsuccessful completion of the task). Both of these cases are considered a final state, which means that there is no transition into a next state; instead, the environment is reset and a new episode is initiated.

Figure 3.3: Schematic illustration of the possible state transitions. The agent can choose from three actions, resulting in one of five types of state transitions. Each type of state transition is associated with a unique reward and next state.

Agent

The RL agent is modeled using the double Q-learning framework (van Hasselt et al., 2016). Our model consists of a basic three-layer convolutional neural network with one fully-connected output layer (see Table 3.1). The network takes as input 128 × 128 × 4 pixels that represent the environment's state. This state representation is created by stacking the last four (grayscale) frames of visual input captured from the environment, a commonly used solution for providing temporal information to the agent (Mnih et al., 2015). In some of the training conditions, the input frames are first converted to simulated phosphene representations, using simulation software from a prior study (de Ruyter van Steveninck et al., 2022b). The output of the model is a set of Q-value predictions, which provide a measure of the estimated reward for each of the possible actions the agent can take. As generally formulated in the Q-learning framework, the agent's performance is determined by its capacity to take actions that maximize the cumulative reward, thereby relying on the Q-value estimates. In the past few years, several improvements have been
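To make the agent description concrete, the sketch below outlines a Q-network with three convolutional layers and one fully-connected output layer operating on the 4 × 128 × 128 frame stack, together with the double Q-learning target of van Hasselt et al. (2016). This is a minimal illustrative sketch, not the implementation used in this chapter: the choice of PyTorch, the kernel sizes and channel counts, the discount factor, and the names QNetwork and double_q_target are assumptions, whereas the actual layer settings are those listed in Table 3.1.

```python
# Minimal sketch (PyTorch assumed) of a DQN-style agent with a double Q-learning
# target. Layer sizes, gamma, and all names are illustrative assumptions.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Three convolutional layers followed by one fully-connected output layer.

    Input:  stack of the last four 128x128 grayscale frames (4 x 128 x 128).
    Output: one Q-value estimate per action (forward, left, right).
    """

    def __init__(self, n_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(  # kernel/channel settings are assumptions
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy input
            n_features = self.conv(torch.zeros(1, 4, 128, 128)).shape[1]
        self.head = nn.Linear(n_features, n_actions)  # fully-connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x))  # Q-value prediction per action


def double_q_target(online: QNetwork, target: QNetwork,
                    r: torch.Tensor, s_next: torch.Tensor,
                    done: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Double Q-learning target (van Hasselt et al., 2016): the online network
    selects the next action, the target network evaluates it."""
    with torch.no_grad():
        a_next = online(s_next).argmax(dim=1, keepdim=True)   # action selection
        q_next = target(s_next).gather(1, a_next).squeeze(1)  # action evaluation
        return r + gamma * q_next * (1.0 - done)               # no bootstrap at final states
```

In this sketch, the `done` flag corresponds to the final states described above (reaching the 100-meter target or colliding with an obstacle), at which the bootstrapped term is zeroed out and the environment is reset.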
