actions. Before the optimization starts, the replay memory is filled with 1500 transitions based on random actions. During training, adequate exploration is promoted by introducing stochasticity in the action selection with the epsilon-greedy approach. Here, the action $a_t$ at optimization step $t$ is evaluated as

$$
a_t =
\begin{cases}
\operatorname*{argmax}_{a}\, Q(s_t, a), & \text{if } x \geq \epsilon_t\\[2pt]
\text{random choice } a_t \in A, & \text{otherwise,}
\end{cases}
\tag{3.2}
$$

where $x$ is a random number sampled uniformly from $[0, 1]$. The value of $\epsilon_t$ decays linearly over the course of training:

$$
\epsilon_t = \max\bigl(\epsilon_{\text{end}},\ \epsilon_{\text{start}} + \delta_\epsilon\, t\bigr),
\tag{3.3}
$$

taking a start value $\epsilon_{\text{start}} = 0.95$, a final value $\epsilon_{\text{end}} = 0.05$ and a linear decay step $\delta_\epsilon = -0.000112$. This entails that early in training, actions are chosen mostly at random, whereas later in training, actions are chosen based on the Q-value estimates, i.e., the actions expected to yield the highest reward.

Validation and testing
During the optimization procedure, the performance of the agent was evaluated every 25 episodes in a validation environment. The validation environment used obstacle configurations similar to those of the training environment, but the obstacle locations were not randomized, to enable comparison across validation runs. After finishing the training, the performance was tested in a fixed testing environment, which used a different obstacle layout than the validation environment. During validation and testing, the agent was configured to always take the action with the highest expected reward (i.e., $\epsilon_t$ was temporarily set to 0; see Equations (3.2) and (3.3)). The state transitions in the validation environment were not saved to the replay memory, to prevent the agent from training on validation data.

3.2.3. Experiments
Baseline experiment
To evaluate whether our RL-based agent can adequately navigate the virtual hallway and avoid obstacles, we performed a baseline training with natural vision. In this baseline condition, the original grayscale frames were used as input to the agent, without phosphene simulation or scene simplification. As another baseline condition, we performed a training without visual feedback, in which the agent received random grayscale pixels as visual input. To evaluate the performance, we registered the number of box collisions in a fixed testing environment, as well as the obtained reward. Furthermore, to evaluate which visual information is most relevant for the actions of the model, we performed an input-perturbation analysis. In this analysis, we systematically occluded the input image at different locations with a local-average circular mask, using a mask radius of 30 pixels and an equally spaced grid of 64×64 masking locations across the visual field. To measure the relevance of the masked visual region, we calculated the change in Q-value prediction for the 'forward step' action,
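A minimal sketch of such an occlusion-based relevance analysis is given below. It assumes a NumPy grayscale frame and a hypothetical helper `q_forward(img)` that returns the Q-value predicted for the 'forward step' action; all names are illustrative and not part of the original implementation.

```python
import numpy as np


def occlusion_relevance(frame: np.ndarray, q_forward, radius: int = 30,
                        grid: int = 64) -> np.ndarray:
    """Relevance map via local-average circular occlusion.

    `frame` is a grayscale image of shape (H, W); `q_forward(img)` is
    assumed to return the Q-value of the 'forward step' action.
    """
    h, w = frame.shape
    baseline = q_forward(frame)
    ys, xs = np.ogrid[:h, :w]
    relevance = np.zeros((grid, grid))
    # Equally spaced grid of mask centres across the visual field
    for i, cy in enumerate(np.linspace(0, h - 1, grid)):
        for j, cx in enumerate(np.linspace(0, w - 1, grid)):
            mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2
            occluded = frame.copy()
            occluded[mask] = frame[mask].mean()  # local-average occlusion
            # Relevance = change in Q-value caused by occluding this region
            relevance[i, j] = baseline - q_forward(occluded)
    return relevance
```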
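Similarly, the epsilon-greedy action selection with linear decay of Equations (3.2) and (3.3) can be sketched as follows, assuming a PyTorch Q-network; again, the function and parameter names are illustrative only.

```python
import random
import torch

# Hyperparameters from Equations (3.2) and (3.3)
EPS_START, EPS_END, EPS_DECAY = 0.95, 0.05, -0.000112


def epsilon(t: int) -> float:
    """Linearly decaying exploration rate (Equation 3.3)."""
    return max(EPS_END, EPS_START + EPS_DECAY * t)


def select_action(q_net: torch.nn.Module, state: torch.Tensor, t: int,
                  n_actions: int, greedy: bool = False) -> int:
    """Epsilon-greedy action selection (Equation 3.2).

    `greedy=True` corresponds to the validation/testing setting (epsilon = 0).
    """
    eps = 0.0 if greedy else epsilon(t)
    if random.random() >= eps:
        with torch.no_grad():
            return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
    return random.randrange(n_actions)
```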
