proposed to the deep-Q learning framework (e.g., see (Hessel et al., 2017), which combines many of these). In the current paper we use one of the more basic versions of the deep-Q learning framework, called double DQN, which forms a relatively accessible and stable implementation. Double DQN implements two identical deep neural networks which are asynchronously updated. Compared with the original version of DQN, this approach has been shown to have more stable training properties. For a description of the double DQN approach, please refer to (van Hasselt et al., 2016).

Layer  Type   In     Out  Kernel size  Stride  Normalization  Activation
1      Conv.  4      16   5            2       BatchNorm      ReLU
2      Conv.  16     32   5            2       BatchNorm      ReLU
3      Conv.  32     32   5            2       BatchNorm      ReLU
4      FC     27040  3    -            -       -              -

Table 3.1: The neural network architecture of the Q-learning agent. Conv.: convolutional; FC: fully-connected.

3.2.2. Optimization procedure

Learning objective
The Q-value estimates are optimized through minimization of the temporal difference error

\delta = Q(s_t, a_t) - \left( r_t + \gamma \max_a Q(s_{t+1}, a) \right), \qquad (3.1)

where Q(s_t, a_t) is the estimated Q-value for the current action a_t given the current state s_t, r_t is the obtained reward, γ is the temporal discount factor and max_a Q(s_{t+1}, a) is the maximal Q-value for the next state given all possible actions. In double Q-learning, the Q-values are estimated using two neural networks with identical architecture. The policy network estimates the Q-value for the current state transition and the target network estimates the maximal Q-value given the next state (the first and second terms of Equation (3.1), respectively).

Training
Upon each time step, after the model chooses an action (based on the Q-value predictions), the resulting state transition is saved into a replay memory (i.e., a recollection of ‘prior experiences’). In our experiments, the replay memory contained a history of 25000 transitions (state, action, reward and next state). Each action is followed by an optimization step, in which the model randomly samples a minibatch of 128 transitions from the replay memory. For each transition in the minibatch, the Q-value predictions and the temporal difference error are calculated (see Equation (3.1)). The network parameters of the policy network are optimized using the Adam algorithm (Kingma & Ba, 2015). In the optimization step, only the parameters of the policy network are updated. Once in every 50000 optimization steps, the parameters of the policy network are copied to the target network.

Exploration vs exploitation
To make sure that the model is trained on a wide exploration space, the replay memory should contain a variety of environmental states and both successful and unsuccessful
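For concreteness, the following is a minimal sketch of the Q-network in Table 3.1, assuming a PyTorch implementation; the class name and variable names are illustrative rather than details of our implementation.

```python
# Minimal sketch of the Q-network in Table 3.1 (PyTorch assumed; names illustrative).
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Convolutional Q-value estimator: three conv layers (kernel 5, stride 2,
    BatchNorm + ReLU) followed by a fully-connected layer mapping the 27040
    flattened features to one Q-value per action (3 actions)."""

    def __init__(self, n_actions: int = 3, flattened_size: int = 27040):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5, stride=2), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=5, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
        )
        self.head = nn.Linear(flattened_size, n_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a batch of states with 4 input channels (e.g., stacked frames)
        z = self.features(x)
        return self.head(z.flatten(start_dim=1))
```

Note that the flattened size of 27040 is tied to the input frame resolution; for a different resolution, the input dimension of the fully-connected layer would have to be recomputed.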
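The temporal difference error of Equation (3.1) can be expressed in code as follows, under the same assumptions (PyTorch, illustrative names): the policy network provides Q(s_t, a_t) and the target network provides the maximal Q-value for the next state, as described under 'Learning objective'. The discount factor value and the exact loss applied to δ are not specified in this section and are placeholder choices.

```python
# Sketch of the temporal difference error of Equation (3.1); PyTorch assumed,
# function and argument names illustrative. gamma = 0.99 and the Huber loss on
# delta are placeholder choices, not values taken from the text.
import torch
import torch.nn.functional as F


def td_loss(policy_net, target_net, states, actions, rewards, next_states, gamma=0.99):
    # First term: Q(s_t, a_t) from the policy network, for the actions actually taken
    q_sa = policy_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    # Second term: r_t + gamma * max_a Q(s_{t+1}, a), with the maximum taken over
    # the target network's Q-value estimates for the next state
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
    target = rewards + gamma * q_next
    # delta = q_sa - target; a loss over delta is minimized w.r.t. the policy network
    return F.smooth_l1_loss(q_sa, target)
```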

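The training mechanics described under 'Training' (a replay memory of 25000 transitions, minibatches of 128, Adam updates of the policy network only, and copying to the target network once in every 50000 optimization steps) could be organized roughly as follows. This sketch reuses the illustrative QNetwork and td_loss definitions above; the learning rate is an assumption, as it is not given in this section.

```python
# Sketch of the optimization step described under 'Training'; all names and the
# learning rate are illustrative assumptions.
import random
from collections import deque

import torch

policy_net, target_net = QNetwork(), QNetwork()
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()                      # the target network is never trained directly

replay_memory = deque(maxlen=25000)    # stores (state, action, reward, next_state) tensors
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)  # lr not stated in the text
optim_steps = 0


def optimization_step():
    """Sample a minibatch, update the policy network, and periodically sync the target network."""
    global optim_steps
    if len(replay_memory) < 128:
        return
    batch = random.sample(replay_memory, 128)
    states, actions, rewards, next_states = (torch.stack(items) for items in zip(*batch))
    loss = td_loss(policy_net, target_net, states, actions, rewards, next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                   # only the policy network's parameters are updated
    optim_steps += 1
    if optim_steps % 50000 == 0:       # once in every 50000 optimization steps
        target_net.load_state_dict(policy_net.state_dict())
```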