
…the ground truth transition, and -1 otherwise. The agent therefore learns direction cues from the image features inside the search windows. Since the input to the DQN is a sequence of feature vectors, a 3-layer LSTM of dimension 2048 is introduced into the DQN architecture to encode the temporal features into the action decision process [209]. The LSTM is followed by two fully connected layers (fc1 and fc2): fc1 maps the search-window input of dimension 20L to 50, and fc2 maps the temporal features of dimension 50 to the final two Q-values for 'Right' and 'Left'. We train the network with the standard DQN framework [210]. At inference time, we let the agents explore the video until they converge to a fixed position (i.e., until they cycle between the 'Left' and 'Right' actions). Two important characteristics of this solution should be highlighted: 1) we do not need to extract clip features from the entire video, only enough for the agent to reach the desired transition; 2) the agents need to be initialized at a certain position in the video, which we discuss below.

Fig. 2. TRN architecture with (a) averaged ResNet feature extractor, (b) multi-agent network for transition retrieval and (c) Gaussian composition operator.

Agent Initialization Configurations: We propose two different approaches to initialize the agents: fixed initialization (FI) and ResNet modified initialization (RMI). FI initializes the search windows based on the statistical distribution (average relative position in the video) of each phase transition over the entire training set. With FI, TRN can make predictions without viewing the entire video, saving computation time. RMI, on the other hand, initializes the search windows based on the averaged-feature ResNet-50 predictions, averaging the indices of all candidate transitions to generate an estimate. In this way, we are very likely to obtain more accurate initializations.
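A minimal PyTorch sketch of the Q-network described above may help fix the architecture. The feature dimension (2048, matching pooled ResNet-50 features), the 3-layer LSTM, the 50-unit bottleneck, and the names fc1/fc2 follow the text; the class name and the assumption that fc1 consumes the LSTM output (the text's "dimension 20L" input is ambiguous) are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TransitionDQN(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=2048, fc_dim=50, n_actions=2):
        super().__init__()
        # 3-layer LSTM encodes the sequence of clip features inside
        # the search window into a temporal representation.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=3, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, fc_dim)   # temporal features -> 50
        self.fc2 = nn.Linear(fc_dim, n_actions)    # 50 -> Q('Left'), Q('Right')

    def forward(self, x):
        # x: (batch, window_len, feat_dim) feature vectors of one search window
        out, _ = self.lstm(x)
        h = out[:, -1]  # last time step summarizes the window
        return self.fc2(torch.relu(self.fc1(h)))
```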
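The inference procedure can likewise be sketched under the paper's convergence criterion: an agent has converged once it oscillates between 'Left' and 'Right' around a fixed frame. Here `q_net` and `get_window_features` are hypothetical stand-ins for the trained DQN and the search-window feature extractor, and `max_steps` is an assumed safety budget.

```python
def retrieve_transition(q_net, get_window_features, start_pos, step=1, max_steps=500):
    """Move one agent left/right until it oscillates around a frame."""
    pos, visited = start_pos, []
    for _ in range(max_steps):
        q = q_net(get_window_features(pos))  # Q-values: [Q(Left), Q(Right)]
        action = int(q.argmax())             # 0 = 'Left', 1 = 'Right'
        next_pos = pos + step if action == 1 else pos - step
        # Revisiting a recent position means the agent is cycling
        # between 'Left' and 'Right': treat `pos` as the transition.
        if next_pos in visited[-2:]:
            return pos
        visited.append(pos)
        pos = next_pos
    return pos  # budget exhausted: return the current estimate
```

Note that features are only computed for windows the agent actually visits, which is the first highlighted advantage over processing the full video.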
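The two initialization schemes can be summarized in a short sketch. For FI, each agent starts at the mean relative position of its transition across the training videos; for RMI, it starts at the average index of the frames where the averaged-feature ResNet-50 predicts the corresponding phase change. The function and variable names are assumptions for illustration.

```python
import numpy as np

def fixed_init(train_rel_positions, video_len):
    # train_rel_positions: relative positions (0..1) of one phase
    # transition across all training videos
    return int(np.mean(train_rel_positions) * video_len)

def resnet_modified_init(resnet_phase_preds, phase_a, phase_b):
    # resnet_phase_preds: per-frame phase labels from ResNet-50; every
    # a -> b label change is a candidate transition, and the candidate
    # indices are averaged into a single initialization estimate.
    preds = np.asarray(resnet_phase_preds)
    candidates = np.where((preds[:-1] == phase_a) & (preds[1:] == phase_b))[0] + 1
    if len(candidates) == 0:
        return None  # no candidate found: fall back to FI
    return int(candidates.mean())
```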
