MATERIALS AND METHODS

We consider the task of segmenting the temporal phases of a surgical procedure from recorded video frames. The main feature of our proposed formulation is visualized in Fig. 1. While previous work attempts to classify every frame of a video according to a surgical phase label, we instead predict the frame indices of phase transitions. More specifically, for a surgical procedure with $N$ different phases, our goal is to predict the frame indices where each phase starts, $\{f_1^b, f_2^b, \ldots, f_N^b\}$, and where each phase ends, $\{f_1^e, f_2^e, \ldots, f_N^e\}$. If we can assume that surgical phases are continuous intervals, as is often the case, then our approach enforces this property by design. This is unlike previous frame-based approaches, where spurious transitions are unavoidable under noisy predictions. To solve this problem we propose the TRN, which we describe next.

Fig. 1. Comparison of network architectures between (a) a conventional model and (b) our proposed model, with an illustration of potential errors. The conventional model assigns a label to each individual frame, whereas our proposed model predicts the frame indices of the start and end positions of each phase.

Transition Retrieval Network (TRN)

Figure 2 shows the architecture of our TRN model. It has three main modules: an averaged ResNet feature extractor, a multi-agent network for transition retrieval, and a Gaussian composition operator that generates the final workflow segmentation result.

Averaged ResNet Feature Extractor. We first train a standard ResNet50 encoder (which outputs a 2048-dimensional feature vector) [8] with supervised labels, in the same way as frame-based models. For a video clip of length $K$, the per-frame features are averaged into a single vector, which temporally down-samples the video during feature extraction (a sketch of this step appears at the end of this subsection). In this work we use $K = 16$.

DQN Transition Retrieval. We first discuss the segmentation of a single phase $n$. We treat it as a reinforcement learning problem with two discrete agents $W^b$ and $W^e$, each being a Deep Q-Network (DQN). These agents iteratively move a pair of search windows of length $L$, centered at frames $f_n^b$ and $f_n^e$. We enforce the temporal constraint that $f_n^b \le f_n^e$ always holds. The state $s_k$ of the agents is represented by the $2L$ features within the search windows, obtained with the averaged ResNet extractor. Based on this state, the agents generate actions $a_k^b = W^b(s_k)$ and $a_k^e = W^e(s_k)$, which move the search windows either one clip to the left or one clip to the right within the video. During network training, we set a +1 reward for actions that move the search window center towards the true transition frame.
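As an illustration of the feature extraction step, the following is a minimal sketch of clip-level feature averaging, assuming a torchvision ResNet50 backbone that has already been trained with supervised phase labels; the function name extract_clip_feature is ours, not part of the original model code.

```python
import torch
import torchvision.models as models

# Minimal sketch of the averaged ResNet feature extractor.
# We assume the ResNet50 backbone was trained with supervised
# phase labels; here its classification head is removed so it
# outputs the 2048-dimensional pooled features.
backbone = models.resnet50()
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_clip_feature(frames: torch.Tensor) -> torch.Tensor:
    """Average per-frame ResNet features over one clip.

    frames: (K, 3, H, W) tensor of K = 16 consecutive frames.
    returns: (2048,) clip-level feature vector.
    """
    per_frame = backbone(frames)   # (K, 2048) per-frame features
    return per_frame.mean(dim=0)   # temporal down-sampling by averaging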
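Similarly, the window-movement and reward logic of the two DQN agents can be sketched as follows. This is a hedged reconstruction based on the description above: the +1 reward for moving towards the true transition follows the text, while the -1 penalty for moving away and the clamping rule used to enforce $f_n^b \le f_n^e$ are our assumptions, and all helper names are hypothetical.

```python
# Hedged sketch of one environment step for the two DQN window agents.
# Actions: 0 = move the search window one clip left, 1 = one clip right.

def step_agent(center: int, action: int, target: int, n_clips: int) -> tuple[int, int]:
    """Move one search window and compute its reward.

    center:  current window-center clip index (f_n^b or f_n^e)
    action:  0 (left) or 1 (right)
    target:  ground-truth transition clip index
    n_clips: total number of clips in the video
    """
    new_center = max(0, min(n_clips - 1, center + (1 if action == 1 else -1)))
    # +1 if the move brings the center closer to the true transition,
    # otherwise -1 (our assumed penalty for the opposite case).
    reward = 1 if abs(target - new_center) < abs(target - center) else -1
    return new_center, reward

def step_pair(fb: int, fe: int, ab: int, ae: int,
              tb: int, te: int, n_clips: int):
    """Step both agents, then enforce the constraint f_n^b <= f_n^e."""
    fb, rb = step_agent(fb, ab, tb, n_clips)
    fe, re = step_agent(fe, ae, te, n_clips)
    if fb > fe:  # clamp so the begin window never passes the end window
        fb = fe
    return fb, fe, rb, re
```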