
Pretrained weights are loaded for transfer learning and fine-tuned for 100 epochs on Cholec80 and Sacrocolpopexy, respectively. The videos are subsampled to 2.4 fps, centre cropped, and resized to a resolution of 224 × 224 to match the input requirement of ResNet-50. These ResNet parameters are shared for RMI, search windows, TeCNO and TransSV in this research. We train both ResNet-50 and the DQN with Adam at a learning rate of 3e-4.211 For ResNet-50, we use a batch size of 100, where phases are sampled with equal probability. For the DQN, the batch size is 128, sampled from a replay memory of size 10000 for each agent.

We tested the performance of the TRN model on two datasets. Cholec80 is a publicly available benchmark that contains 80 videos of cholecystectomy surgeries divided into 7 phases.203 We use 40 videos for training, 20 for validation, and 20 for testing. We also provide results on an in-house dataset of laparoscopic sacrocolpopexy containing 38 videos. It contains up to 8 phases (but only 5 in most cases); however, here we consider the simplified binary segmentation of the phases related to suturing a mesh implant (2 contiguous phases), given that suturing time is one of the most important indicators of the learning curve in this procedure.30 We performed a 2-fold cross-validation with 20 videos for training, 8 for validation, and 10 for testing. For Sacrocolpopexy, we train our averaged ResNet extractor considering all phases, but train a single DQN transition retrieval agent for the suturing phase. We also do not need to apply Gaussian composition, since we are interested in a single-phase classification.

Evaluation Metrics: We use the commonly adopted frame-based metrics for surgical workflow: macro-averaged (per-phase) precision and recall, the F1-score computed from this precision and recall, and micro-averaged accuracy. Additionally, we provide event-based metrics that assess the accuracy of phase transitions. An event is defined as a block of consecutive and equal phase labels, with a start time and a stop time. We define the event ratio as Egt/Edet, where Egt is the number of ground truth events and Edet is the number of events detected by each method. We define a second ratio based on the Ward metric, which allocates events into the sub-categories deletion (D), insertion (I′), merge (M, M′), fragmentation (F, F′), fragmented and merged (FM, FM′), and correct (C) events.212 Here, we denote the Ward event ratio as the proportion of correct (C) events. For both of these ratios, values closer to 1 indicate better performance. Finally, whenever fixed initialization (FI) is used, we also provide a coverage rate, indicating the average proportion of each video that was processed to perform the segmentation. Lower values indicate that fewer features need to be extracted and thus lower computation time.
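The frame preprocessing described above (2.4 fps subsampling, centre cropping, and resizing to 224 × 224) can be sketched as follows. This is a minimal illustration, not the original implementation; the subsampling strategy, the crop to the shorter image side, and the ImageNet normalisation statistics are assumptions.

```python
from PIL import Image
import torchvision.transforms as T

TARGET_FPS = 2.4  # temporal subsampling rate stated in the text

def subsample_indices(num_frames, source_fps, target_fps=TARGET_FPS):
    """Indices of the frames kept when subsampling a video to target_fps (assumed strategy)."""
    step = source_fps / target_fps
    idx = [int(round(i * step)) for i in range(int(num_frames / step))]
    return [i for i in idx if i < num_frames]

def preprocess(frame: Image.Image):
    """Centre crop to a square and resize to 224 x 224 for ResNet-50.
    ImageNet normalisation constants are an assumption (standard for ImageNet-pretrained weights)."""
    short_side = min(frame.size)
    transform = T.Compose([
        T.CenterCrop(short_side),
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    return transform(frame)
```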
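The statement that phases are sampled with equal probability for the ResNet-50 batches can be realised with a weighted sampler, and both networks share the same Adam configuration. The snippet below is a sketch under those assumptions; `balanced_phase_loader` and the placeholder modules are illustrative names, not the original training code.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_phase_loader(dataset, phase_labels, batch_size=100):
    """DataLoader in which every surgical phase is sampled with equal probability
    (assumed realisation via a weighted sampler)."""
    labels = torch.as_tensor(phase_labels)
    counts = torch.bincount(labels).float()
    weights = 1.0 / counts[labels]          # rarer phases receive larger sampling weights
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# Both networks are optimised with Adam at a learning rate of 3e-4.
# The modules below are placeholders standing in for the ResNet-50 extractor and the DQN network;
# the DQN additionally samples batches of 128 transitions from a replay memory of size 10000 per agent (not shown).
resnet50 = torch.nn.Linear(2048, 7)   # placeholder module
dqn = torch.nn.Linear(2048, 3)        # placeholder module
opt_resnet = torch.optim.Adam(resnet50.parameters(), lr=3e-4)
opt_dqn = torch.optim.Adam(dqn.parameters(), lr=3e-4)
```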
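The frame-based metrics described above can be computed as in the following sketch, which uses scikit-learn for illustration: macro-averaged precision and recall over phases, an F1-score derived from those two values as stated in the text, and micro-averaged (overall frame) accuracy.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score

def frame_metrics(y_true, y_pred):
    """Macro-averaged (per-phase) precision and recall, F1 computed from them,
    and micro-averaged frame accuracy."""
    precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
    recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    accuracy = accuracy_score(y_true, y_pred)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```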
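The event-based metrics rely on grouping consecutive frames with the same label into events; the sketch below shows that grouping and the event ratio. The direction of the ratio (ground-truth events over detected events) follows the reconstruction in the text and should be read as an assumption, and the full Ward categorisation is not reproduced here.

```python
from itertools import groupby

def events(labels):
    """Split a label sequence into events: (phase, start_index, end_index) for each
    maximal block of consecutive, equal phase labels."""
    out, start = [], 0
    for phase, block in groupby(labels):
        length = len(list(block))
        out.append((phase, start, start + length - 1))
        start += length
    return out

def event_ratio(gt_labels, pred_labels):
    """Egt / Edet: number of ground-truth events over number of detected events
    (assumed direction); values close to 1 indicate neither over- nor under-segmentation."""
    return len(events(gt_labels)) / len(events(pred_labels))
```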
