CHAPTER 7. Speaking skills 117

own subject and on the subject of the other student. This final task could not be prepared in advance, and the proficiency ratings were based mainly on the students’ performance during this task.

As the SOPA tasks had been validated for use with children aged 10-12 (Thompson et al., 2003), we conducted a pilot study with 26 students to validate our new tasks. The independent ratings of two raters were used to establish interrater agreement. The Intraclass Correlation Coefficient (single measures) for the two raters’ ratings was .767, which suggests high interrater agreement.

Then a new protocol was devised to obtain full rater agreement for each learner. The oral proficiency of each of the 26 students was rated again based on the videos. If the scores were the same, those scores were used. If not, the raters independently watched the video again and rescored it. Remaining differences were discussed until full agreement was reached. In the two cases in which full agreement was not reached, the average was taken as the final score. These new scores were used in the further analyses.

The four rating scales in the rubric, which represent four interdependent dimensions of oral proficiency, were examined by calculating Cronbach’s Alpha (which ranged from .904 to .958 when compared with the overall scores). The performances were further compared with other scores, such as reading, writing and overall class grades. Both internal and external comparisons suggested a high reliability of the four dimensions.

From the group of 26 students, the performances of six students (two with lower, two with medium and two with higher scores) were selected as benchmarks for the remainder of the study. To remain consistent over the years, raters were first trained extensively on these benchmarks before the new oral exams took place.
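For readers unfamiliar with the internal-consistency statistic mentioned above, the following is a minimal sketch of the standard Cronbach's Alpha formula, alpha = k/(k-1) * (1 - sum of scale variances / variance of total scores). The text does not specify exactly how alpha was computed "when compared with the overall scores", so this is the textbook formula under that assumption, not necessarily the authors' procedure. The function name `cronbach_alpha` and the example ratings are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Standard Cronbach's Alpha for a table of ratings.

    scores: list of rows, one row per student, one value per rating scale.
    Uses sample variances (n - 1 denominator).
    """
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least two students and two scales")
    k = len(scores[0])
    columns = list(zip(*scores))  # one tuple of ratings per scale
    sum_scale_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum_scale_variances / total_variance)


# Hypothetical ratings for six students on four rubric scales
# (illustrative only; NOT data from the pilot study).
example_ratings = [
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [6, 5, 6, 6],
    [3, 3, 3, 2],
    [7, 7, 6, 7],
    [5, 4, 5, 5],
]
print(round(cronbach_alpha(example_ratings), 3))
```

Values close to 1 indicate that the scales vary together across students, which is the sense in which the four dimensions are described as highly reliable.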
THE TESTING PROCEDURE

After the students’ work on the academic subjects was finished, four topics were selected for the final oral exams, and students were asked to form pairs for the test and divide the four topics between themselves. Thus, each student had two topics to prepare for the test. Just prior to test administration, the teacher randomly chose one of these two topics to be used during the test for each student.

The strict rating procedure devised in the pilot study was followed: Rater 1 was a French teacher at the same school, whom the students knew. Rater 2 was the group teacher (the same for all students included in this study), who interviewed the students. Rater 1 scored the performance during the exam. Rater 2 scored the exams independently a few hours later, based on the video recordings. If the scores were the same, those scores were used. If not, both raters independently watched the video again and rescored it. Remaining differences were discussed until full agreement was reached. In the few cases in which full agreement was not reached, the average was taken as the final score.
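The score-reconciliation steps described above can be summarised as a small decision procedure. The sketch below is only an illustration: the function name `reconcile` and its parameters are hypothetical, and it assumes that when no agreement is reached the average is taken over the raters' most recent pair of scores, which the text does not state explicitly.

```python
def reconcile(first_pass, second_pass=None, discussed=None):
    """Return a final score following the rating protocol.

    first_pass:  (rater1, rater2) scores from the exam and from the video.
    second_pass: (rater1, rater2) scores after both raters independently
                 re-watched the video, or None if that was not needed.
    discussed:   the score agreed on in discussion, or None if the raters
                 could not reach full agreement.
    """
    r1, r2 = first_pass
    if r1 == r2:
        return r1  # identical scores are used directly
    if second_pass is None:
        raise ValueError("scores differ; a second independent rating is required")
    r1, r2 = second_pass
    if r1 == r2:
        return r1  # agreement after re-watching the video
    if discussed is not None:
        return discussed  # agreement reached through discussion
    # Assumption: fall back to the mean of the most recent pair of scores.
    return (r1 + r2) / 2


# Hypothetical scores, for illustration only.
print(reconcile((5, 5)))
print(reconcile((4, 5), second_pass=(5, 5)))
print(reconcile((4, 5), second_pass=(4, 5), discussed=4))
print(reconcile((4, 5), second_pass=(4, 5)))
```

Each call corresponds to one branch of the procedure: immediate agreement, agreement after re-scoring, agreement after discussion, and the averaging fallback.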