Level Of Agreement Between Raters

The critical difference based on the ICC calculated for the study population was $\mathrm{Diff}_{T_1 - T_2} = 1.96 \cdot \sqrt{2 \cdot 10^2 \cdot (1 - 0.837)} = 11.199$. Since T scores are only calculated as whole numbers, this result means that, for the ELAN questionnaire, two ratings differ statistically at a significance level below α = 0.05 if the difference between them is equal to or greater than 12 T points.

For Krippendorff's alpha, the theoretical distribution is not known, not even asymptotically [28]. However, the empirical distribution can be obtained by the bootstrap approach. Krippendorff proposed a bootstrapping algorithm [28, 29] that is also implemented in Hayes' SAS and SPSS macro [28, 30]. The proposed algorithm differs from the one described for Fleiss' K above in three respects. First, the algorithm weights for the number of ratings per individual to account for missing values. Second, the random sample is not drawn from the N observations, where each observation contains the associated ratings of all raters. Instead, the random sample is drawn from the coincidence matrix that is needed to estimate Krippendorff's alpha (see additional file 1). This means that dependencies between raters are not taken into account. Third, Krippendorff keeps the expected disagreement fixed, and only the observed disagreement is recalculated in each bootstrap step. We conducted simulations for a sample size of N = 100 observations, which showed that the empirical and theoretical coverage probabilities differed considerably (average empirical coverage probability of 60%). We therefore chose to use the same bootstrap algorithm for Krippendorff's alpha as for Fleiss' K in our study (hereafter the standard approach).
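The critical difference quoted at the start of this section can be checked with a few lines of arithmetic. The sketch below assumes the usual T-score standard deviation of 10 and the reliability of 0.837 given in the text; the function name `critical_difference` is only illustrative. Direct evaluation gives roughly 11.19 rather than the reported 11.199, presumably because of rounding in the reported reliability, and either value rounds up to the same threshold of 12 T points.

```python
import math


def critical_difference(sd, reliability, z=1.96):
    """Smallest difference between two scores that exceeds measurement error.

    sd          -- standard deviation of the score scale (assumed 10 for T scores)
    reliability -- reliability estimate, here the ICC from the text
    z           -- two-sided normal quantile for alpha = 0.05
    """
    sem = sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    return z * math.sqrt(2.0) * sem          # error of a difference of two scores


diff = critical_difference(sd=10.0, reliability=0.837)
# T scores are whole numbers, so the threshold is the next integer above diff.
min_t_points = math.ceil(diff)
print(round(diff, 2), min_t_points)  # about 11.19, and 12
```

Writing the formula in terms of the standard error of measurement makes it easy to substitute a different reliability estimate and see how the threshold moves.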

The result is a vector of bootstrap estimates sorted by size, $\hat{K}_B = (\hat{K}_{[1]}, \ldots, \hat{K}_{[B]})$. Then the two-sided bootstrap confidence interval at confidence level 1 – α is defined by the empirical α/2 and 1 – α/2 percentiles of these estimates.

The IRR was assessed using a two-way, consistency, average-measures ICC (McGraw & Wong, 1996) to determine the degree to which coders provided consistent ratings of empathy across subjects.
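The standard bootstrap approach and its percentile interval can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes complete data (every rater rates every subject), categories coded 0 to k-1, and synthetic ratings generated purely for demonstration. The names `fleiss_kappa` and `bootstrap_ci` are hypothetical helpers. Whole observations, each carrying the ratings of all raters, are resampled with replacement, so dependencies between raters are preserved.

```python
import numpy as np


def fleiss_kappa(ratings, n_categories):
    """Fleiss' kappa for an (N subjects x n raters) array of category codes."""
    ratings = np.asarray(ratings)
    n_subjects, n_raters = ratings.shape
    # counts[i, j] = number of raters who put subject i into category j
    counts = np.stack(
        [(ratings == c).sum(axis=1) for c in range(n_categories)], axis=1
    )
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)  # category proportions
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_obs, p_exp = p_i.mean(), (p_j ** 2).sum()
    if p_exp == 1.0:  # only one category used: kappa is undefined
        return float("nan")
    return (p_obs - p_exp) / (1.0 - p_exp)


def bootstrap_ci(ratings, n_categories, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole observations."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    n_subjects = ratings.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_subjects, size=n_subjects)  # with replacement
        estimates[b] = fleiss_kappa(ratings[idx], n_categories)
    # Drop undefined replicates and sort the remaining estimates by size.
    estimates = np.sort(estimates[~np.isnan(estimates)])
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper, estimates


# Synthetic example data (an assumption, not from the study):
# 100 subjects, 4 raters, 3 categories, raters agree with a latent
# "true" category about 80% of the time.
rng = np.random.default_rng(42)
truth = rng.integers(0, 3, size=100)
noise = rng.integers(0, 3, size=(100, 4))
keep = rng.random((100, 4)) < 0.8
ratings = np.where(keep, truth[:, None], noise)

kappa = fleiss_kappa(ratings, 3)
lower, upper, estimates = bootstrap_ci(ratings, 3)
print(f"kappa = {kappa:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The same resampling loop could be reused for Krippendorff's alpha by swapping the estimator function, which is the sense in which the text applies one algorithm to both coefficients.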

