Table 2 Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 3)

From: Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist

Item no. Item N for % agreement (excluding articles with 1 rating)a % agreement N for kappa Kappa
Box A. Internal consistency (n = 195)b
A1 Does the scale consist of effect indicators, i.e. is it based on a reflective model? 185 82 193 0.06
Design requirements
A2c Was the percentage of missing items given? 183 87 190 0.48
A3c Was there a description of how missing items were handled? 180 90 187 0.54
A4 Was the sample size included in the internal consistency analysis adequate? 177 87 185 0.06d
A5c Was the unidimensionality of the scale checked? i.e. was factor analysis or IRT model applied? 180 92 187 0.69
A6 Was the sample size included in the unidimensionality analysis adequate? 166 79 178 0.27
A7 Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? 179 85 187 0.31d
A8c Were there any important flaws in the design or methods of the study? 174 86 179 0.22d
Statistical methods
A9 for Classical Test Theory (CTT): Was Cronbach's alpha calculated? 179 93 187 0.27d,e
A10 for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? 151 91 165 0.17d,e
A11 for IRT: Was a goodness of fit statistic at a global level calculated? e.g. χ², reliability coefficient of estimated latent trait value (index of (subject or item) separation) 154 93 167 0.46d,e
Box B. Reliability (n = 141)b
Design requirements
B1c Was the percentage of missing items given? 129 87 140 0.39
B2c Was there a description of how missing items were handled? 125 91 137 0.43d
B3 Was the sample size included in the analysis adequate? 127 77 139 0.35
B4c Were at least two measurements available? 129 98 140 0.72d
B5 Were the administrations independent? 129 73 139 0.18
B6c Was the time interval stated? 125 94 136 0.50d
B7 Were patients stable in the interim period on the construct to be measured? 126 75 138 0.24
B8 Was the time interval appropriate? 125 84 137 0.45
B9 Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions 127 83 138 0.30
B10c Were there any important flaws in the design or methods of the study? 117 77 129 0.08
Statistical methods
B11 for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? 119 86 133 0.59e
B12 for dichotomous/nominal/ordinal scores: Was kappa calculated? 111 81 127 0.32e
B13 for ordinal scores: Was a weighted kappa calculated? 111 83 127 0.42e
B14 for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic 108 81 124 0.35e
Box D. Content validity (n = 83)b
Design requirements
D1 Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? 62 79 83 0.33
D2 Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) 62 76 83 0.46
D3 Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) 62 66 83 0.21
D4 Was there an assessment of whether all items together comprehensively reflect the construct to be measured? 62 66 83 0.15
D5c Were there any important flaws in the design or methods of the study? 58 76 78 0.13
Box E. Structural validity (n = 118)b
E1 Does the scale consist of effect indicators, i.e. is it based on a reflective model? 99 78 116 0f
Design requirements
E2c Was the percentage of missing items given? 95 87 110 0.41
E3c Was there a description of how missing items were handled? 93 91 109 0.55
E4 Was the sample size included in the analysis adequate? 94 87 109 0.56d
E5c Were there any important flaws in the design or methods of the study? 89 84 103 0.27
Statistical methods
E6 for CTT: Was exploratory or confirmatory factor analysis performed? 92 90 106 0.51d,e
E7 for IRT: Were IRT tests for determining the (uni-)dimensionality of the items performed? 62 87 80 0.39e,f
Box F. Hypotheses testing (n = 170)b
Design requirements
F1c Was the percentage of missing items given? 158 87 168 0.41
F2c Was there a description of how missing items were handled? 159 92 169 0.60d
F3 Was the sample size included in the analysis adequate? 157 84 167 0.12d
F4 Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? 158 74 168 0.42
F5 Was the expected direction of correlations or mean differences included in the hypotheses? 159 75 169 0.26e
F6 Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? 159 82 168 0.29e
F7c for convergent validity: Was an adequate description provided of the comparator instrument(s)? 125 83 136 0.30
F8c for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? 124 81 135 0.35
F9c Were there any important flaws in the design or methods of the study? 131 81 145 0.17
Statistical methods
F10 Were design and statistical methods adequate for the hypotheses to be tested? 150 78 161 0.00d,e,f
Box G. Cross-cultural validity (n = 33)b
Design requirements
G1c Was the percentage of missing items given? 25 88 32 0.52
G2c Was there a description of how missing items were handled? 22 82 30 0.32
G3 Was the sample size included in the analysis adequate? 26 81 33 0.23
G4c Were both the original language in which the HR-PRO instrument was developed, and the language in which the HR-PRO instrument was translated described? 28 89 33 0.34d
G5c Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages 28 86 33 0.46
G6 Did the translators work independently from each other? 28 89 33 0.61
G7 Were items translated forward and backward? 28 100 33 1.00
G8c Was there an adequate description of how differences between the original and translated versions were resolved? 28 86 33 0.50
G9c Was the translation reviewed by a committee (e.g. original developers)? 25 88 31 0.56
G10c Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? 21 90 29 0.61
G11c Was the sample used in the pre-test adequately described? 28 79 32 0f
G12 Were the samples similar for all characteristics except language and/or cultural background? 26 81 31 0.41
G13c Were there any important flaws in the design or methods of the study? 26 85 31 0.42
Statistical methods
G14 for CTT: Was confirmatory factor analysis performed? 27 74 32 0.03e,f
G15 for IRT: Was differential item functioning (DIF) between language groups assessed? 13 77 23 0.28e,f
Box H. Criterion validity (n = 57)b
Design requirements
H1c Was the percentage of missing items given? 35 91 56 0.59d
H2c Was there a description of how missing items were handled? 35 97 56 0.79d
H3 Was the sample size included in the analysis adequate? 35 69 54 0.06
H4 Can the criterion used or employed be considered as a reasonable 'gold standard'? 37 62 57 0f
H5c Were there any important flaws in the design or methods of the study? 33 79 54 0.10
Statistical methods
H6 for continuous scores: Were correlations, or the area under the receiver operating characteristic curve calculated? 37 78 56 0.16e
H7 for dichotomous scores: Were sensitivity and specificity determined? 29 83 47 0.28e,f
Box I. Responsiveness (n = 79)b
Design requirements
I1c Was the percentage of missing items given? 71 82 76 0.14d
I2c Was there a description of how missing items were handled? 73 92 77 0.36d
I3 Was the sample size included in the analysis adequate? 72 72 76 0.40
I4c Was a longitudinal design with at least two measurements used? 73 100 78 1.00d
I5c Was the time interval stated? 73 89 78 0.25d
I6c If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? 72 78 75 0.17
I7c Was a proportion of the patients changed (i.e. improvement or deterioration)? 70 97 73 0.32d
Design requirements for hypotheses testing
For constructs for which a gold standard was not available
I8 Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? 65 69 72 0.35
I9 Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? 60 78 65 0.19e
I10 Was the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? 61 90 66 0.05d,e
I11c Was an adequate description provided of the comparator instrument(s)? 56 70 63 0f
I12c Were the measurement properties of the comparator instrument(s) adequately described? 56 80 63 0.06
I13c Were there any important flaws in the design or methods of the study? 63 71 68 0.03
Statistical methods
I14 Were design and statistical methods adequate for the hypotheses to be tested? 63 73 67 0.21e,f
Design requirements for comparison to a gold standard
For constructs for which a gold standard was available:
I15 Can the criterion for change be considered as a reasonable 'gold standard'? 21 67 28 0f
I16c Were there any important flaws in the design or methods of the study? 12 67 21 0f
Statistical methods
I17 for continuous scores: Were correlations between change scores, or the area under the receiver operating characteristic (ROC) curve calculated? 28 79 39 0.47e,f
I18 for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? 28 79 37 0.15e
Box J. Interpretability (n = 42)b
J1c Was the percentage of missing items given? 22 95 41 0.80
J2c Was there a description of how missing items were handled? 21 76 41 0.19
J3 Was the sample size included in the analysis adequate? 23 74 41 0f
J4c Was the distribution of the (total) scores in the study sample described? 23 74 41 0.08
J5c Was the percentage of the respondents who had the lowest possible (total) score described? 20 95 40 0.84
J6c Was the percentage of the respondents who had the highest possible (total) score described? 21 90 41 0.70
J7c Were scores and change scores (i.e. means and SD) presented for relevant (sub)groups? e.g. for normative groups, subgroups of patients, or the general population 21 76 41 0.05
J8c Was the minimal important change (MIC) or the minimal important difference (MID) determined? 19 89 40 0.26d
J9c Were there any important flaws in the design or methods of the study? 21 71 41 0f
a When calculating percentage agreement, articles that were scored only once on the particular item were not taken into account.
b Number of times a box was evaluated.
c Dichotomous item.
d Item with low dispersal, i.e. more than 75% of the raters who responded to the item chose the same response category.
e Combined kappa coefficient calculated because of the nominal response scale in a one-way design.
f Negative variance component in the calculation of kappa was set at 0.
g Sample sizes of the Generalisability box are much higher than those of other items, because scores of the items on the Generalisability box for all measurement properties were combined.
Bold print indicates kappa > 0.70 or % agreement > 80%.
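
Several items flagged for low dispersal combine a high percentage agreement with a low kappa (for example A4: 87% agreement, kappa 0.06). The sketch below illustrates why this can happen, using made-up ratings from two hypothetical raters and plain two-rater Cohen's kappa. It is a simplification for illustration only: the kappa coefficients in the table were obtained from variance components, not from this formula, and the function names and data are assumptions.

```python
from collections import Counter


def percentage_agreement(r1, r2):
    """Fraction of articles on which both raters chose the same category."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(r1)
    p_o = percentage_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Agreement expected by chance, from each rater's marginal frequencies.
    p_e = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / n ** 2
    if p_e == 1.0:
        return float("nan")  # both raters used one identical category only
    return (p_o - p_e) / (1 - p_e)


# Hypothetical item with low dispersal: each rater answers "yes" for
# 18 of 20 articles, but they never agree on which articles are "no".
rater_1 = ["yes"] * 18 + ["no"] * 2
rater_2 = ["yes"] * 16 + ["no"] * 2 + ["yes"] * 2

print(f"agreement = {percentage_agreement(rater_1, rater_2):.2f}")  # 0.80
print(f"kappa     = {cohen_kappa(rater_1, rater_2):.2f}")           # -0.11
```

Because almost all ratings fall in one category, the agreement expected by chance is already 0.82, so an observed agreement of 0.80 yields a kappa slightly below zero.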