Publications & Resources

Comparing Reliability Indices Obtained by Different Approaches for Performance Assessments

Aug 1995

Jamal Abedi, Eva L. Baker, and Howard Herl

Using performance assessment data from the CRESST content assessment project and a Monte Carlo data set, the researchers compared the results and robustness of several common reliability indices when statistical assumptions are violated. They found, for example, that the percent of exact agreement among raters is negatively related to the number of raters: the higher the number of raters, the lower the percent of exact agreement, other things being equal. In contrast, alpha and generalizability (G) coefficients show considerable increases as the number of raters increases. “The results,” said researcher Jamal Abedi, “suggest that the statistics proposed in the literature for estimating interrater reliability are all affected by some conditions in the study, for example, number of raters and sample size.”
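To make the reported pattern concrete, here is a minimal Monte Carlo sketch, not taken from the report, assuming a 5-point rating scale and normally distributed rater noise (all parameter values are illustrative). It computes the percent of exact agreement and Cronbach's alpha (treating raters as items) for increasing numbers of raters; under these assumptions exact agreement falls as raters are added while alpha rises, the same qualitative relationship the study describes for alpha and G coefficients.

```python
# Illustrative simulation only; parameters and design are assumptions,
# not the authors' Monte Carlo setup.
import numpy as np

rng = np.random.default_rng(0)

def simulate_ratings(n_examinees, n_raters, n_points=5, noise_sd=0.8):
    """Integer ratings on a 1..n_points scale: a true score plus rater noise."""
    true = rng.uniform(1, n_points, size=(n_examinees, 1))
    obs = true + rng.normal(0, noise_sd, size=(n_examinees, n_raters))
    return np.clip(np.rint(obs), 1, n_points)

def percent_exact_agreement(ratings):
    """Proportion of examinees on whom every rater gives the identical score."""
    return np.mean(np.all(ratings == ratings[:, [0]], axis=1))

def cronbach_alpha(ratings):
    """Alpha with raters as items: (k/(k-1)) * (1 - sum of rater variances / total variance)."""
    k = ratings.shape[1]
    rater_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - rater_var / total_var)

for n_raters in (2, 3, 5, 10):
    r = simulate_ratings(n_examinees=200, n_raters=n_raters)
    print(f"raters={n_raters:2d}  exact agreement={percent_exact_agreement(r):.2f}  "
          f"alpha={cronbach_alpha(r):.2f}")
```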

Abedi, J., Baker, E. L., & Herl, H. (1995). Comparing reliability indices obtained by different approaches for performance assessments (CSE Report 401). Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).