Sampling of Common Items: An Unrecognized Source of Error in Test Equating

Jul 2004

Michalis P. Michaelides and Edward H. Haertel

There is variability in the estimation of an equating transformation because common-item parameters are obtained from responses of samples of examinees. The most commonly used standard error of equating quantifies this source of sampling error, which decreases as the sample size of examinees used to derive the transformation increases. By similar reasoning, the common items that are embedded in test forms are also sampled from a larger pool of items that could potentially serve as common items. Thus, there is additional error variance due to the sampling of common items. Currently, common items are treated as fixed; the conventional standard error of equating captures only the variance due to the sampling of examinees.

In this study, a formula for quantifying the standard error due to the sampling of the common items is derived using the delta method and assuming that equating is carried out with the mean/sigma method. The analytic formula relies on the assumption of bivariate normality of the IRT difficulty parameter estimates. The derived standard error and a bootstrap approximation for the same quantity are calculated for a statewide assessment under both three- and one-parameter logistic IRT models; for the polytomous items, a graded response model is fitted. For the one-parameter logistic case, a small-sample bootstrap approximation to the standard error of equating due to the sampling of examinees is derived for comparison purposes.
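The idea of resampling common items can be illustrated with a minimal sketch. It is not the report's implementation: it assumes the standard mean/sigma linking coefficients (slope equal to the ratio of the standard deviations of the common-item difficulty estimates on the two forms, intercept chosen to match their means), and the function names `mean_sigma` and `bootstrap_item_se`, the number of replications, and the example difficulty values are all hypothetical.

```python
import numpy as np

def mean_sigma(b_x, b_y):
    """Mean/sigma linking coefficients (A, B) so that A * b_x + B is on the scale of b_y."""
    b_x = np.asarray(b_x, dtype=float)
    b_y = np.asarray(b_y, dtype=float)
    A = b_y.std(ddof=1) / b_x.std(ddof=1)
    B = b_y.mean() - A * b_x.mean()
    return A, B

def bootstrap_item_se(b_x, b_y, theta, n_boot=2000, seed=0):
    """Bootstrap SE of A * theta + B, resampling common items (as pairs) with replacement.

    Assumption for illustration: items are the sampling unit and the
    difficulty estimates themselves are treated as given.
    """
    rng = np.random.default_rng(seed)
    b_x = np.asarray(b_x, dtype=float)
    b_y = np.asarray(b_y, dtype=float)
    n = len(b_x)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n items with replacement
        if np.ptp(b_x[idx]) == 0:          # skip degenerate draws (zero spread)
            continue
        A, B = mean_sigma(b_x[idx], b_y[idx])
        values.append(A * theta + B)
    return float(np.std(values, ddof=1))

# Hypothetical difficulty estimates for six common items on two forms.
b_form_x = [-1.2, -0.5, 0.0, 0.4, 1.1, 1.8]
b_form_y = [-0.9, -0.6, 0.3, 0.2, 1.5, 1.7]
se_at_zero = bootstrap_item_se(b_form_x, b_form_y, theta=0.0)
```

Because only the set of common items is resampled, the resulting standard error does not shrink when more examinees are tested; it responds to the number of common items.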

There was some discrepancy between the analytic and the bootstrap approximation of the error due to the sampling of common items. Examination of the assumption of bivariate normality of the difficulty parameter estimates showed that the assumption did not hold for the data set analyzed. For simulated data drawn from a population that was distributed as bivariate normal, the two methods for estimating the error gave nearly identical results, confirming the correctness of the analytic approximation.

The comparison with the examinee-sampling standard error of equating revealed that the two sources of equating error were of about the same magnitude. In other words, the conventional standard error of the equating function reflects only about half the equating error variation. Numerical results demonstrated that for individual examinee scores the two equating errors comprised only a small proportion of the total error variance; measurement error was the largest component in individual score variability.

For group-level scores, however, the picture was different. Measurement error in score summaries shrinks as sample size increases. Examinee-sampling equating error also decreases as samples become larger. Error due to common-item sampling does not depend on the size of the examinee sample; it is affected by the number of common items used. It could therefore constitute the dominant source of error for summary scores. The random selection of common items should be acknowledged in the analysis of a test and the arising error variance calculated for proper reporting of score accuracy.
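The group-level argument can be written schematically. The decomposition below is an illustrative assumption, not a formula taken from the report: it treats the three error sources as independent and additive, with $N$ the number of examinees in the group, $n_c$ the number of common items, and all symbols introduced here for illustration only.

```latex
\sigma^2_{\text{total}}(N, n_c) \;\approx\;
  \underbrace{\frac{\sigma^2_{\text{meas}}}{N}}_{\text{measurement error}}
  \;+\;
  \underbrace{\sigma^2_{\text{eq,examinee}}(N)}_{\text{decreases as } N \text{ grows}}
  \;+\;
  \underbrace{\sigma^2_{\text{eq,item}}(n_c)}_{\text{independent of } N}
```

Under these assumptions the first two terms shrink as $N$ increases while the third stays fixed for a given $n_c$, which is why common-item sampling error can dominate for summary scores.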

Michaelides, M. P., & Haertel, E. H. (2004). Sampling of common items: An unrecognized source of error in test equating (CSE Report 636). Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).