In this post we review and compare the major test-centered and examinee-centered standard-setting methods.
The most common test-centered method is the Angoff method. Since its inception, numerous variations of the method have been developed; the most common features are described and compared here. Using the Angoff method, 10-12 judges are selected from the stakeholding groups (Angoff, 1971). These judges are asked to conceptualize a minimally qualified candidate and to estimate the probability that this minimally qualified candidate will answer a particular item correctly. The estimates are then averaged across judges for each item to obtain the estimated item difficulty for a minimally qualified candidate. The sum of these average item difficulties constitutes the standard.
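To make the computation concrete, the following Python sketch shows how a cut score might be derived from a small judge-by-item matrix of Angoff ratings. The ratings are invented for illustration and are not drawn from any of the studies cited here.

```python
# Angoff computation sketch: each judge estimates the probability that a
# minimally qualified candidate answers each item correctly. The per-item mean
# across judges is that item's estimated difficulty for the borderline
# candidate, and the sum of those means is the cut score in raw-score points.
ratings = [          # rows = judges, columns = items (hypothetical values)
    [0.60, 0.75, 0.40, 0.90],
    [0.55, 0.80, 0.45, 0.85],
    [0.65, 0.70, 0.50, 0.95],
]

n_judges = len(ratings)
n_items = len(ratings[0])

# Average each item's ratings across judges.
item_means = [sum(judge[i] for judge in ratings) / n_judges for i in range(n_items)]

# The standard is the sum of the average item difficulties.
cut_score = sum(item_means)

print([round(m, 2) for m in item_means])  # [0.6, 0.75, 0.45, 0.9]
print(round(cut_score, 2))                # 2.7 out of 4 raw-score points
```

On a full-length examination the same arithmetic simply runs over all items, and the resulting cut score is often reported as a percentage of the maximum raw score.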
The Angoff method has many advantages. Perhaps the greatest is its simplicity: the method is very easy to understand and compute (Berk, 1986). Since many judges may not have statistical or psychometric experience, this simplicity allows them to readily grasp the purpose and procedures of the method, and it is also useful when reporting the outcomes to stakeholding audiences. In a study designed to compare standard-setting methods, Plake (1995) found that judges preferred the Angoff method over other methods and had more confidence in a standard set using the Angoff method than in standards set using other methods.
The disadvantages of the Angoff method include the difficulty in uniformly conceptualizing a minimally qualified candidate, and the difficulty the judges have in predicting the performance of a minimally qualified candidate. Several authors have noted that “judges have the sense that they are pulling the probabilities from thin air” (Berk, 1986, p. 147). Other disadvantages include a lack of reliability in the standards across multiple settings and judges, and the potential of arriving at an unrealistic standard.
Since the Angoff method is the most popular standard-setting method, a description of its major variations, with their advantages and disadvantages, is warranted. Some of the major variations concern judge location, the type of judges used, the use of performance data, and group discussion.
Judge location. Norcini, Lipner, Langdon, and Strecker (1987) studied the consistency of overall standards set under three conditions: judges estimated item difficulties individually before an Angoff meeting, during an Angoff meeting, or individually after an Angoff meeting. The authors concluded that “the differences between the cutoff scores obtained in the three different conditions were relatively small” (p. 63) and that these approaches allow a standard to be set in a more efficient and timely manner. In an absolute sense, the differences between the standards were indeed small (59.8% to 63.5%). What the study does not report is the effect this difference of nearly four percentage points had on pass rates. Depending on the distribution of scores, a shift of this size in the cut score could make a large difference in pass rates.
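A small simulation illustrates the point. The score distribution below (normal, mean 65, standard deviation 8) is an assumption chosen purely for illustration, not data from Norcini et al. (1987); under it, the two reported cut scores produce pass rates that differ by well over ten percentage points.

```python
# Illustration of how a roughly 4-point shift in the cut score can change pass
# rates. The examinee score distribution is assumed (normal, mean 65, SD 8);
# only the two cut scores (59.8 and 63.5) come from Norcini et al. (1987).
import random

random.seed(0)
scores = [random.gauss(65, 8) for _ in range(10_000)]

for cut in (59.8, 63.5):
    pass_rate = sum(s >= cut for s in scores) / len(scores)
    print(f"cut = {cut:4.1f}  pass rate = {pass_rate:.1%}")
```

Under a flatter or more dispersed score distribution the same shift would matter less, which is exactly why the pass-rate consequences deserve to be reported alongside the cut scores themselves.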
Range of judges. Several authors have recommended the use of a broad range of judges during Angoff standard-setting procedures (Norcini, Shea, & Kanya, 1988; Plake, Melican, & Mills, 1991). The use of a broad range “provides evidence of the generalizability of recommended test standards, and divergence among judges of different background or position, when found, is informative to policymakers” (Busch & Jaeger, 1990, p. 161). A study by Norcini et al. (1988) examined the effect that including a broad range of judges had on the overall standard. The study used medical doctors who were experts only in specific areas of the exam content rather than across the exam as a whole. The authors concluded that the ratings of the doctors who were experts in a particular area were comparable to the ratings of those with only general knowledge of that area. The methodology had two potential flaws that may have affected these findings. First, the test questions were created by the judges; the process of developing and refining the questions may have made the judges experts across the entire range of content covered by the exam, or at least familiarized individual judges with the thought processes of the other judges. Second, the uniformity of the borderline group conceptualized by the judges is doubtful, considering the following description of the borderline group the judges were asked to envision:
In defining a borderline group, the raters were asked to think of all of the physicians they knew who practiced critical care medicine. Then they mentally excluded those who they felt were clearly not certifiable in critical care medicine and those who showed superior knowledge and skills. The remaining physicians, those about whom they were uncertain, made up the borderline group. (p. 59)
This definition did not foster a shared understanding of what constitutes a borderline qualified candidate; it is likely that the judges made their predictions based on very different conceptualizations of the borderline qualified test taker.
Performance data. Critics of the Angoff method have suggested that the resulting standard may be unrealistic because it is not linked to the actual test performance of examinees. To guard against this possibility, several modifications to the original Angoff method call for the use of performance data during the Angoff meeting. Judges are given the p-values (difficulty values) of the items to ensure that their ratings stay within the general trends exhibited by the examinees. Great care must be taken when providing judges with this type of information: each judge must understand that these are the item difficulties for the entire examinee group, not for the borderline group the judges are rating. Several studies have examined the effect of providing these data on the overall standards. Studies by Norcini et al. (1988) and Busch and Jaeger (1990) found similar results: the judgments made with the use of p-values more closely reflected the actual p-values, changes in mean ratings were minimal, and variability decreased. Norcini et al. also found that the majority of the changes made by the judges when given p-values during the second round of judging were on items with extreme performance data. Apparently judges override their original estimates more frequently for extreme values, thus improving the overall quality of the predictions.
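As an illustration of how such performance data might be put in front of judges, the sketch below flags items whose mean Angoff rating diverges sharply from the observed p-value so they can be revisited in a second round. The ratings, p-values, and the 0.25 threshold are invented for illustration and are not taken from the studies cited above.

```python
# Flag items whose mean Angoff rating diverges sharply from the observed
# p-value (proportion correct in the full examinee group). Such items might be
# singled out for discussion before a second round of ratings.
mean_angoff_ratings = [0.60, 0.75, 0.45, 0.90]  # mean rating per item across judges
observed_p_values   = [0.55, 0.40, 0.50, 0.95]  # proportion correct, whole group

THRESHOLD = 0.25  # size of discrepancy that triggers a flag (arbitrary choice)

flagged = [
    i for i, (rating, p) in enumerate(zip(mean_angoff_ratings, observed_p_values))
    if abs(rating - p) > THRESHOLD
]
print(flagged)  # [1] -- the second item is rated 0.75 but only 40% answered it correctly
```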
Nedelsky Method
Instead of requiring the judges to estimate the difficulty of each item directly, the Nedelsky method has judges eliminate the answer options that a minimally qualified candidate would clearly be able to rule out. Because the judges are eliminating answer options, the Nedelsky method can only be used with multiple-choice questions. The item difficulty is computed as one divided by the number of options that remain. For example, if the judges determined that on a four-option multiple-choice question a minimally qualified candidate could eliminate two options, the item difficulty would be one half, or .50, for minimally qualified candidates. The item difficulty values for all items on the test are summed to determine the standard.
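The arithmetic can be sketched as follows; the option counts and eliminated-option counts below are hypothetical and are treated as already agreed upon by the judges.

```python
# Nedelsky computation sketch: for each multiple-choice item, the judges decide
# how many options a minimally qualified candidate could eliminate. The item
# difficulty is 1 / (options remaining), and the cut score is the sum.
items = [
    {"n_options": 4, "eliminated": 2},  # 1 / (4 - 2) = 0.50
    {"n_options": 4, "eliminated": 1},  # 1 / (4 - 1) ≈ 0.33
    {"n_options": 5, "eliminated": 3},  # 1 / (5 - 3) = 0.50
]

item_difficulties = [1 / (it["n_options"] - it["eliminated"]) for it in items]
cut_score = sum(item_difficulties)

print([round(d, 2) for d in item_difficulties])  # [0.5, 0.33, 0.5]
print(round(cut_score, 2))                       # 1.33
```

Note that each item difficulty can only take one of a handful of values (1.0, .50, .33, or .25 for a four-option item), which anticipates the precision problem discussed below.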
The Nedelsky method shares most of the advantages discussed for the Angoff method, including ease of administration, ease of explanation, and a simple analysis. Because the Nedelsky method forces the judges to consider the difficulty of each answer option, judges using it may review each question in greater detail than Angoff judges do (Mills & Melican, 1988).
There are several disadvantages to this method. First, as mentioned above, it can only be used with multiple-choice questions, which limits the format and range of questions that can be included on a test. Also, judges have struggled to use this method effectively for some types of items, including negatively worded items (Mills & Melican, 1988). Finally, the item difficulties are not free to range across the entire potential range of difficulties; they are constrained by the number of options not eliminated by the judges (Mills & Melican, 1988), which likely causes a lack of precision.
Ebel Method
The Ebel method is probably the most difficult and complex of the test-centered methods. Judges using this method are required to sort each item by its difficulty and relevance, with generally three levels of difficulty and four levels of relevance. The judges then estimate the proportion of items in each category that a minimally qualified candidate would answer correctly. The standard is established by multiplying the number of items in each category by that expected proportion and summing across the categories (Kane, 1994).
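A small sketch of this computation, using hypothetical category counts, proportions, and labels:

```python
# Ebel computation sketch: items are sorted into difficulty-by-relevance
# categories; judges supply, for each category, the proportion of its items a
# minimally qualified candidate should answer correctly. The cut score is the
# sum over categories of (number of items) * (expected proportion correct).
categories = [
    # (difficulty, relevance, n_items, expected_proportion_correct) -- hypothetical
    ("easy",   "essential",  10, 0.90),
    ("medium", "essential",   8, 0.70),
    ("hard",   "important",   6, 0.50),
    ("hard",   "acceptable",  4, 0.30),
]

cut_score = sum(n_items * proportion for _, _, n_items, proportion in categories)
print(round(cut_score, 1))  # 18.8 raw-score points out of 28 items
```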
An advantage of the Ebel method is that each item is judged on both its difficulty and its relevance (Mills & Melican, 1988), which allows items more important to the test to receive a higher weighting than less important items. The method also lets the judges rate the performance of groups of items, which may be an easier task than rating individual items (Mills & Melican). However, since the judges are required to make multiple types of ratings, the method may be more complex and time-consuming than the other test-centered methods; according to Mills and Melican, this may also affect the ease of training and explanation and the collection and analysis of the data.
Contrasting Groups Method
Cutscores set using the contrasting groups method require judges to separate a sample of examinees into two groups, experts and non-experts. The examinees are divided into these groups on the basis of criteria other than test performance. The standard is established at the test score that produces the lowest number of false-positive and false-negative classifications.
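A minimal sketch of this decision rule, using invented score data for the two judge-classified groups:

```python
# Contrasting groups sketch: judges classify examinees as expert or non-expert
# on grounds other than the test. The cut score is the score that minimizes
# misclassifications: non-experts who would pass plus experts who would fail.
expert_scores     = [72, 75, 78, 80, 83, 85, 88]   # hypothetical test scores
non_expert_scores = [55, 60, 63, 66, 70, 74, 77]

def misclassifications(cut):
    false_positives = sum(s >= cut for s in non_expert_scores)  # non-experts passing
    false_negatives = sum(s < cut for s in expert_scores)       # experts failing
    return false_positives + false_negatives

candidate_cuts = sorted(set(expert_scores + non_expert_scores))
best_cut = min(candidate_cuts, key=misclassifications)  # ties go to the lowest score
print(best_cut, misclassifications(best_cut))  # 72, with 2 misclassifications
```

In practice the two misclassification types are sometimes weighted unequally (a false positive may be judged costlier than a false negative on a licensure exam), but the basic logic is the same.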
An advantage of this method is that it may be easier for judges to classify examinees into the extreme categories of performance than to estimate item performance or to classify examinees into borderline groups (Mills & Melican, 1988). There are several significant disadvantages. First, since the examinees have to be judged on a criterion other than exam performance, the potential range of judges is often limited to teachers and others well acquainted with the examinees, and this knowledge of the examinees may bias the group assignments the judges make. Second, “research has shown that if the test score data are normally distributed and the number of above standard and below standard students are very different, the passing standard resulting from the contrasting groups procedure is biased in the direction of whichever group is smaller” (Sireci et al., 1999, p. 303).
Borderline Groups Method
The borderline groups method also uses judges to categorize examinees. Using this method, the judges are required to identify examinees who are borderline, or marginally qualified to pass the exam, on the basis of criteria other than test performance. Once these examinees are identified, the median of the borderline group’s test scores is used as the standard.
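The computation itself is simple; the sketch below uses invented scores for a judge-identified borderline group.

```python
# Borderline group sketch: judges identify examinees they consider marginally
# qualified, on grounds other than the test. The cut score is the median test
# score of that group.
from statistics import median

borderline_scores = [61, 64, 66, 68, 70, 73, 75]  # hypothetical scores
cut_score = median(borderline_scores)
print(cut_score)  # 68
```

Using the median rather than the mean keeps the cut score from being pulled by a few atypical scores in what is usually a small, judge-selected group.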
As with the contrasting groups method, an advantage of this method is that rating students or examinees may be a more familiar and easier task for most judges (Mills & Melican, 1988). The disadvantages are that judges still have to conceptualize and define a borderline or marginally qualified candidate, and there is no way to verify whether the judges’ conceptualizations are accurate. Both the borderline group and contrasting groups methods are also subject to sampling variability.
Comparisons Between Methods
A major criticism of standard-setting methodology is the lack of agreement between methods: the standard set for an exam may vary greatly depending on which method is chosen to establish it. Several authors have documented the results of studies comparing the methods and their resulting standards. In a meta-analytic-type study, Chang (1999) located 40 studies comparing the Nedelsky and Angoff methods and reported that in 80% of these comparisons the Angoff method produced higher standards than the Nedelsky method. Livingston and Zieky (1989) reviewed the earlier literature and found that the Ebel method produced higher standards than the borderline groups, contrasting groups, Nedelsky, or Angoff methods, and that the borderline groups method produced higher standards than the Angoff or contrasting groups methods. In agreement with Chang’s findings, Livingston and Zieky’s review also found that the Nedelsky method produced standards lower than those produced by the Angoff, contrasting groups, and Ebel methods. Other reviews have echoed these trends (Behuniak et al., 1982).
REFERENCES
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Behuniak, P., Archambault, F. X., & Gable, R. K. (1982). Angoff and Nedelsky standard-setting procedures: Implications for the validity of proficiency test score interpretation. Educational and Psychological Measurement, 42, 247-255.
Berk, R. A. (1986). A consumer’s guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172.
Busch, J. C., & Jaeger, R. M. (1990). Influence of type of judge, normative information, and discussion on standards recommended for the National Teachers Examinations. Journal of Educational Measurement, 27(2), 145-163.
Chang, L. (1999). Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Applied Measurement in Education, 12(2), 151-165.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.
Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Features of selected methods. Applied Measurement in Education, 1(3), 261-275.
Norcini, J. J., Lipner, R. S., Langdon, L. O., & Strecker, C. A. (1987). A comparison of three variations on a standard-setting method. Journal of Educational Measurement, 24(1), 56-64.
Norcini, J. J., Shea, J. A., & Kanya, D. T. (1988). The effect of various factors on standard-setting. Journal of Educational Measurement, 25(1), 57-65.
Plake, B. S. (1995). An integration and reprise: What we think we have learned. Applied Measurement in Education, 8(1), 85-92.
Plake, B. S., Melican, G. J., & Mills, C. N. (1991). Factors influencing intrajudge consistency during standard-setting. Educational Measurement: Issues and Practice, 10(1), 15-16, 22.
Sireci, S. G., Robin, F., & Patelis, T. (1999). Using cluster analysis to facilitate standard-setting. Applied Measurement in Education, 12(3), 301-325.