Selecting, Training, and Assessing Standard-Setting Judges

In this post we will cover how to select, train, and assess standard-setting judges.

Selection of Judges

Since all of the standard-setting methods require some form of subjective judgment, the selection of judges is important to every type of method. Many authors advocate the use of a wide range of judges in the standard-setting process (Norcini et al., 1988). The cutscore will be more acceptable to the various stakeholders affected by the test if they have had a voice in the standard-setting process. The need to include numerous stakeholder groups is especially important when the standards address politically charged issues (Shepard, 1980). Beyond the political reasons, many test content areas are broad enough that it may be difficult to find people who are “experts” in the entire breadth of the test (Norcini et al., 1988). According to Norcini et al., “the domain of knowledge in most professions is too broad to be mastered by any single individual” (p. 60). Kane (1994) stated that “this interest in having broad participation in the standard-setting process may be in conflict with the requirement that the judges be qualified to make the kind of decision they are being asked to make, and, therefore, a judicious trade-off may be called for” (p. 441). As a solution, Kane suggested that judges be asked to rate only the items for which they have at least a minimal level of expertise (Kane, 1994). Several articles have also shown the importance of using a reasonable number of judges (Jaeger, 1991; Norcini, Shea, & Grosso, 1991); too many or too few judges can have an adverse effect on the ratings and their reliability.

Identifying Problematic Judges

In addition to selecting appropriate judges, the method should have a process for dealing with problematic judges. There are multiple reasons a judge might be considered problematic: the judge may not have understood the process, may have had a personal stake in the outcome, or may not truly have been the expert he or she was supposed to be (Geisinger, 1991). Strategies for handling these types of judges range from reconvening a new panel to removing selected predictions made by the judges (Geisinger, 1991). Clearly, actions such as these need to be spelled out beforehand, with accompanying reasoning, in order to avoid unethical or capricious actions. Two models have been proposed to assess the effectiveness of judgments formulated during standard-setting exercises: intrajudge consistency and consensus.

Intrajudge Consistency

One of the most common methods under this model is the calculation of the correlation between a judge’s item difficulty ratings and the actual item difficulties observed on the exam. Judges with a high correlation are considered good judges, since their ratings are more consistent with the actual test data (Friedman & Ho, 1990). This method requires that the actual difficulty of each item be available prior to the standard-setting session. For many studies, actual item statistics are not available because the measure has not yet been administered to candidates.

Consensus

This model was derived from the anthropological literature, and few techniques have been developed for use in standard-setting (Jaeger, 1988). The model assumes that “better informants (judges) will correlate higher with each other and with the aggregate total and hence be closer to the truth” (Weller, 1984, p. 967). This idea is best illustrated by an example provided by Romney, Weller, and Batchelder (1986). They stated that, given a set of questions regarding tennis and two sets of informants, tennis players and non-tennis players, “we would expect that the tennis players would agree more among themselves as to the answers to questions than would the non-tennis players. Players with complete knowledge about the game would answer questions correctly with identical answers or maximum consensus, while players with little knowledge of the game would not” (p. 316).
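The tennis intuition can be reduced to a small numeric sketch: judges who share the relevant knowledge converge on the same answers, so their agreement with the aggregate is high. This is only an illustration of the idea with invented data, not Romney, Weller, and Batchelder's full cultural-consensus analysis (which correlates informants with one another and weights their answers accordingly):

```python
# Toy illustration of the consensus model: agreement with the
# aggregate (majority) answer pattern serves as a proxy for how
# well informed each judge is. Data are invented.
answers = {
    "J1": [1, 1, 0, 1, 0],
    "J2": [1, 1, 0, 1, 0],
    "J3": [1, 0, 0, 1, 0],
    "J4": [0, 0, 1, 0, 1],  # the "non-player" of the tennis example
}

n_items = 5
# Aggregate (majority) answer for each question; ties round up to 1.
majority = [
    1 if sum(a[i] for a in answers.values()) >= len(answers) / 2 else 0
    for i in range(n_items)
]

# Better-informed judges should track the aggregate more closely.
for judge, a in answers.items():
    agreement = sum(x == m for x, m in zip(a, majority)) / n_items
    print(f"{judge}: {agreement:.0%} agreement with the aggregate")
```

Judges J1 and J2 agree perfectly with the aggregate and with each other, like the knowledgeable tennis players, while J4's pattern diverges on every item.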

Based on this model, Jaeger (1988) developed a technique he titled the “Modified Caution Index.” This technique allows the identification of judges “whose patterns of item judgments were aberrant when compared with the pattern produced by the group of judges as a whole” (p. 19). Judges whose response patterns differ from the group’s may not possess the knowledge, skills, or understanding possessed by the group, and may make poor judgments. Jaeger recommended that, before the final cutscore is set, the effect on the cutscore of the judgments provided by judges who differ from the group be assessed. Computing the Modified Caution Index requires the cutscore-setting technique developed by Jaeger, in which the judges indicate whether every person taking the test should be able to answer each item correctly. The items are thus scored dichotomously, and each judge’s recommended standard is the sum of all of his or her positive responses (Jaeger, 1988). Since not all standard-setting methods can be scored dichotomously, or use the Jaeger technique, the Modified Caution Index is limited in its application.
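The dichotomous scoring step, and the recommendation to check how a group-divergent judge affects the cutscore, can be sketched as follows. The z-score screen below is a simplified stand-in for illustration only, not the actual Modified Caution Index computation, and all ratings and thresholds are invented:

```python
# Sketch of Jaeger's dichotomous rating step plus a simplified
# aberrance screen (NOT the real Modified Caution Index formula).
from statistics import mean, pstdev

# Invented yes/no (1/0) judgments: should every person taking the
# test be able to answer this item correctly?
ratings = {
    "J1": [1, 1, 0, 1, 1, 0, 1, 1],
    "J2": [1, 1, 0, 1, 0, 0, 1, 1],
    "J3": [1, 0, 0, 1, 1, 0, 1, 1],
    "J4": [0, 0, 1, 0, 0, 1, 0, 0],  # pattern at odds with the group
}

# Each judge's recommended standard: the sum of positive responses.
standards = {judge: sum(r) for judge, r in ratings.items()}
group_mean = mean(standards.values())
group_sd = pstdev(standards.values())

for judge, s in standards.items():
    z = (s - group_mean) / group_sd
    # Arbitrary illustrative threshold for "different from the group".
    flag = "check against group" if abs(z) > 1.5 else "ok"
    print(f"{judge}: standard = {s}, z = {z:+.2f} ({flag})")

# Jaeger's recommendation: assess the effect of divergent judges
# on the cutscore before it is finalized.
print("cutscore with all judges:", group_mean)
print("cutscore without J4:", mean(s for j, s in standards.items() if j != "J4"))
```

Comparing the cutscore with and without the flagged judge makes the impact of that judge's divergent pattern explicit before any decision to retain or exclude the ratings, a decision that, as noted above, should be governed by rules spelled out in advance.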

Training of Judges

The overall quality of the standard can be greatly affected by the training the judges receive in the procedure. Since, at some level, all methodologies involve an element of subjectivity or judgment, judges need to be well trained in the process in which they will participate, both to minimize the effect of subjectivity and to allow replication.

Since most standard-setting studies provide very few details regarding the training of judges, it can be difficult to assess the quality of the training and of the resulting standard (Reid, 1991). Standard components of good training include an in-depth description of the purpose and outcomes of the standard-setting meeting, an in-depth discussion of the target group to be rated, practice completing the judgment tasks, and performance feedback. According to Reid, ratings from well-trained judges should be stable over time, consistent with the actual performance of test takers, and reflective of realistic expectations (p. 13).

A frequently overlooked part of the training is the discussion about the target group. Far too frequently judges are left to create their own conceptualizations of the minimally qualified candidate, or the definition provided is vague and allows for individual interpretation. An example of a vague definition of a minimally qualified candidate would be “a minimally qualified candidate for this test is someone with the skills to function in society.” Each judge would likely have a different conception of the skills of a minimally qualified candidate depending on the judge’s view of what skills are needed to function in society. As part of the training provided, a clear definition of the minimally qualified candidate must be established. One potential method for accomplishing this is to take each major content area within the test and discuss what a minimally qualified candidate would know in each of these content areas (Mills et al., 1991). Following this discussion each judge should have a list of the knowledge level and attributes of a minimally qualified candidate to use as reference while rating the items.

Group Discussions

Several methods use group discussion of ratings as part of the rating process. Group discussion lets judges hear the reasons and rationale of the other judges; since initial judgments may be quickly conceived, this allows judges to consider information not previously considered and then alter their initial judgments. According to Busch and Jaeger (1990), “judges are likely to be more reflective and to incorporate a larger range of pertinent information when they are given more than one opportunity to consider their standard-setting recommendations” (p. 148). Several studies have also found that group discussion can reduce the variability of test standards (Busch & Jaeger, 1990).

Fitzpatrick (1989) identified several potential problems with group discussions. First, when judges favor one side of an issue before the group discussion, they may alter their initial ratings afterward to favor their initial opinions even more strongly; in effect, the group discussion may serve to further polarize opinions. Second, during group discussion judges may engage in social comparison, and a highly influential judge may bias the ratings in the direction he or she advocates. Fitzpatrick summarized this finding by stating, “Thus, it seems that standard-setting procedures should be designed to both minimize the effects of social comparison and maximize the effects of certain informational influences on the decisions to be made” (p. 322).

In order to maximize the informational influences while limiting the negative effects, Fitzpatrick recommended:

  1. Having the judges discuss their opinions without revealing the individual ratings given to the test question.
  2. If possible, have the judges record their opinions privately and then later share them with the group in a controlled manner.
  3. Since polarization is more likely to occur for subjective questions as opposed to objective questions, judges may be less likely to polarize their opinions if they are given item performance data to guide their predictions (pp. 323-324).

References

Busch J. C., & Jaeger, R. M. (1990). Influence of type of judge, normative information, and discussion on standards recommended for the National Teachers Examinations. Journal of Educational Measurement, 27(2), 145-163.

Fitzpatrick, A. R. (1989). Social influences in standard-setting: The effects of social interaction on group judgments. Review of Educational Research, 59(3), 315-328.

Friedman, C. B., & Ho, K. T. (1990, April). Interjudge consensus and intrajudge consistency: Is it possible to have both in standard-setting? (Report No. TM-015278). Paper presented at the Annual Meeting of the National Council on Measurement in Education, Boston, MA. (ERIC Document Reproduction Service No. ED 322-164)

Geisinger, K. F. (1991). Using standard-setting data to establish cutoff scores. Educational Measurement: Issues and Practice, 10(1), 17-19.

Jaeger, R. M. (1988). Use and effect of caution indices in detecting aberrant patterns of standard-setting judgments. Applied Measurement in Education, 1(1), 17-31.

Jaeger, R. M. (1991). Selection of judges for standard-setting. Educational Measurement: Issues and Practice, 10(1), 3-6, 10, 14.

Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.

Mills, C. N., Melican, G. J., & Ahluwalia, N. T. (1991). Defining minimal competence. Educational Measurement: Issues and Practice, 10(1), 7-10.

Norcini, J. J., Shea, J. A., & Kanya, D. T. (1988). The effect of various factors on standard-setting. Journal of Educational Measurement, 25(1), 57-65.

Norcini, J. J., Shea, J. A., & Grosso, L. (1991). The effect of numbers of experts and common items on cutting score equivalents based on expert judgment. Applied Psychological Measurement, 15(3), 241-246.

Reid, J. B. (1991). Training judges to generate standard-setting data. Educational Measurement: Issues and Practice, 10(1), 11-14.

Romney, A. K., Weller, S. C., & Batchelder, W. H. (1986). Culture as consensus: A theory of culture and informant accuracy. American Anthropologist, 88, 313-338.

Shepard, L. A. (1980). Standard-setting issues and methods. Applied Psychological Measurement, 4(4), 447-467.