We tested that theory by comparing data from a simple category-based rating system against data from a standards-based Work Planning and Review appraisal system, using over 248,000 performance appraisals of state employees. Using logistic regression and statistical definitions of prima facie discrimination, we found no support for the hypothesis that adverse impact is materially affected by criterion specificity. © 2003 Wiley Periodicals, Inc.

Introduction

Performance appraisal is often at the center of equal employment opportunity litigation involving promotions and terminations (Bernardin & Tyler, 2001).
The recent heavily publicized Title VII cases against Coca-Cola, the Ford Motor Company, Boeing, Texaco, Circuit City, Publix Super Markets, Motel 6, Smith Barney, Home Depot, and Wendy’s, for example, all concerned (in part) the presentation of adverse-impact statistics related to promotions and a scrutiny of the manner in which performance appraisal was conducted in the organization as related to those promotions or terminations (Malls, 1996; Emerson, 1997; Villanova, Austin, & Bernardin, 2002).
The complaint in Abdallah et al. v. Coca-Cola (1999), for example, stated, “Coca-Cola Company utilizes employee evaluations … that treat African-American salaried employees less favorably than similarly situated employees outside the protected group.” Reports submitted for class certification in the case included the calculation of adverse-impact statistics and expert testimony regarding the process of evaluating employees.
Susan Fiske submitted an expert report on behalf of the plaintiffs proclaiming that “At Home Depot, the decision-making criteria are decentralized, unspecified, vague, discretionary, not public, and not validated…” and that “such criteria specifically perpetuate stereotypes” (Fiske, 1997, p. 24). The basis for class certification in EEO lawsuits often involves the presentation of adverse-impact statistics, expert testimony related to the appraisal system, and anecdotal evidence of discrimination.
Class certification substantially increases the defendants’ exposure to liability and the motivation to settle claims that may have little or no merit. This so-called in terrorem effect of certification has been successful for plaintiffs, who have been the beneficiaries of large out-of-court settlements against some of the largest U.S. companies, which has in turn increased the rate of petitions for certification in other EEO cases (Babbles, 2002). The Coca-Cola and Ford Motor Company settlements are two recent examples.
The strategy on the part of the plaintiffs regarding performance appraisal in such cases is to foster an inference that the ambiguity in the performance-rating criteria either directly or indirectly caused the adverse impact and thus the illegal discrimination in the personnel decisions. For example, Circuit City lost a Title VII case based at least to some extent on adverse-impact statistics regarding promotion decisions and expert criticism of its “highly subjective” appraisal system.
The plaintiff’s expert in this case rendered the opinion that among the “kinds of practices that could be expected to result in discrimination include… no weighting factors to consider in making promotion decisions and no unweighted and unanchored performance dimensions for the appraisal process.” (McKnight v. Circuit City Stores, Inc., 1997, p. 3). The same “subjectivity theory” is also proffered in disparate treatment cases of discrimination to buttress claims of intentional discrimination (Kane et al., 1998). The Supreme Court held in Watson v.
Fort Worth Bank and Trust (1988) that the plaintiff’s burden in establishing prima facie discrimination “… goes beyond the need to show that there are statistical disparities in the employer’s work force; plaintiff must identify specific employment practices allegedly responsible for observed statistical disparities and prove causation.” Since this 1988 decision, there has been a great increase in the use of statistics and expert testimony to argue a causal connection between the subjectivity in the appraisal process and deleterious outcomes to support individual claims of discrimination (Villanova et al., 2002; Emerson, 1997). The presentation of adverse-impact statistics along with expert opinion criticizing the subjectivity of the decision-making process and some anecdotal evidence of discrimination is also often the basis of successful petitions for class certification despite the Supreme Court’s emphasis on the “commonality” and “predominance” requirements of Rule 23 of the Federal Rules of Civil Procedure in General Telephone v.
Falcon (1982) and Amchem Products, Inc. v. Windsor (1997). The “commonality” requirement stipulates that specific questions of law or fact must be common to members of a class and that such questions must predominate over any questions affecting individual members.
Footnote 15 in General Telephone, often cited to support the “commonality” argument for certification, states that “significant proof that an employer operated under a general policy of discrimination conceivably could justify a class of both applicants and employees if the discrimination manifested itself in hiring and promotion practices in the same general fashion, such as through entirely subjective decision making processes.” Expert testimony to support plaintiffs’ claims espouses the theory that defective performance appraisal systems foster the common “entirely subjective” processes necessary for certification.

“Subjectivity” in performance appraisal is obviously a matter of degree. Truly objective performance criteria do exist, usually in the form of independently counted units of tangible output produced in specified periods of time. Their focus is almost exclusively on the quantity of work performed. For the majority of jobs, however, the output of work is either intangible or incremental, or the quality of the work is not easily measured in discrete units.
Qualitative dimensions of work performance can be very difficult to measure in truly objective terms. By their very nature, these qualitative dimensions must be assessed using subjective criteria. Therefore, most “more objective” performance appraisals, in the sense plaintiffs’ attorneys refer to such, simply use some form of anchoring or standards to provide additional structure to the act of qualitative evaluation, without otherwise materially changing the pattern of observation, judging, and reporting by the supervisor.
The implication of the “subjectivity” theory espoused by plaintiffs in discrimination cases is that if the company had done appraisal using more specific, precise, or objective criteria and/or had employed other “best practices” related to performance appraisal, the discrimination, operationally defined by the adverse-impact statistics, would not have occurred. As the theory goes, at least a less disproportionate number of decisions deleterious to the plaintiffs would have been made.
For example, the theory may imply that the 80% rule would not have been violated or that a greater proportion of protected class members would have been promoted were it not for the relatively more subjective criteria used to make decisions about personnel. This was precisely the argument made by the plaintiffs in the recently settled race discrimination case against Coca-Cola and the gender discrimination case against Home Depot. It is clear that such testimony has an impact on the outcome of EEO cases and out-of-court settlements (e.g., Bernardin & Tyler, 2001; Bernardin & Cascio, 1988; Cascio & Bernardin, 1980; Feild & Holley, 1982; congener & post, 2000; Mallory, 1998; McEvoy & Beck-Dudley, 1991; Ritchie & Life, 1994). The problem, however, is that the limited published research relevant to the general expert theme, that the rating instrument or the specificity of the criterion measures is related to negative outcomes for protected class members, does not support this definitive view (Powell & Butterfield, 1997).
Bernardin et al. (1995) found little research supporting the notion that criterion specificity will result in greater accuracy and less rating bias (e.g., Walden & Viola, 1986) with consequent reductions in adverse impact. It is not unreasonable to assume that a racist, sexist, or ageist rater is more likely to manifest these tendencies in personnel decisions when the performance criteria that are supposed to be the basis of these decisions are relatively more ambiguous.
Substantial research effort has been directed at the specific issue of isolating the effect of race-based rater bias, and most of it has concluded that raters tend to rate persons of their own race higher. Kraiger and Ford’s (1985) meta-analysis summarized a large amount of the research on this issue, as did Pulakos, White, Oppler, and Borman’s large-scale study of military ratings (1989) and the re-analysis of the same data by Oppler, Campbell, Pulakos, and Borman (1992).
The analyses of the military ratings revealed significant interactions between rater and ratee race when variance in nonrating measures was removed from the ratings. Sackett and DuBois (1991) challenged these results based on re-analysis of the Pulakos et al. data to isolate supervisor-subordinate ratings from peer ratings, and with the inclusion of additional data. Their conclusion when considering only supervisor ratings was “Black ratees consistently received lower ratings than White ratees from both White and Black raters.
Also notable is the fact that White and Black raters differed very little in their ratings of White ratees but differed much more in their ratings of Black ratees.” Sackett and DuBois were also able to identify a substantial number of instances in which the same individual was rated by both Black and White raters, and therefore were able to conduct within-subject analyses. These were significant in that their results were virtually identical to the analyses in which random assignment of raters had to be assumed.
In all cases here, however, the rating format itself was constant, which must lead to the conclusion that any effect found was not due to the rating format. In fact, there has been little research that has directly assessed the extent to which rating content, format, or criterion specificity moderated the statistical relationship between ratee race, gender, or age and personnel decisions. In the only related meta-analysis on the subject, Ford, Kraiger, and Schechtman (1986) reported nearly identical correlations between race and performance for objective criteria (e.g., absenteeism, productivity) and subjective criteria (e.g., ratings). They concluded, “the relatively high degree of consistency in overall effect sizes found across multiple criterion measures suggests that the race effects found in subjective ratings cannot be solely attributed to rater bias” (p. 334). However, in their comparison of the effect sizes for subjective ratings versus objective indices of performance (e.g., productivity, customer complaints), significantly stronger race effects were found for the subjective ratings.
Their meta-analysis does not compare differing subjective appraisal systems for race effects, which is the focus of the present work. Other research has actually concluded that the greatest differences between African-Americans and Whites occur on relatively more objective performance measures such as knowledge tests and work samples (Bernardin, 1984; Oppler, Campbell, Pulakos, & Borman, 1992). Bernardin et al. (1995) found only four studies that specifically addressed the criterion issue using real performance appraisal data, all of which were small-sample studies.
No published study was located that made comparisons between performance appraisal systems for their effects on adverse impact. Heilman, Block, and Stathatos (1997) found a strong affirmative action “stigma” against women in situations in which the performance criteria were relatively more ambiguous and no such stigma when the performance criteria were clear and unambiguous. However, like many studies involving illustrations of various forms of rating bias (e.g., Huber, 1989; Lenney, Mitchell, & Browning, 1983), this was not a study involving real performance appraisal data.
Rather, insurance agents rated hypothetical computer programmers after reviewing very limited information about the performance to be rated, with some of the hypothetical performers labeled as “hired through women/minority recruiting program.” While strong effects were found for the “stigma” theory, because of the contrived nature of the study, the very limited performance information, and the labeling manipulation, we do not believe this study provides much support for the theory that greater criterion specificity will reduce or eradicate discrimination.
Field research is needed to assess the effects of appraisal characteristics and criterion specificity on protected class outcomes using administratively significant appraisal data. Borrowing from the expert theme and the models presented by the EEOC and described below, our overall hypothesis is that greater appraisal criterion specificity will result in less adverse impact against protected class members in performance appraisal ratings. We tested this hypothesis using a large database of performance appraisals from a state government.
Our study sheds some light on this important issue using performance appraisal data from two very different appraisal systems that differ on criterion specificity: a less-specific category-based system with simple adjectival descriptions of performance levels versus a more specific standards-based Work Planning and Review (WP&R) system where unique performance standards must be written for each ratee and the rating criterion levels are carefully defined for the occupation.
Accepted definitions of adverse impact, such as the 80% rule, contemplate dichotomous decisions: the employee is hired or not hired, fired or not fired. However, performance appraisals tend to include intermediate gradations. This arises from the dual nature of the use of performance appraisals, which has been the subject of long discussion in that literature (e.g., Meyer, Kay, & French, 1964). Performance appraisals are used for developmental feedback to employees, which typically requires finer gradations in measurement than a simple acceptable/unacceptable.
But they are also used as the basis and formal justification for personnel actions, and these are the binary decisions that may result in adverse impact. Receiving a lower-than-expected performance appraisal may be emotionally costly to the employee by itself, but it would not have a measurable economic impact and would not be actionable unless the employer changed some element of compensation or the employment relationship based on the results of that appraisal.
Adverse impact in compensation level is another non-dichotomous variable. Current EEOC practice, as illustrated in Chapter 10 of the EEOC Compliance Manual (U.S. Equal Employment Opportunity Commission, 2000), calls for the calculation of the median compensation rate, and the partition of individual pay rates into those above the median and those at or below the median, in order to dichotomize the data for conventional adverse-impact analysis.
We applied the same partitioning strategy to the performance appraisal ratings under each of the rating types we investigated. Using this dichotomized outcome variable, we were then able to formulate hypotheses that are consistent with the usual adverse-impact measures, and also to hypothesize more complex relationships between demographics, rating format, and the outcome. This dichotomizing has an additional benefit, in that it allows us to compare the results of ratings made using two very different scales of measurement in a single analysis.
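As a rough illustration of the partitioning step, the sketch below splits ratings at the median (above the median versus at or below it) and then computes a conventional four-fifths impact ratio on the resulting pass/fail outcome. The function names, the numeric coding of the ratings (higher is better), the group labels, and the sample data are all hypothetical; none of them come from the study.

```python
from statistics import median


def dichotomize_at_median(ratings):
    """Return 1 for ratings above the median, 0 for ratings at or below it."""
    cutoff = median(ratings)
    return [1 if r > cutoff else 0 for r in ratings]


def impact_ratio(outcomes, groups, focal, reference):
    """Ratio of the focal group's success rate to the reference group's."""
    def rate(label):
        member_outcomes = [o for o, g in zip(outcomes, groups) if g == label]
        if not member_outcomes:
            raise ValueError(f"no observations for group {label!r}")
        return sum(member_outcomes) / len(member_outcomes)

    reference_rate = rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has a zero success rate")
    return rate(focal) / reference_rate


# Hypothetical five-point ratings (5 = highest) and hypothetical group labels.
ratings = [5, 4, 4, 3, 3, 3, 2, 5, 4, 3]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

above_median = dichotomize_at_median(ratings)
ratio = impact_ratio(above_median, groups, focal="B", reference="A")
flagged = ratio < 0.8  # four-fifths (80%) rule of thumb
```

Because both rating scales are reduced to the same binary outcome, the same ratio can be computed separately for each rating format and the two results compared.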
With the use of a single pass/fail criterion, it becomes possible to test for the interaction of rating format and race, gender, or age, and directly answer the question of whether or not the change to a more specific rating format offers advantages in the reduction of adverse impact in ratings. The presence of a meaningful interaction between a basis for discrimination and type of rating would indicate, depending on the type of interaction, that there was a differential effect due to both rating type and the discriminatory effect.
The absence of such an interaction would not preclude effects due solely to rater bias or rating format on rating levels, but there would be little reason to conclude that one was linked to the other. This is the central point of the expert witness arguments described above. We hypothesize generally that the criterion specificity of the rating systems under consideration will moderate the effect of race, gender, or age on the achievement of an above-median performance evaluation rating.
Hypothesis 1: Ratings made under the Work Planning and Review (WP&R) system will exhibit less adverse impact in performance evaluation on Black ratees than ratings made under the category ratings system.

Hypothesis 2: Ratings made under the Work Planning and Review (WP&R) system will exhibit less adverse impact in performance evaluation on female ratees than ratings made under the category ratings system.

Hypothesis 3: Ratings made under the Work Planning and Review (WP&R) system will exhibit less adverse impact on ratees aged 40 and older than ratings made under the category ratings system.
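The moderation these hypotheses describe can be illustrated with a ratio of odds ratios. The sketch below computes, for each rating format, the odds of an above-median rating for a focal group relative to a reference group, and then compares the two formats. The counts and names are hypothetical and chosen only for illustration; the study itself used logistic regression with covariates, which this simplified example omits.

```python
import math


def odds_ratio(focal_above, focal_not, ref_above, ref_not):
    """Odds of an above-median rating for the focal group relative to the reference group."""
    cells = (focal_above, focal_not, ref_above, ref_not)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive")
    return (focal_above / focal_not) / (ref_above / ref_not)


# Hypothetical counts, ordered as:
# (focal above median, focal not above, reference above median, reference not above)
category_counts = (40, 60, 50, 50)   # category-based format
standards_counts = (45, 55, 50, 50)  # standards-based format

or_category = odds_ratio(*category_counts)
or_standards = odds_ratio(*standards_counts)

# In a logistic model containing only group, format, and group-by-format terms,
# the log of this ratio equals the interaction coefficient.
ratio_of_odds_ratios = or_standards / or_category
interaction_log_odds = math.log(ratio_of_odds_ratios)
```

A ratio above 1 (a positive log value) would mean the focal group's relative odds of an above-median rating are higher under the standards-based format, which is the direction the hypotheses predict. A ratio near 1 would indicate no moderation by format.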
Methods

Sample

We had access to the computerized records of performance appraisals for all employees of a large southern state government for a seven-year period.
In the third year of the database, the state began a conversion, on an agency-by-agency basis, to a standardized, WP&R type of performance appraisal process from its previous simple, category-based system. While extensive in number, the records provide only limited information. Besides the rating itself, we had available the date of the rating and demographic and identifying information on the rated person only. The files did not contain identifying information on raters, so we were unable to examine rater-ratee interactions across the transition.
We selected eight occupational groups with the largest numbers of available rating events for analysis. These naturally coincide with the groups with the most incumbents. We statistically controlled for occupational group by including dummy-coded covariates in our regression analyses in order to respect the differences in job content and context that are reflected in standards-based appraisal formats. For consistency, we controlled for the same effects in the regression analysis of category-based ratings, even though a common rating format and scales were used across all occupation groups.
For analyses of racial bias, we restricted our analysis to African-American and White employees only, since the records available to us reflected comparatively small numbers of other ethnic groups. The combination of these restrictions reduced our sample size to 12,177 category-based rating events and 236,693 standards-based rating events made on 69,026 unique individuals. These individuals were 69.76% White and 30.24% African-American, 59.34% female, and 51.12% over 40 years of age. We coded race, sex, and age as 0, 1 variables. In each case, 1 was assigned to the category of interest: African-American, female, or person over 40.
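One way to write down a model consistent with this coding is sketched below. It is an assumed specification, not the one reported by the study, whose exact form is not given here. The symbols are introduced only for illustration: $Y_i = 1$ if rating event $i$ is above the median, $G_i$ is the 0/1 indicator for the category of interest, $F_i$ is a 0/1 indicator for rating format (assumed here to be 1 for the standards-based system), and $O_{ik}$ are occupational-group dummies (seven, assuming one of the eight groups serves as the reference).

```latex
\[
\operatorname{logit}\,\Pr(Y_i = 1)
  = \beta_0 + \beta_1 G_i + \beta_2 F_i + \beta_3\,(G_i \times F_i)
  + \sum_{k=1}^{7} \gamma_k O_{ik}
\]
```

Under this sketch, $\beta_1$ captures the group difference under the category-based format, and $\beta_3$ captures how that difference changes under the standards-based format. The hypotheses correspond to $\beta_3$ being in the direction that reduces the group difference.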
When ratings of the same individuals were compared across the two rating formats (same person, subsequent year) the Pearson r was .3414 (p < .0001, n = 11,286) and the polychoric correlation (an index of association between ordinal variables) was .4520 (ASE < .0104, n = 11,286), indicating substantial but not extreme consistency in ratings across occasions and rating formats. We were reluctant to pursue this analysis further, since we could not identify the raters in this sample. An analysis that controlled for both the rater and the rating instrument would be a more powerful test of our hypotheses, but that was not available to us.
As noted above, Sackett and DuBois (1991) have found support for the safety of the assumption of random assignment of raters when the sample is as large as this one.

Instruments

The two appraisal systems under study here were a simple, five-point category-based rating format and a Work Planning and Review (WP&R) system. The category-based approach called for a summary rating with the following criterion anchors: “outstanding,” “above satisfactory,” “satisfactory,” “conditionally satisfactory,” and “unsatisfactory.” No other information was made available to raters for defining these performance levels.
This summary rating was used for major personnel decisions such as promotions, probationary decisions, and terminations. Prior to making the summary rating, raters also made ratings on the quantity and quality of work using the same undefined criteria and no definitions of either quality or quantity. The WP&R system called for the generation of performance standards for each ratee. Guidelines for this system were developed following procedures described by Carlyle and Ellison in Appendix B of Bernardin & Beatty (1984, pp. 343-348).
This appendix describes guidelines for writing performance standards, selecting standards, designating critical elements, deriving evaluative weights, and making summary ratings. Raters, in consultation with the ratee, defined acceptable standards of performance on all critical elements for a particular performance period. Each performance standard had to be defined in terms of a specific measure of quantity, quality, cost, or timeliness. These guidelines also prescribed five criteria for developing and critiquing performance standards and provided examples of acceptable standards.
All standards had to be written prior to the start of an appraisal period and ratees had to indicate that they had reviewed each standard and agreed with its use for evaluation. Standards were written to describe a fully satisfactory level of performance. Each rater and ratee were required to write and agree on at least four performance standards, at least one of which had to be designated as a “critical element.” The guidelines provided four criteria for designating a required element as “critical.”
Raters and ratees also agreed on the relative importance of all performance elements and distributed 100 points among the standards. One review of the quality of the performance standards found that over 65% of a sample of the standards written under the new WP&R system met the criteria for acceptable standards presented in the training guidelines (Bernardin, Hagan, & Kane, 1998). In addition, an average of over five performance standards was written per ratee. An example of an acceptable performance standard is “all legal briefs are submitted to the Court pursuant to their imposed deadlines.”
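The structural rules described above (at least four standards, at least one critical element, and 100 importance points distributed among the standards) can be expressed as a simple check. The sketch below is illustrative only: the `Standard` class, the `check_plan` function, the use of whole-number points, and the placeholder standards are assumptions made for the example and were not part of the system studied.

```python
from dataclasses import dataclass


@dataclass
class Standard:
    text: str               # wording of the performance standard
    points: int             # share of the 100 importance points (assumed integer)
    critical: bool = False  # designated as a "critical element"


def check_plan(standards):
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    if len(standards) < 4:
        problems.append("fewer than four performance standards")
    if not any(s.critical for s in standards):
        problems.append("no standard designated as a critical element")
    if sum(s.points for s in standards) != 100:
        problems.append("importance points do not total 100")
    return problems


# One standard quoted from the text plus three hypothetical placeholders.
plan = [
    Standard("All legal briefs are submitted to the Court pursuant to "
             "their imposed deadlines.", 40, critical=True),
    Standard("Hypothetical standard B", 30),
    Standard("Hypothetical standard C", 20),
    Standard("Hypothetical standard D", 10),
]
problems = check_plan(plan)
```

A check of this kind covers only the countable rules; whether each standard is written in terms of quantity, quality, cost, or timeliness still requires human review.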
At the conclusion of the appraisal period, raters were required to make a judgment on a three-point scale defined as “exceeds the standard,” “achieves the standard,” or “below the standard.” “Exceeds the standard” was defined as “consistently exceeds the fully satisfactory standard” and “achieves the standard” was defined as “consistently achieves the fully satisfactory level of performance.” After performance on all standards was evaluated, raters then made a summary rating on the extent to which the standards were exceeded or achieved using the same three-point scale.