Psychology and Cognitive Sciences

Open journal

ISSN 2380-727X

Expected Agreement Coefficient for Norm-Referenced Tests With Classical Test Theory

Rashid S. Almehrizi*

Rashid Saif Almehrizi, PhD

Associate Professor, Educational Measurement and Statistics, Department of Psychology, Assessment and Technical Support Unit, Director, College of Education Sultan Qaboos University, P.O. Box: 32, PC: 123 Al-Khoudh, Sultanate of Oman; Tel. +968 2414 1613; E-mail:


Psychological tests can follow two frameworks for the interpretation and use of their results: norm-referenced and criterion-referenced. With norm-referenced interpretation, the investigator's interest focuses on the relative ordering of examinees with respect to the performance of the norm group with which the examinee is associated.1 In the generalizability theory framework, relative error score variance is defined as the expected squared difference between an examinee's observed deviation score (the observed score expressed as a deviation from the group's observed mean) and the examinee's true deviation score (the true score expressed as a deviation from the group's true mean). On the other hand, criterion-referenced interpretation means that the investigator's interest focuses on absolute interpretations of scores and on absolute error score variance.1,2,3 Absolute error score variance is defined as the expected squared difference between an examinee's observed score and the examinee's true score.4

Since the first distinction between norm-referenced and criterion-referenced interpretations of test results, many researchers, including Glaser and Nitko5 and Popham and Husek,6 have argued that reliability coefficients in classical test theory are appropriate for norm-referenced tests. These coefficients (such as KR-20 of Kuder and Richardson7 and coefficient alpha of Cronbach8) depend on the relative standing of an examinee in a norm group.9,10

Kane and Brennan11 introduced a general agreement function that summarizes several existing agreement coefficients for different uses and interpretations of test scores. Using this function, Kane and Brennan11 defined the expected agreement coefficient for norm-referenced tests (called the generalizability coefficient) within the generalizability theory framework. Using the general linear model for the $p \times i$ design (all examinees take the same set of items) in generalizability theory for an examinee's observed score on each item, $X_{pi}$, on a sample of $n$ items, they derived the agreement coefficient for norm-referenced interpretation and showed that the estimator of this coefficient is equal to coefficient alpha developed by Cronbach.8

The concept of expected agreement and its derivation method are very useful for understanding test results and enhancing their interpretation and use.12 The concept helps to differentiate the types of examinees' error scores and, accordingly, examinees' true scores and test score reliability. Brennan1 explained that the norm-referenced agreement coefficient is associated with relative error scores, whereas the criterion-referenced agreement coefficient is associated with absolute error scores. The two types of error scores differ in their definitions and in their implications for estimating and interpreting test score reliability.

The current application of expected agreement is limited to the generalizability theory framework. However, generalizability theory involves both theoretical and practical complexities.1 It is based on a mixture of variance-component concepts from analysis of variance and concepts from classical test theory. Similarly, estimating the expected agreement coefficient requires estimating mean squares.1,13

On the other hand, classical test theory is based on simpler concepts and estimation methods that are appreciated by many practitioners.4 The advantages and applications of expected agreement have not yet been introduced within classical test theory. One possible reason for the delayed use of the expected agreement coefficient in classical test theory is the theory's conventional definition of equivalent test forms.

This paper introduces the expected agreement for norm-referenced interpretations of test scores within the classical test theory framework. It presents the context and assumptions of randomly equivalent test forms that are necessary to develop the expected agreement coefficient. It derives the expected agreement/reliability coefficient for norm-referenced tests using the general agreement coefficients pioneered by Kane and Brennan,11 and it outlines the estimator of this coefficient.



The paper uses the procedure outlined by Kane and Brennan11 for deriving the expected agreement between two randomly selected instances of a testing procedure. The procedure assumes that the instances or tests are randomly selected from a universe of possible instances, which supports the assumption that the expected distribution of outcomes for the population is the same for each administration of the testing procedure. The agreement function, $a(S_{pi}, S_{pj})$, defines the degree of agreement between any two scores of an examinee on two testing procedures, $S_{pi}$ and $S_{pj}$. This agreement function can take any form as long as it satisfies three conditions:

(1) $a(S_{pi}, S_{pj}) \geq 0$,

(2) $a(S_{pi}, S_{pj}) = a(S_{pj}, S_{pi})$, and

(3) $a(S_{pi}, S_{pi}) + a(S_{pj}, S_{pj}) \geq 2a(S_{pi}, S_{pj})$.

Two general agreement indices for the testing procedure are defined: one is corrected for chance and the other is not. The index of agreement that is not corrected for chance is

$$\theta = \frac{A}{A_m}.$$

The term $A$ is the expected agreement, $A = E_{p,I,J}\, a(S_{pI}, S_{pJ})$, where the expectation is taken over the population of examinees and over pairs of tests that are independently sampled from the universe of tests and administered to the same population of examinees. The term $A_m$ is the expected agreement between an instance of the testing procedure and itself, $A_m = E_{p,I}\, a(S_{pI}, S_{pI})$, and it represents the maximum value of $A$. $A$ equals $A_m$ when each examinee in the population has the same score on every test. Kane and Brennan noted that $A_m$ corrects for the dependence of $A$ on the scale of $a(S_{pI}, S_{pJ})$.

The index of agreement that is corrected for chance is

$$\theta_c = \frac{A - A_c}{A_m - A_c},$$

where the term $A_c$ quantifies the agreement between the two instances of the testing procedure that is due solely to chance. It is defined as the expected agreement between the score, $S_{pI}$, for a randomly selected examinee p on one test and the score, $S_{qJ}$, for another independently sampled examinee q on another independently sampled test. That is,

$$A_c = E_{p,q,I,J}\, a(S_{pI}, S_{qJ}) = E_{p,I}\, a(S_{pI})\, E_{q,J}\, a(S_{qJ}).$$

Also, Kane and Brennan11 define the expected disagreement or loss as the difference between the maximum expected agreement and the expected agreement,

$$\sigma^2(\epsilon) = L = A_m - A.$$

This expected loss gives the error score variance associated with the expected agreement function.
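The relationship among the two indices and the expected loss can be sketched in a few lines of Python. This is only an illustration, assuming the usual Kane and Brennan forms $\theta = A/A_m$ and $\theta_c = (A - A_c)/(A_m - A_c)$; the function name `agreement_indices`, the `AgreementIndices` container, and the input values are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class AgreementIndices(NamedTuple):
    theta: float    # index not corrected for chance, A / A_m
    theta_c: float  # index corrected for chance, (A - A_c) / (A_m - A_c)
    loss: float     # expected loss (error score variance), A_m - A


def agreement_indices(A: float, A_m: float, A_c: float = 0.0) -> AgreementIndices:
    """Compute both agreement indices and the expected loss.

    A   : expected agreement between two independently sampled tests
    A_m : maximum expected agreement (a test with itself)
    A_c : expected agreement due solely to chance
    """
    if A_m <= 0 or A_m == A_c:
        raise ValueError("A_m must be positive and different from A_c")
    return AgreementIndices(
        theta=A / A_m,
        theta_c=(A - A_c) / (A_m - A_c),
        loss=A_m - A,
    )


# Made-up values, used only to show how the three quantities relate.
example = agreement_indices(A=0.6, A_m=0.8, A_c=0.2)
no_chance = agreement_indices(A=0.6, A_m=0.8, A_c=0.0)
```

When the chance agreement is zero, as in `no_chance`, the two indices coincide.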


To derive the expected agreement coefficient within the context of classical test theory, we first need to introduce the concept of randomly equivalent test forms in place of the classical equivalent test forms. Randomly equivalent test forms arise when the test developer is able to build a very large or infinite number of different test forms from a large pool of items measuring the psychological construct. Test forms of equal size are considered randomly equivalent if each is sampled randomly and independently from the large pool of items. These test forms are not expected to have equal mean scores or equal variances. However, examinees' error scores on these randomly equivalent test forms are expected to be uncorrelated. Moreover, it is assumed that any test form is administered to a large sample of examinees randomly selected from the population of examinees.

To derive the expected agreement/reliability of test scores on a test form (say form X), we need to hypothesize that this test form and another hypothesized form (say form Y) are randomly equivalent test forms with different items but equal size (n items each, with I denoting the items on form X and J the items on form Y). Let us refer to form X as the reference test form and to form Y as the hypothesized test form. These two forms are then administered to the same sample of examinees of size N.

For a norm-referenced test, where decisions are based on the position of examinees relative to their peers, the agreement function is defined as the expected product, over all examinees, of the deviations of the observed average scores ($\bar{X}_p$ and $\bar{Y}_p$) on two randomly equivalent test forms from the mean scores for the items on each form ($T_I$ and $T_J$):

$$A(r) = E_{P,I,J}(\bar{X}_p - T_I)(\bar{Y}_p - T_J) = \frac{1}{n^2} E_{P,I,J} \sum_i \sum_j (X_{pi} - T_i)(Y_{pj} - T_j)$$

$$= E_{P,I,J}(X_{pi} - T_i)(Y_{pj} - T_j),$$

where the expectation is taken over infinitely many randomly equivalent test forms X and Y, each with the same number of items from the domain, and over infinitely many independent random samples of N examinees from the population. The term $E_{P,I,J}(X_{pi} - T_i)(Y_{pj} - T_j)$ is the expected mean pairwise covariance of items on X with items on Y, each taken relative to its own item mean score.

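For the reference test form X, let $E_{P,I}(X_{pi} - T_i)(X_{pi'} - T_{i'})$ represent the expected mean pairwise covariance of distinct items on X ($i \neq i'$), each taken relative to its own mean score. Similarly, let $E_{P,J}(Y_{pj} - T_j)(Y_{pj'} - T_{j'})$ have the same definition for items on test form Y. Because the test forms are randomly equivalent,

$$E_{P,I,J}(X_{pi} - T_i)(Y_{pj} - T_j) = E_{P,I}(X_{pi} - T_i)(X_{pi'} - T_{i'}) = E_{P,J}(Y_{pj} - T_j)(Y_{pj'} - T_{j'}).$$

Hence, the expected agreement function, $A(r)$, becomes

$$A(r) = E_{P,I}(X_{pi} - T_i)(X_{pi'} - T_{i'}) = \frac{1}{n(n-1)} E_{P,I} \sum\sum_{i \neq i'} (X_{pi} - T_i)(X_{pi'} - T_{i'}).$$

By simple algebra, using $\sum\sum_{i \neq i'} a_i a_{i'} = \left(\sum_i a_i\right)^2 - \sum_i a_i^2$, $A(r)$ becomes

$$A(r) = \frac{1}{n(n-1)}\left[n^2 E_{P,I}(\bar{X}_p - T_I)^2 - E_I \sum_i E_P (X_{pi} - T_i)^2\right],$$

where $T_I = \frac{1}{n}\sum_i T_i$.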

This expected agreement function gives the true score variance for norm-referenced tests, $\sigma^2(T_r)$. The maximum expected agreement for norm-referenced testing is

$$A_m(r) = E_{P,I}(\bar{X}_p - T_I)(\bar{X}_p - T_I) = E_{P,I}(\bar{X}_p - T_I)^2.$$

The expected agreement for norm-referenced testing due to chance is

$$A_c(r) = E_{P,Q,I,J}(\bar{X}_p - T_I)(\bar{Y}_q - T_J) = E_{P,I}(\bar{X}_p - T_I)\, E_{Q,J}(\bar{Y}_q - T_J)$$

$$= E_{P,I}\left(\frac{1}{n}\sum_i X_{pi} - \frac{1}{n}\sum_i T_i\right) E_{Q,J}\left(\frac{1}{n}\sum_j Y_{qj} - \frac{1}{n}\sum_j T_j\right) = 0,$$

because $E_P(X_{pi}) = T_i$ and $E_Q(Y_{qj}) = T_j$. Hence, the norm-referenced agreement coefficient is

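$$\theta(r) = \theta_c(r) = \frac{A(r)}{A_m(r)} = \frac{E_{P,I}(X_{pi} - T_i)(X_{pi'} - T_{i'})}{E_{P,I}(\bar{X}_p - T_I)^2},$$

or

$$\theta(r) = \theta_c(r) = \frac{n^2 E_{P,I}(\bar{X}_p - T_I)^2 - E_I \sum_i E_P (X_{pi} - T_i)^2}{n(n-1)\, E_{P,I}(\bar{X}_p - T_I)^2}.$$

This coefficient can also be written as

$$\theta(r) = \theta_c(r) = \frac{n}{n-1}\left[1 - \frac{E_I \sum_i E_P (X_{pi} - T_i)^2}{n^2 E_{P,I}(\bar{X}_p - T_I)^2}\right].$$

Because the chance agreement $A_c(r)$ is zero, the correction for chance agreement has no effect on the norm-referenced agreement coefficient.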

The expected loss associated with the norm-referenced agreement coefficient is

$$L(r) = A_m(r) - A(r) = \frac{1}{n(n-1)}\left[E_I \sum_i E_P (X_{pi} - T_i)^2 - n E_{P,I}(\bar{X}_p - T_I)^2\right]$$

$$= \frac{1}{n(n-1)} E_{P,I} \sum_i \left((X_{pi} - T_i) - (\bar{X}_p - T_I)\right)^2,$$

which equals the appropriate error score variance for norm-referenced tests, $\sigma^2(\epsilon_r)$.

This error score variance is similar to the relative error score variance identified by Brennan and Kane2 using generalizability theory. It quantifies the expected squared difference between the deviation of an examinee's observed score from the test average score and the deviation of the examinee's true score from the test average score on the domain of items.
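As a numerical check on this algebra, the following Python sketch treats a small, made-up score matrix as if it were the whole examinee population on one fixed form, so the expectation over forms is not simulated. It evaluates $A(r)$ as the mean product over distinct item pairs, $A_m(r)$ as the mean squared deviation of the average score, and the loss in both of its forms. The data and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical 5 examinees x 4 items score matrix (illustrative data only).
x = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=float)
n = x.shape[1]

# Item-level deviations X_pi - T_i, with T_i taken as the item mean
# over these examinees (an assumption made for the illustration).
d = x - x.mean(axis=0)
# Deviations of each examinee's average score, X-bar_p - T_I.
d_bar = d.mean(axis=1)

# A(r): sum of products over distinct item pairs, scaled by 1 / (n(n-1)),
# then averaged over examinees.
A_r = np.mean((d.sum(axis=1) ** 2 - (d ** 2).sum(axis=1)) / (n * (n - 1)))
# A_m(r): mean squared deviation of the average score.
A_m = np.mean(d_bar ** 2)
# A_c(r): product of two mean deviations, each of which is zero.
A_c = d_bar.mean() * d_bar.mean()

# The loss computed as A_m(r) - A(r) and from the direct squared-difference form.
loss_from_difference = A_m - A_r
loss_direct = np.mean(((d - d_bar[:, None]) ** 2).sum(axis=1)) / (n * (n - 1))

print(A_r, A_m, A_c, loss_from_difference, loss_direct)
```

The two loss values agree up to floating-point error, and the chance term vanishes because the deviations average to zero.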


The components of all expressions of the expected agreement/reliability coefficients have the form of expected value of some terms over different random sets of items from the domain of items and over different random samples of examinees from the population of examinees. The sample counterparts of these terms can be used to estimate these expected values.

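The expected norm-referenced agreement/reliability coefficient can be estimated from data collected by administering one test form of n items to a representative sample of N examinees. If we substitute $\bar{X}_p$, $T_i$ and $T_I$ by their sample counterparts, $\bar{x}_p = \frac{1}{n}\sum_i x_{pi}$, $\bar{x}_i = \frac{1}{N}\sum_p x_{pi}$, and $\bar{x} = \frac{1}{n}\sum_i \bar{x}_i = \frac{1}{N}\sum_p \bar{x}_p$, respectively, the estimator of the expected agreement coefficient for a norm-referenced test is

$$\hat{\theta}(r) = \frac{\frac{1}{n(n-1)}\sum\sum_{i \neq i'} \sigma_{ii'}}{\sigma^2(\bar{x}_p)} = 1 - \frac{\frac{1}{n(n-1)}\left[\sum_i \sigma^2(x_{pi}) - n\,\sigma^2(\bar{x}_p)\right]}{\sigma^2(\bar{x}_p)}.$$

The associated loss is

$$\hat{L}(r) = \frac{1}{n(n-1)}\left[\sum_i \sigma^2(x_{pi}) - n\,\sigma^2(\bar{x}_p)\right],$$

which gives the estimator of the relative error score variance for a norm-referenced test.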

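In these equations,

$$\sigma_{ii'} = \frac{1}{N-1}\sum_p (x_{pi} - \bar{x}_i)(x_{pi'} - \bar{x}_{i'}),$$

$$\sigma^2(x_{pi}) = \frac{1}{N-1}\sum_p (x_{pi} - \bar{x}_i)^2,$$

$$\sigma^2(\bar{x}_p) = \frac{1}{N-1}\sum_p (\bar{x}_p - \bar{x})^2.$$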
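The estimator can be sketched in Python as follows. The sketch assumes an N x n score matrix with examinees in rows and items in columns, and sample variances with the N − 1 divisor. The function names and the data are hypothetical. A textbook version of coefficient alpha is included only to compare the two numbers.

```python
import numpy as np


def norm_referenced_agreement(x):
    """Return (theta_hat, loss_hat) for an N x n matrix of item scores."""
    x = np.asarray(x, dtype=float)
    N, n = x.shape
    if N < 2 or n < 2:
        raise ValueError("need at least 2 examinees and 2 items")
    item_var = x.var(axis=0, ddof=1)          # sigma^2(x_pi) for each item
    mean_var = x.mean(axis=1).var(ddof=1)     # sigma^2(x-bar_p)
    # Estimated relative error score variance.
    loss_hat = (item_var.sum() - n * mean_var) / (n * (n - 1))
    theta_hat = 1.0 - loss_hat / mean_var
    return theta_hat, loss_hat


def cronbach_alpha(x):
    """Coefficient alpha in its usual total-score form, for comparison."""
    x = np.asarray(x, dtype=float)
    n = x.shape[1]
    total_var = x.sum(axis=1).var(ddof=1)
    return n / (n - 1) * (1.0 - x.var(axis=0, ddof=1).sum() / total_var)


# Hypothetical 5 examinees x 4 items score matrix (illustrative data only).
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
])

theta_hat, loss_hat = norm_referenced_agreement(scores)
alpha = cronbach_alpha(scores)
print(theta_hat, alpha, loss_hat)
```

On this matrix the two coefficients coincide up to floating-point error, which illustrates the equality with coefficient alpha for one data set rather than proving it.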


The paper derived the expected agreement coefficient for norm-referenced tests within the classical test theory framework under the assumption of randomly equivalent test forms as a replacement for the conventional equivalent test forms. The estimator of the resulting coefficient proved to be equal to coefficient alpha of Cronbach,8 which was derived under the different assumption of essentially tau-equivalent test forms.

This result supports the argument of Glaser and Nitko5 and Popham and Husek6 that reliability coefficients in classical test theory, such as coefficient alpha and KR-20, are appropriate for norm-referenced tests. The error score variance associated with coefficient alpha is the relative error score variance, which reflects the difference between an individual examinee's performance and the performance of his/her peers who took the test.

The expected agreement coefficient for norm-referenced tests can be estimated using either unbiased or biased estimators of its terms. It can easily be shown that using the biased estimators of the terms in the above equations gives identical estimates of the expected agreement coefficient for norm-referenced tests. However, the estimates of the error score variance and the true score variance are affected by whether the unbiased or biased sample variances are used (the unbiased estimators are preferred).
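The effect of the divisor can be illustrated with a short Python sketch. It assumes that "biased" means dividing by N and "unbiased" means dividing by N − 1, which NumPy selects through the `ddof` argument of `var`. The helper name and the data are hypothetical.

```python
import numpy as np


def coefficient_and_loss(x, ddof):
    """Agreement coefficient and loss using variances with the given ddof."""
    x = np.asarray(x, dtype=float)
    n = x.shape[1]
    item_var_sum = x.var(axis=0, ddof=ddof).sum()
    mean_var = x.mean(axis=1).var(ddof=ddof)
    loss = (item_var_sum - n * mean_var) / (n * (n - 1))
    return 1.0 - loss / mean_var, loss


# Hypothetical 5 examinees x 4 items score matrix (illustrative data only).
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
])
N = scores.shape[0]

theta_unbiased, loss_unbiased = coefficient_and_loss(scores, ddof=1)
theta_biased, loss_biased = coefficient_and_loss(scores, ddof=0)
print(theta_unbiased, theta_biased, loss_unbiased, loss_biased)
```

The divisor cancels in the ratio, so the two coefficients match, while the biased loss is smaller than the unbiased loss by the factor (N − 1)/N.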

1. Brennan RL. Generalizability theory and classical test theory. Applied Measurement in Education. 2010; 24(1): 1-21. doi: 10.1080/08957347.2011.532417

2. Brennan RL, Kane MT. An index of dependability for mastery tests. J Educ Meas. 1977; 14(3): 277-289. doi: 10.1111/j.1745-3984.1977.tb00045.x

3. Brennan RL, Kane MT. Signal/noise ratios for domain-referenced tests. Psychometrika. 1977; 42(4): 609-625. doi: 10.1007/BF02295983

4. Gao X, Brennan R, Guo F. Modeling measurement facets and assessing generalizability in a large-scale writing assessment. GMAC Research Report. 2015.

5. Glaser R, Nitko AJ. Measurement in learning and instruction. In: Thorndike RL, ed. Educational measurement. Washington DC, USA: American Council on Education; 1971.

6. Popham WJ, Husek TR. Implications of criterion-referenced measurement. J Educ Meas. 1969; 6(1): 1-9. doi: 10.1111/j.1745-3984.1969.tb00654.x

7. Kuder GF, Richardson MW. The theory of the estimation of test reliability. Psychometrika. 1937; 2(3): 151-160. doi: 10.1007/BF02288391

8. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951; 16(3): 297-334. doi: 10.1007/BF02310555

9. Almehrizi RS. Coefficient alpha and reliability of scale scores. Applied Psychological Measurement. 2013; 37(6): 438-459. doi: 10.1177/0146621613484983

10. Cronbach LJ, Shavelson RJ. My Current thoughts on coefficient alpha and successor procedures. Educ Psychol Meas. 2004; 64(3): 391-418. doi: 10.1177/0013164404266386

11. Kane MT, Brennan RL. Agreement coefficients as indices of dependability for criterion-referenced tests. Applied Psychological Measurement. 1980; 4(1): 105-126. doi: 10.1177/014662168000400111

12. Almehrizi R. Normalization of mean squared differences to measure agreement for continuous data. Stat Methods Med Res. 2013. doi: 10.1177/0962280213507506

13. AlKharusi H. Generalizability theory: An analysis of variance approach to measurement problems in educational assessment. J Studies Educ. 2012; 2(1): 184-196. doi: 10.5296/jse.v2i1.1227

