##### INTRODUCTION

Psychological tests can follow two frameworks for the interpretation and use of their results: norm-referenced and criterion-referenced. With norm-referenced interpretations and uses, the investigator’s interest focuses on the relative ordering of examinees with respect to the performance of the norm group with which each examinee is associated.^{1} In the generalizability theory framework, relative error score variance is defined as the expected squared difference between an examinee’s observed deviation score (the observed score minus the group’s observed mean) and the examinee’s true deviation score (the true score minus the group’s true mean). In contrast, criterion-referenced interpretation suggests that the investigator’s interest focuses on absolute interpretations of scores and on absolute error score variance.^{1,2,3} Absolute error score variance is defined as the expected squared difference between an examinee’s observed score and the examinee’s true score.^{4}

Since the first distinction between norm-referenced and criterion-referenced interpretations of test results, many researchers, including Glaser and Nitko^{5} and Popham and Husek,^{6} have argued that reliability coefficients in classical test theory are appropriate for norm-referenced tests. These coefficients (such as KR-20^{7} and coefficient alpha^{8}) depend on the relative standing of an examinee in a norm group.^{9,10}

Kane and Brennan^{11} introduced a very useful general agreement function that summarizes different existing agreement coefficients for different uses and interpretations of test scores. Using this general agreement function, Kane and Brennan^{11} defined the expected agreement coefficient for norm-referenced tests (called the generalizability coefficient) within the generalizability theory framework. Using the general linear model for the crossed design (all examinees take the same set of items) in generalizability theory for an examinee’s observed score on each item, X_{pi}, on a sample of items, Brennan and Kane derived the agreement coefficient for norm-referenced interpretation and showed that the estimator of this coefficient is equal to coefficient alpha developed by Cronbach.^{8}

The concept of expected agreement and its derivation method are very useful for understanding test results and enhancing their interpretation and uses.^{12} They help to differentiate examinees’ error scores and, accordingly, examinees’ true scores and test score reliability. Brennan^{1} explained that the norm-referenced agreement coefficient is associated with relative error scores, whereas the criterion-referenced agreement coefficient is associated with absolute error scores. The two types of error scores differ in their definitions and implications when estimating and interpreting test score reliability.

The current application and utilization of the expected agreement is limited to the generalizability theory framework. However, generalizability theory involves both theoretical and practical complexities.^{1} It is based on a mixture of concepts from variance components in analysis of variance and from classical test theory. Similarly, the estimation of the expected agreement coefficient requires the estimation of mean squares.^{1,13}

On the other hand, classical test theory is based on simpler concepts and estimation methods that are appreciated by many practitioners.^{4} The advantages and applications of expected agreement have not yet been introduced within classical test theory. One possible reason behind the delayed use of the expected agreement coefficient in classical test theory might be traced to its conventional definition of equivalent test forms.

This paper introduces the expected agreement for norm-referenced interpretations of test scores within the classical test theory framework. The paper presents the context and assumptions of randomly equivalent test forms that are necessary to develop the expected agreement coefficients. It then derives the expected agreement/reliability coefficient for norm-referenced tests utilizing the general agreement coefficients pioneered by Kane and Brennan.^{11} Moreover, the estimator of this expected agreement coefficient is outlined.

##### METHOD

**Procedure**

The paper uses the procedure outlined by Kane and Brennan^{11} for deriving the expected agreement between two randomly selected instances of a testing procedure. The procedure assumes that the instances, or tests, are randomly selected from a universe of possible instances, which supports the assumption that the expected distribution of outcomes for the population is the same for each administration of the testing procedure. The agreement function, a(S_{pi},S_{pj}), defines the degree of agreement between any two scores of an examinee on two testing procedures, S_{pi} and S_{pj}. This agreement function can take any form as long as it satisfies three conditions:

(1) a(S_{pi},S_{pj})≥0,

(2) a(S_{pi},S_{pj})=a(S_{pj},S_{pi}), and

(3) a(S_{pi},S_{pi})+a(S_{pj},S_{pj})≥2a(S_{pi},S_{pj}).
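To make the three conditions concrete, the following sketch (hypothetical and stdlib-only; the product agreement function shown is one common choice for nonnegative scores, not the paper's only option) checks them numerically over a grid of score pairs:

```python
import itertools

def satisfies_agreement_conditions(a, scores, tol=1e-12):
    """Check Kane and Brennan's three conditions for an agreement
    function a(s_i, s_j) on every pair drawn from `scores`."""
    for s_i, s_j in itertools.product(scores, repeat=2):
        if a(s_i, s_j) < -tol:                      # (1) non-negativity
            return False
        if abs(a(s_i, s_j) - a(s_j, s_i)) > tol:    # (2) symmetry
            return False
        # (3) a(s_i,s_i) + a(s_j,s_j) >= 2 a(s_i,s_j)
        if a(s_i, s_i) + a(s_j, s_j) < 2 * a(s_i, s_j) - tol:
            return False
    return True

# Product agreement on nonnegative (e.g., proportion-correct) scores:
# condition (3) reduces to x^2 + y^2 >= 2xy, which always holds.
product_agreement = lambda x, y: x * y
scores = [0.0, 0.25, 0.5, 0.75, 1.0]
print(satisfies_agreement_conditions(product_agreement, scores))  # True
```

An asymmetric function such as a(x, y) = x − y fails conditions (1) and (2), so the checker rejects it.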

Two general agreement indices of instances for the testing procedure are defined: one is corrected for chance while the other is not. The index of agreement that is not corrected for chance is

θ = A / A_{m}.

The term *A* is the expected agreement given by A=E_{p,I,J}a(S_{pI},S_{pJ}), where the expectation is taken over the population of examinees and over pairs of tests that are independently sampled from the universe of tests and administered to the same population of examinees. The term *A*_{m} is the expected agreement between an instance of the testing procedure and itself, *A*_{m}=E_{p,I} a(S_{pI},S_{pI}), and represents the maximum value of *A*. *A* is equal to *A*_{m} when each examinee in the population has the same score on every test. Kane and Brennan noted that dividing by *A*_{m} corrects the problem of the dependence of *A* on the scale of a(S_{pI},S_{pJ}).

The index of agreement which is corrected for chance is

θ_{c} = (A − A_{c}) / (A_{m} − A_{c}),

where the term *A*_{c} quantifies the agreement between the two instances of the testing procedure that is due solely to chance. It is defined as the expected agreement between the score, S_{pI}, for a randomly selected examinee *p* on one test and the score, S_{qJ}, for another independently sampled examinee *q* on another independently sampled test. That is,

*A*_{c}=E_{p,q,I,J} a(S_{pI},S_{qJ}), which, for a product agreement function (as in the norm-referenced case below), factors by independence into E_{p,I}(S_{pI})E_{q,J}(S_{qJ}).

Also, Kane and Brennan^{11} define the expected disagreement or loss as the difference between the maximum expected agreement and the expected agreement,

σ^{2} (ϵ) = L = A_{m}−A.

This expected loss gives the error score variance associated with the expected agreement function.

##### RESULTS

In order to derive the expected agreement coefficient within the context of classical test theory, we first need to introduce the concept of randomly equivalent test forms in place of the classical equivalent test forms. Randomly equivalent test forms arise when the test developer is able to build a very large or infinite number of different test forms from a large pool of items measuring the psychological construct. Hence, test forms of equal size are considered randomly equivalent if each is sampled randomly and independently from the large pool of items. These test forms are not expected to have equal mean scores or equal variances. However, examinees’ error scores from these randomly equivalent test forms are expected to be uncorrelated. Moreover, it is assumed that any test form is administered to a large sample of examinees randomly selected from the population of examinees.
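As a rough illustration of these assumptions, the following Monte Carlo sketch (all names, abilities, difficulties, and noise scales are hypothetical, not from the paper) draws two forms at random from a large item pool and checks that the resulting error scores are essentially uncorrelated:

```python
import random

random.seed(7)
N, n, pool = 500, 20, 1000

# Hypothetical item difficulties for a large pool, and examinee abilities.
diff = [random.gauss(0, 1) for _ in range(pool)]
theta = [random.gauss(0, 1) for _ in range(N)]

def form_scores(item_ids):
    """Observed mean scores and error scores (the noise part) for all
    examinees on one randomly sampled form."""
    obs, err = [], []
    for p in range(N):
        true_part = sum(theta[p] - diff[i] for i in item_ids) / len(item_ids)
        noise = sum(random.gauss(0, 1) for _ in item_ids) / len(item_ids)
        obs.append(true_part + noise)
        err.append(noise)
    return obs, err

form_x = random.sample(range(pool), n)   # two independently sampled forms
form_y = random.sample(range(pool), n)
_, ex = form_scores(form_x)
_, ey = form_scores(form_y)

mx, my = sum(ex) / N, sum(ey) / N
cov = sum((a - mx) * (b - my) for a, b in zip(ex, ey)) / N
sx = (sum((a - mx) ** 2 for a in ex) / N) ** 0.5
sy = (sum((b - my) ** 2 for b in ey) / N) ** 0.5
print(round(cov / (sx * sy), 2))   # correlation of error scores, near 0
```

Because the noise on each administration is drawn independently, the error-score correlation hovers near zero, consistent with the uncorrelated-errors assumption; the two forms' means and variances, by contrast, need not match.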

In order to derive the expected agreement/reliability of test scores on a test form (say, form X), we hypothesize that this form and another hypothesized form (say, form Y) are randomly equivalent test forms with different items but equal size (form X with item set I and form Y with item set J, each containing n items). Let us refer to form X as the reference test form and to form Y as the hypothesized test form. These two forms are then administered to the same sample of examinees of size N.

For a norm-referenced test, where the decision is based on the relative position of examinees among their peer examinees, the agreement function is defined as the expected product, over all examinees, of the relative distances of the observed average scores (X̅_{p} and Y̅_{p}) on two randomly equivalent test forms from the associated mean true score of the items on each form (T_{I} and T_{J}).

A(r) = E_{P,I,J} (X̅_{p}−T_{I})(Y̅_{p}−T_{J}) = (1/n^{2}) E_{P,I,J} ∑_{i}∑_{j}(X_{pi}−T_{i})(Y_{pj}−T_{j})

= E_{P,I,J} (X_{pi}−T_{i})(Y_{pj}−T_{j})

where the expectation is over infinite randomly equivalent test forms of *X* and *Y*, each with an equal number of items sampled from the domain, and over infinite independent random samples of *N* examinees from the population; E_{P,I,J}(X_{pi}−T_{i})(Y_{pj}−T_{j}) is the expected mean pairwise covariance of items on *X* with items on *Y* relative to their individual item mean scores.

For the reference test form *X*, E_{P,I}(X_{pi}−T_{i})(X_{pi’}−T_{i’}) represents the expected mean pairwise covariance of distinct items on *X* (i≠i’) relative to their mean scores. Similarly, let E_{P,J}(Y_{pj}−T_{j})(Y_{pj’}−T_{j’}) have the same definition for items on test form *Y*. Because the test forms are randomly equivalent,

E_{P,I,J}(X_{pi}−T_{i})(Y_{pj}−T_{j}) = E_{P,I}(X_{pi}−T_{i} )(X_{pi’}-T_{i’}) = E_{P,J}(Y_{pj}−T_{j})(Y_{pj’}−T_{j’})

Hence, the expected agreement function, *A(r)* , becomes

A(r) = E_{P,I}(X_{pi}−T_{i})(X_{pi’}−T_{i’}) = (1/(n(n−1))) E_{P,I}∑∑_{i≠i’}(X_{pi}−T_{i})(X_{pi’}−T_{i’})

By simple algebra, *A(r)* becomes

A(r) = (1/(n(n−1))) E_{P,I}[n^{2}(X̅_{p}−T_{I})^{2} − ∑_{i}(X_{pi}−T_{i})^{2}],

where T_{I} = (1/n)∑_{i}T_{i}.
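The omitted algebra here is the standard expansion of a squared sum; a sketch, writing X̅_{p} = (1/n)∑_{i}X_{pi} and assuming each form contains n items:

```latex
\[
\sum_{i \neq i'} (X_{pi}-T_i)(X_{pi'}-T_{i'})
  = \Big[\sum_{i}(X_{pi}-T_i)\Big]^{2} - \sum_{i}(X_{pi}-T_i)^{2}
  = n^{2}\,(\bar{X}_{p}-T_{I})^{2} - \sum_{i}(X_{pi}-T_i)^{2}.
\]
```

Dividing by the number of distinct pairs, n(n−1), and taking expectations over *P* and *I* expresses *A(r)* in terms of (X̅_{p}−T_{I})^{2} and the item-level squared deviations.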

This expected agreement function gives the true score variance for norm-referenced tests, σ^{2}(T_{r}). The maximum expected agreement for norm-referenced testing is,

A_{m}(r) = E_{P,I}(X̅_{p}−T_{I} )(X̅_{p}−T_{I}) = E_{P,I}(X̅_{p}−T_{I})^{2}

The expected agreement for norm-referenced testing due to chance is,

A_{c}(r) = E_{P,Q,I,J}(X̅_{p}−T_{I} )(Y̅_{q}−T_{J} ) = E_{P,I}(X̅_{p}−T_{I}) E_{Q,J }(Y̅_{q}−T_{J})

= E_{P,I}((1/n)∑_{i}X_{pi} − (1/n)∑_{i}T_{i}) E_{Q,J}((1/n)∑_{j}Y_{qj} − (1/n)∑_{j}T_{j}) = 0,

because E_{P} (X_{pi}) = T_{i} and E_{Q} (Y_{qj}) = T_{j}. Hence, the norm-referenced agreement coefficient is,

θ(r) = θ_{c}(r) = A(r)/A_{m}(r) = E_{P,I}(X_{pi}−T_{i})(X_{pi’}−T_{i’}) / E_{P,I}(X̅_{p}−T_{I})^{2},

or θ(r) = θ_{c}(r) = σ^{2}(T_{r}) / [σ^{2}(T_{r}) + σ^{2}(ϵ_{r})].

This coefficient can also be written as,

θ(r) = θ_{c}(r) = (n/(n−1))[1 − E_{P,I}∑_{i}(X_{pi}−T_{i})^{2} / (n^{2} E_{P,I}(X̅_{p}−T_{I})^{2})].

This result suggests that the correction for chance agreement also has no effect on the norm-referenced agreement coefficient.

The expected loss associated with the norm-referenced agreement coefficient is,

L(r) = A_{m}(r) − A(r) = (1/(n(n−1)))[E_{P,I}∑_{i}(X_{pi}−T_{i})^{2} − *n*E_{P,I}(X̅_{p}−T_{I})^{2}]

= (1/(n(n−1))) E_{P,I}∑_{i}((X_{pi}−T_{i}) − (X̅_{p}−T_{I}))^{2},

which equals the appropriate error score variance for norm-referenced tests, σ^{2}(ϵ_{r}).
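The algebraic step linking the two expressions for the loss is the identity ∑_{i}(a_{i}−ā)^{2} = ∑_{i}a_{i}^{2} − n·ā^{2} with a_{i} = X_{pi}−T_{i}; a quick numeric check on hypothetical deviation scores:

```python
import random

random.seed(3)
n = 12
# Hypothetical item-level deviation scores a_i = X_pi - T_i for one examinee.
a = [random.gauss(0, 1) for _ in range(n)]
a_bar = sum(a) / n                     # corresponds to X-bar_p - T_I

lhs = sum((ai - a_bar) ** 2 for ai in a)           # centered form
rhs = sum(ai ** 2 for ai in a) - n * a_bar ** 2    # expanded form
print(abs(lhs - rhs) < 1e-9)  # True
```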

This error score variance is similar to the relative error score variance identified by Brennan and Kane^{2} using generalizability theory. It quantifies the expected squared difference between each examinee’s observed deviation score from the test average score and the deviation of the examinee’s true score from the average score on the domain of items.

##### ESTIMATION

The components of all expressions of the expected agreement/reliability coefficient take the form of expected values of terms over different random sets of items from the domain of items and over different random samples of examinees from the population of examinees. The sample counterparts of these terms can be used to estimate these expected values.

The expected norm-referenced agreement/reliability coefficient can be estimated by collecting data from administering one test form of *n* items to a representative sample of *N* examinees. If we substitute X̅_{p}, T_{i}, and T_{I} by their sample counterparts, x̅_{p} = (1/n)∑_{i}*x*_{pi}, x̅_{i} = (1/N)∑_{p}*x*_{pi}, and x̅ = (1/n)∑_{i}x̅_{i} = (1/N)∑_{p}x̅_{p}, respectively, the estimator of the expected agreement coefficient for norm-referenced tests is,

θ̂(r) = (1/(n(n−1))) ∑∑_{i≠i’}σ_{ii’} / σ^{2}(x̅_{p}) = 1 − (1/(n(n−1)))[∑_{i}σ^{2}(x_{pi}) − nσ^{2}(x̅_{p})] / σ^{2}(x̅_{p})

= (n/(n−1))[1 − ∑_{i}σ^{2}(x_{pi}) / (n^{2}σ^{2}(x̅_{p}))]
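The covariance form of this estimator and the coefficient-alpha form can be checked against each other numerically. The sketch below (simulated common-factor item scores; all names and parameters are illustrative, not from the paper) computes both from biased, divide-by-N sample moments:

```python
import random

random.seed(11)
N, n = 200, 8
# Hypothetical item scores: common factor plus item-specific noise.
theta = [random.gauss(0, 1) for _ in range(N)]
x = [[theta[p] + random.gauss(0, 1) for _ in range(n)] for p in range(N)]

xbar_i = [sum(x[p][i] for p in range(N)) / N for i in range(n)]   # item means
xbar_p = [sum(x[p]) / n for p in range(N)]                        # person means
xbar = sum(xbar_p) / N                                            # grand mean

def cov(i, j):
    """Biased sample covariance of items i and j over examinees."""
    return sum((x[p][i] - xbar_i[i]) * (x[p][j] - xbar_i[j])
               for p in range(N)) / N

var_mean = sum((m - xbar) ** 2 for m in xbar_p) / N   # sigma^2 of person means

# Covariance form: mean pairwise inter-item covariance over variance of means.
theta_cov = sum(cov(i, j) for i in range(n) for j in range(n) if i != j) \
            / (n * (n - 1) * var_mean)

# Coefficient-alpha form on the mean-score metric.
alpha = (n / (n - 1)) * (1 - sum(cov(i, i) for i in range(n))
                         / (n * n * var_mean))

print(abs(theta_cov - alpha) < 1e-9)  # True: the two forms agree
```

The two agree exactly (up to rounding) because n^{2}σ^{2}(x̅_{p}) decomposes into the sum of item variances plus the sum of distinct inter-item covariances.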

The associated loss is,

L̂(r) = (1/(n(n−1)))[∑_{i}σ^{2}(x_{pi}) − *n*σ^{2}(x̅_{p})],

which gives the estimator of the relative error score variance for norm-referenced tests.

In these equations,

σ_{ii’} = (1/N)∑_{p}(*x*_{pi}−*x̅*_{i})(*x*_{pi’}−*x̅*_{i’}),

σ^{2}(*x*_{pi}) = (1/N)∑_{p}(*x*_{pi}−x̅_{i})^{2},

σ^{2}(*x̅*_{p}) = (1/N)∑_{p}(*x̅*_{p}−x̅)^{2}.

##### DISCUSSION AND CONCLUSION

The paper derived the expected agreement coefficient for norm-referenced tests using the classical test theory framework under the assumption of randomly equivalent test forms as a replacement for the conventional equivalent test forms. The estimator of the resulting coefficient proved to be equal to Cronbach’s coefficient alpha,^{8} which was derived under the different assumption of essentially tau-equivalent test forms.

This result supports what Glaser and Nitko^{5} and Popham and Husek^{6} argued: that reliability coefficients in classical test theory, such as coefficient alpha and KR-20, are appropriate for norm-referenced tests. The error score variance associated with coefficient alpha is the relative error score variance, which reflects the difference between an individual examinee’s performance and the performance of his or her peers who took the test.

The estimation of the expected agreement coefficient for norm-referenced tests can use either unbiased or biased estimators of its terms. It can easily be shown that the biased estimators of the terms in the above equations give estimates of the expected agreement coefficient identical to those from the unbiased estimators. The estimates of the error score variance and the true score variance, however, are affected by whether the unbiased or biased sample variances are used (the unbiased estimators are preferred).
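This invariance of the coefficient, and the rescaling of the variance estimates, can be verified directly: because the coefficient is a ratio of variances, the divisor (N for the biased version, N−1 for the unbiased version) cancels, while the loss estimate rescales by N/(N−1). A sketch on simulated data (all names and parameters are illustrative):

```python
import random

random.seed(5)
N, n = 50, 6
# Hypothetical item scores: common factor plus noise.
theta = [random.gauss(0, 1) for _ in range(N)]
x = [[theta[p] + random.gauss(0, 1) for _ in range(n)] for p in range(N)]

xbar_i = [sum(x[p][i] for p in range(N)) / N for i in range(n)]
xbar_p = [sum(x[p]) / n for p in range(N)]
xbar = sum(xbar_p) / N

def estimate(divisor):
    """theta(r) and loss L(r) using a chosen variance divisor (N or N-1)."""
    var_i = [sum((x[p][i] - xbar_i[i]) ** 2 for p in range(N)) / divisor
             for i in range(n)]
    var_mean = sum((m - xbar) ** 2 for m in xbar_p) / divisor
    coef = (n / (n - 1)) * (1 - sum(var_i) / (n * n * var_mean))
    loss = (sum(var_i) - n * var_mean) / (n * (n - 1))
    return coef, loss

coef_b, loss_b = estimate(N)        # biased (divide by N)
coef_u, loss_u = estimate(N - 1)    # unbiased (divide by N - 1)

print(abs(coef_b - coef_u) < 1e-12)                 # coefficient unaffected
print(abs(loss_u - loss_b * N / (N - 1)) < 1e-12)   # variances rescale
```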