Definition:
In statistics, inter-rater reliability, inter-rater agreement,
or concordance is the degree of agreement among raters. It gives a
score of how much homogeneity, or consensus, there is in the
ratings given by judges. It is useful in refining the tools given
to human judges, for example by determining if a particular scale
is appropriate for measuring a particular variable. If various
raters do not agree, either the scale is defective or the raters
need to be re-trained.
There are a number of statistics which can be used to
determine inter-rater reliability. Different statistics are
appropriate for different types of measurement. Some options are:
joint-probability of agreement, Cohen's kappa and the related
Fleiss' kappa, inter-rater correlation, concordance correlation
coefficient and intra-class correlation.
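As a quick, hedged illustration of two of these options, the sketch below assumes Python with scikit-learn and SciPy available; the two rating vectors are invented purely for illustration. It computes Cohen's kappa (treating the ratings as nominal categories) and an inter-rater Pearson correlation (treating them as interval scores) for two raters scoring the same items.

    # Hypothetical ratings of eight items by two raters on a 1-5 scale
    # (illustrative data only, not taken from the text above).
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import pearsonr

    rater_a = [1, 2, 3, 3, 4, 5, 2, 1]
    rater_b = [1, 2, 3, 4, 4, 5, 3, 1]

    # Cohen's kappa: chance-corrected agreement for two raters on nominal categories.
    kappa = cohen_kappa_score(rater_a, rater_b)

    # Inter-rater (Pearson) correlation: treats the ratings as interval-scaled scores.
    r, _p = pearsonr(rater_a, rater_b)

    print(f"Cohen's kappa: {kappa:.3f}")
    print(f"Inter-rater correlation: {r:.3f}")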
Different raters can obtain different measurement results from the same object because of, for example, variations in how they carry out the experiment, interpret the results and, subsequently, present them. All of these stages may be affected by experimenter's bias, that is, a tendency of the rater to deviate towards what is expected. When interpreting and presenting the results, there may also be inter-rater variation in digit preference, that is, raters may differ in whether they round a value down or up.
There are several operational definitions of "inter-rater
reliability" in use by Examination Boards, reflecting different
viewpoints about what is reliable agreement between raters.
There are three operational definitions of agreement:
1. Reliable raters agree with the "official" rating of a performance.
2. Reliable raters agree with each other.
3. Reliable raters agree about which performance is better and which is worse.
These combine with two operational definitions of behavior:
A. Reliable raters are automatons, behaving like "rating machines". This category includes the rating of essays by computer. This behavior can be evaluated by generalizability theory.
B. Reliable raters behave like independent witnesses. They demonstrate their independence by disagreeing slightly. This behavior can be evaluated by the Rasch model.
Joint probability of agreement:
The joint probability of agreement is probably the simplest and least robust measure. It is the number of times each rating (e.g. 1, 2, ... 5) is assigned by each rater divided by the total number of ratings. It assumes that the data are entirely nominal. It does not take into account that agreement may happen solely based on chance. Some question, though, whether there is a need to 'correct' for chance agreement at all, and suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions.
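A minimal sketch of this measure, assuming the common two-rater reading in which it is the proportion of items on which both raters assign the same rating; the ratings are made up for illustration:

    # Hypothetical ratings of ten items by two raters on a 1-5 nominal scale.
    rater_a = [1, 2, 2, 3, 5, 4, 1, 2, 3, 5]
    rater_b = [1, 2, 3, 3, 5, 4, 2, 2, 3, 5]

    # Joint probability of agreement: count the items rated identically
    # and divide by the total number of items rated.
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    joint_probability = agreements / len(rater_a)

    print(f"Joint probability of agreement: {joint_probability:.2f}")  # 0.80 here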
When the number of categories being used is small (e.g. 2 or 3), the likelihood of two raters agreeing by pure chance increases dramatically. This is because both raters must confine themselves to the limited number of options available, which inflates the overall agreement rate without necessarily reflecting their propensity for "intrinsic" agreement (an agreement is "intrinsic" when it is not due to chance). Therefore, the joint probability of agreement will remain high even in the absence of any "intrinsic" agreement among raters. A useful inter-rater reliability coefficient is expected to (1) be close to 0 when there is no "intrinsic" agreement, and (2) increase as the "intrinsic" agreement rate improves. Most chance-corrected agreement coefficients achieve the first objective; however, many known chance-corrected measures do not achieve the second.
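To make the chance-correction point concrete, the sketch below uses simulated data (purely an assumption for illustration): two raters assign one of only two categories independently at random, so any agreement is due to chance. The raw joint agreement comes out near 0.5 simply because there are only two options, while a chance-corrected coefficient such as Cohen's kappa, computed here directly from the observed and expected agreement, stays close to 0.

    import random

    random.seed(0)

    # Two raters assigning one of two categories independently at random:
    # any agreement here is purely by chance.
    categories = ["pass", "fail"]
    rater_a = [random.choice(categories) for _ in range(1000)]
    rater_b = [random.choice(categories) for _ in range(1000)]

    # Observed (joint) probability of agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

    # Expected chance agreement: for each category, the product of the two
    # raters' marginal probabilities of using it, summed over categories.
    p_e = sum(
        (rater_a.count(c) / len(rater_a)) * (rater_b.count(c) / len(rater_b))
        for c in categories
    )

    # Cohen's kappa corrects the observed agreement for chance.
    kappa = (p_o - p_e) / (1 - p_e)

    print(f"Observed agreement: {p_o:.2f}")        # around 0.5 with 2 categories
    print(f"Chance-corrected kappa: {kappa:.2f}")  # close to 0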
(Extracted from Wikipedia, http://en.wikipedia.org/wiki/Inter-rater_agreement, just for learning)
The following one also works:
Interrater reliability is the extent to which two or more
individuals (coders or raters) agree. Interrater reliability
addresses the consistency of the implementation of a rating
system.
A test of interrater reliability would be the following scenario:
Two or more researchers are observing a high school classroom. The
class is discussing a movie that they have just viewed as a group.
The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the students' oral responses. Interrater reliability assesses the consistency of
how the rating system is implemented. For example, if one
researcher gives a "1" to a student response, while another
researcher gives a "5," obviously the interrater reliability would
be inconsistent. Interrater reliability is dependent upon the
ability of two or more individuals to be consistent. Training,
education and monitoring skills can enhance interrater
reliability.
(Extracted from http://writing.colostate.edu/index.cfm)