Inter-rater reliability (评分者间信度)

(2011-08-20 13:37:10)
Tags: rater, reliability, education
Category: DB-BD
Definition:
In statistics, inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement among raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.
There are a number of statistics which can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: joint-probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient and intra-class correlation.
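As a quick aside (this sketch is not part of the Wikipedia extract; the data and function below are invented for illustration), Cohen's kappa for two raters can be computed in a few lines of Python. It compares the observed agreement rate p_o with the agreement p_e expected by chance from each rater's marginal rating frequencies:

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    # kappa = (p_o - p_e) / (1 - p_e)
    n = len(ratings_a)

    # Observed agreement: fraction of items given identical ratings.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the two raters' marginal rating
    # probabilities, summed over all categories.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)

# Two raters scoring ten items on a 1-5 scale (made-up data).
rater1 = [1, 2, 3, 3, 2, 5, 4, 4, 1, 2]
rater2 = [1, 2, 3, 4, 2, 5, 4, 3, 1, 3]
print(cohens_kappa(rater1, rater2))  # about 0.62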
Different raters can disagree about measurement results from the same object because of, for example, variations in how the experiment is carried out, how the results are interpreted and, subsequently, how they are presented. All these stages may be affected by experimenter's bias, that is, a tendency to deviate towards what the rater expects. When interpreting and presenting the results, there may also be inter-rater variation in digit preference, that is, in whether a value is rounded down or rounded up.
There are several operational definitions of "inter-rater reliability" in use by Examination Boards, reflecting different viewpoints about what is reliable agreement between raters.
There are three operational definitions of agreement:
1. Reliable raters agree with the "official" rating of a performance.
2. Reliable raters agree with each other.
3. Reliable raters agree about which performance is better and which is worse.
These combine with two operational definitions of behavior:
A. Reliable raters are automatons, behaving like "rating machines". This category includes rating of essays by computer. This behavior can be evaluated by Generalizability theory.
B. Reliable raters behave like independent witnesses. They demonstrate their independence by disagreeing slightly. This behavior can be evaluated by the Rasch model.
Joint probability of agreement:
The joint probability of agreement is probably the simplest and least robust measure. It is the number of times each rating (e.g. 1, 2, ... 5) is assigned by each rater divided by the total number of ratings. It assumes that the data are entirely nominal, and it does not take into account that agreement may happen solely by chance. Some question, though, whether there is any need to 'correct' for chance agreement, and suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions.
When the number of categories being used is small (e.g. 2 or 3), the likelihood of 2 raters agreeing by pure chance increases dramatically. This is because both raters must confine themselves to the limited number of options available, which raises the overall agreement rate without necessarily reflecting any propensity for "intrinsic" agreement ("intrinsic" agreement meaning agreement that is not due to chance). Therefore, the joint probability of agreement will remain high even in the absence of any "intrinsic" agreement among raters. A useful inter-rater reliability coefficient is expected to (1) be close to 0 when there is no "intrinsic" agreement, and (2) increase as the "intrinsic" agreement rate improves. Most chance-corrected agreement coefficients achieve the first objective; however, many known chance-corrected measures fail to achieve the second.
(Extracted from Wikipedia, http://en.wikipedia.org/wiki/Inter-rater_agreement, just for learning)
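The chance-agreement effect described in the extract is easy to see with a small simulation (my own illustration, not from the source): two raters who assign labels uniformly at random agree on roughly 1/k of the items when there are k categories, so shrinking the number of categories inflates the joint probability of agreement without any "intrinsic" agreement at all.

import random

def chance_agreement(num_categories, num_items=100_000, seed=0):
    # Two raters assign labels uniformly at random; return the
    # observed, purely chance-driven agreement rate.
    rng = random.Random(seed)
    agree = sum(rng.randrange(num_categories) == rng.randrange(num_categories)
                for _ in range(num_items))
    return agree / num_items

print(chance_agreement(5))  # about 0.20 with five categories
print(chance_agreement(2))  # about 0.50 with only two categories

A chance-corrected coefficient such as Cohen's kappa would stay near 0 in both cases, which is exactly the first property asked of a useful coefficient above.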
Actually, I think this explanation is easier to understand: http://www.douban.com/group/topic/6303356/
The one below is also good:

Interrater reliability is the extent to which two or more individuals (coders or raters) agree. Interrater reliability addresses the consistency of the implementation of a rating system.

A test of interrater reliability would be the following scenario: Two or more researchers are observing a high school classroom. The class is discussing a movie that they have just viewed as a group. The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the students' oral responses. Interrater reliability assesses the consistency of how the rating system is implemented. For example, if one researcher gives a "1" to a student response, while another researcher gives a "5," obviously the interrater reliability would be inconsistent. Interrater reliability is dependent upon the ability of two or more individuals to be consistent. Training, education and monitoring skills can enhance interrater reliability.

(Extracted from http://writing.colostate.edu/index.cfm)
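A quick worked version of the classroom scenario above (the scores below are invented): each researcher rates the same eight oral responses on the 1-to-5 scale, and simple agreement statistics make any inconsistency visible.

# Hypothetical scores from two researchers rating eight oral
# responses, 1 = most positive, 5 = most negative.
researcher1 = [1, 2, 2, 3, 1, 4, 2, 1]
researcher2 = [5, 2, 3, 3, 1, 4, 2, 2]

pairs = list(zip(researcher1, researcher2))
exact = sum(a == b for a, b in pairs) / len(pairs)
worst_gap = max(abs(a - b) for a, b in pairs)

print(f"exact agreement: {exact:.2f}")       # 0.62 here
print(f"largest disagreement: {worst_gap}")  # 4 points: a "1" versus a "5"

Gaps like the 1-versus-5 score on the first response are the kind of inconsistency that rater training and monitoring are meant to remove.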

