Reliability and Validity: Reliability
(2018-02-23 00:28:33)
Tags: reliability, test-retest, internal consistency, inter-rater
Category: Research
When we are evaluating a measurement method, we consider two
general dimensions: reliability
and validity.
Reliability
Reliability refers to the consistency of a measure.
Psychologists consider three types of consistency: over time
(test-retest reliability), across items (internal consistency), and
across different researchers (inter-rater reliability).
1. Test-Retest Reliability
When researchers measure a construct that they assume to be
consistent across time, then the scores they obtain should also be
consistent across time. Test-retest reliability is the
extent to which this is actually the case. For example,
intelligence is generally thought to be consistent across time. A
person who is highly intelligent today will be highly intelligent
next week. This means that any good measure of intelligence should
produce roughly the same scores for this individual next week as it
does today. Clearly, a measure that produces highly inconsistent
scores over time cannot be a very good measure of a construct that
is supposed to be consistent.
Assessing test-retest reliability requires using the measure
on a group of people at one time, using it again on the same group
of people at a later time, and then looking at the test-retest
correlation between the two sets of scores. This is typically
done by graphing the data in a scatterplot and computing
Pearson's r.
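As a rough illustration (the scores below are invented, and this assumes each person's two scores can be paired up), the test-retest correlation is simply Pearson's r between the two administrations:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same eight people, tested on two occasions
time1 = np.array([102, 115, 98, 130, 87, 110, 95, 121])
time2 = np.array([100, 118, 95, 127, 90, 112, 99, 119])

# The test-retest reliability is Pearson's r between the two sets of scores
r, p = pearsonr(time1, time2)
print(f"Test-retest correlation: r = {r:+.2f}")
```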
Again, high test-retest correlations make sense when the
construct being measured is assumed to be consistent over time,
which is the case for intelligence, self-esteem, and the Big Five
personality dimensions. But other
constructs are not assumed to be stable over time. The very nature
of mood, for example, is that it changes. So a measure
of mood that produced a low test-retest correlation over a period
of a month would not be a cause for concern.
2. Internal Consistency
A second kind of reliability is internal consistency,
which is the consistency of people's responses across the items on
a multiple-item measure. In general, all the items on such measures
are supposed to reflect the same underlying construct, so people’s
scores on those items should be correlated with each other. On the
Rosenberg Self-Esteem Scale, people who agree that they are a
person of worth should tend to agree that they have a number of
good qualities. If people’s responses to the different items are
not correlated with each other, then it would no longer make sense
to claim that they are all measuring the same underlying construct.
This is as true for behavioral and physiological measures as for
self-report measures. For example, people might make a series of
bets in a simulated game of roulette as a measure of their level of
risk-seeking. This measure would be internally consistent to the
extent that individual participants’ bets were consistently high or
low across trials.
Like test-retest reliability, internal consistency can only
be assessed by collecting and analyzing data. One approach is
to look at a split-half correlation. This involves splitting the
items into two sets, such as the first and second halves of the
items or the even- and odd-numbered items. Then a score is computed
for each set of items, and the relationship between the two sets of
scores is examined. A split-half correlation of +.80 or greater is
generally considered good internal consistency.
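As a minimal sketch (the score matrix is made up, and the even/odd split is only one of many possible ways to halve the items), a split-half correlation can be computed like this:

```python
import numpy as np

def split_half_correlation(scores):
    """Correlate summed scores on odd- vs. even-numbered items.
    `scores` is a (participants x items) array of item responses."""
    odd_total = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_total = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    return np.corrcoef(odd_total, even_total)[0, 1]

# Hypothetical responses of five participants to a six-item scale (1-5 ratings)
scores = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 5],
    [3, 3, 2, 3, 3, 3],
    [1, 2, 1, 2, 1, 2],
])
print(f"Split-half correlation: {split_half_correlation(scores):+.2f}")
```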
Perhaps the most common measure of internal consistency used
by researchers in psychology is a statistic called Cronbach's
α (the Greek letter alpha). Conceptually, α is the mean of
all possible split-half correlations for a set of items. For
example, there are 252 ways to split a set of 10 items into two
sets of five. Cronbach's α would be the mean of the 252 split-half
correlations. Note that this is not how α is actually computed, but
it is a correct way of interpreting the meaning of this statistic.
Again, a value of +.80 or greater is generally taken to indicate
good internal consistency. In practice, α is computed from item and total-score variances rather than by averaging split-half correlations.
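A sketch of that standard variance-based formula, using the same kind of invented score matrix as above, looks like this:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (participants x items) array of item scores,
    using the usual variance-based formula:
        alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)"""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of five participants to a six-item scale (1-5 ratings)
scores = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 5],
    [3, 3, 2, 3, 3, 3],
    [1, 2, 1, 2, 1, 2],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```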
3. Inter-rater Reliability
Inter-rater reliability is the extent to which
different observers are consistent in their judgments. For example,
if you were interested in measuring university students' social
skills, you could make video recordings of them as they interacted
with another student whom they were meeting for the first time. Then
you could have two or more observers watch the videos and rate each
student's level of social skills. To the extent that each
participant does, in fact, have some level of social skills that
can be detected by an attentive observer, different observers'
ratings should be highly correlated with each other.
Inter-rater reliability would also have been measured in Bandura's
Bobo doll study. In this case, the observers' ratings of how many
acts of aggression a particular child committed while playing with
the Bobo doll should have been highly positively correlated.
Inter-rater reliability is often assessed using Cronbach's α when
the judgments are quantitative or an analogous statistic called
Cohen's κ (the Greek letter kappa) when they are categorical.
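As a rough, self-contained sketch (the ratings are invented, and this is the textbook definition rather than any particular library's implementation), Cohen's κ compares the raters' observed agreement with the agreement expected by chance:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgments of the same cases."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    # Observed agreement: proportion of cases the raters label identically
    p_observed = np.mean(rater_a == rater_b)
    # Chance agreement, from each rater's marginal proportions per category
    categories = np.union1d(rater_a, rater_b)
    p_chance = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical: two observers classify ten acts as aggressive or not
rater_a = ["aggressive", "not", "aggressive", "aggressive", "not",
           "not", "aggressive", "not", "aggressive", "not"]
rater_b = ["aggressive", "not", "aggressive", "not", "not",
           "not", "aggressive", "not", "aggressive", "aggressive"]
print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")
```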