K-fold cross-validation splits the sample set into k parts: k-1 parts are used as the training set and the remaining part as the validation set. The validation set is used to measure the error rate of the resulting classifier or regression model. The procedure is repeated k times, until each of the k parts has served as the validation set once.
Cross validation is a model evaluation
method that is better than simply examining residuals. The problem with residual
evaluations is that they do not give an indication of how well the
learner will do when it is asked to make new predictions for
data it has not already seen. One way to overcome
this problem is to not use the entire data set when training
a learner. Some of the data is removed before training
begins. Then when training is done, the data that was
removed can be used to test the performance of the learned model on
``new'' data. This is the basic idea for a whole class of
model evaluation methods called cross validation.
The holdout method is the simplest kind
of cross validation. The data set is separated into two
sets, called the training set and the testing set. The function
approximator fits a function using the training set only.
Then the function approximator is asked to predict the output
values for the data in the testing set (it has never seen
these output values before). The errors it makes are accumulated as
before to give the mean absolute test set error, which is used to
evaluate the model. The advantage of this method is that it is
usually preferable to the residual method and takes no longer to
compute. However, its evaluation can have a high variance. The
evaluation may depend heavily on which data points end up in
the training set and which end up in the test set, and thus the
evaluation may be significantly different depending on how the
division is made.
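
To make the procedure concrete, here is a minimal sketch of the holdout method in Python; the data, the 70/30 split, and the use of a least-squares straight line as the function approximator are illustrative assumptions, not part of the method itself.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a noisy linear relationship (assumed for this sketch).
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)

# Separate the data into a training set and a testing set (70/30 assumed).
idx = rng.permutation(len(x))
n_train = int(0.7 * len(x))
train, test = idx[:n_train], idx[n_train:]

# Fit the function approximator using the training set only
# (here, a least-squares straight line).
slope, intercept = np.polyfit(x[train], y[train], deg=1)

# Predict the never-seen test outputs and accumulate the errors
# into the mean absolute test set error.
pred = slope * x[test] + intercept
print("mean absolute test set error:", np.mean(np.abs(y[test] - pred)))
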
K-fold cross validation is one
way to improve over the holdout method. The data set is
divided into k subsets, and the holdout method is repeated k times.
Each time, one of the k subsets is used as the test set and
the other k-1 subsets are put together to form a training set. Then
the average error across all k trials is computed. The advantage of
this method is that it matters less how the data gets
divided. Every data point gets to be in a test set exactly
once, and gets to be in a training set k-1 times. The
variance of the resulting estimate is reduced as k is increased.
The disadvantage of this method is that the training algorithm has
to be rerun from scratch k times, which means it takes k times as
much computation to make an evaluation. A variant of this method is
to randomly divide the data into a test and training set k
different times. The advantage of doing this is that you can
independently choose how large each test set is and how many trials
you average over.
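
A corresponding sketch of k-fold cross validation, again with an assumed least-squares line as the learner and k = 5 chosen arbitrarily:

import numpy as np

def kfold_mean_abs_error(x, y, k=5, seed=0):
    """Repeat the holdout method k times: each of the k subsets serves as
    the test set exactly once, the other k-1 subsets form the training set,
    and the k test errors are averaged."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        slope, intercept = np.polyfit(x[train], y[train], deg=1)
        errors.append(np.mean(np.abs(y[test] - (slope * x[test] + intercept))))
    return float(np.mean(errors))

The variant mentioned above, which re-draws an independent random train/test split on each trial instead of rotating through fixed folds, only changes how `test` and `train` are chosen inside the loop.
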
Leave-one-out cross validation
is K-fold cross validation taken to its logical extreme, with K
equal to N, the number of data points in the set. That means
that N separate times, the function approximator is trained on all
the data except for one point and a prediction is
made for that point. As before the average error is computed and
used to evaluate the model. The evaluation given by
leave-one-out cross validation error (LOO-XVE) is good, but
at first pass it seems very expensive to compute. Fortunately,
locally weighted learners can make LOO predictions just as easily
as they make regular predictions. That means computing the LOO-XVE
takes no more time than computing the residual error and it is a
much better way to evaluate models. We will see shortly that Vizier
relies heavily on LOO-XVE to choose its metacodes.
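
Leave-one-out is then just the special case k = N; a direct (slow) sketch, without the locally-weighted shortcut mentioned above, looks like this:

import numpy as np

def loo_mean_abs_error(x, y):
    """For each of the N points, train on the other N-1 points, predict the
    held-out point, and average the absolute errors (a least-squares line
    stands in for the function approximator)."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
        errors[i] = abs(y[i] - (slope * x[i] + intercept))
    return float(errors.mean())
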
Figure 26: Cross validation checks how well a model generalizes to new data
(image: http://www.amuhouse.com/blog/attachments/month_0712/j20071212201230.gif)
Fig. 26 shows an example of cross
validation performing better than residual error. The data
set in the top two graphs is a simple underlying function with
significant noise. Cross validation tells us that broad smoothing
is best. The data set in the bottom two graphs is a complex
underlying function with no noise. Cross validation tells us that
very little smoothing is best for this data set.
Now we return to the question of
choosing a good metacode for data set a1.mbl:
File -> Open -> a1.mbl
Edit -> Metacode -> A90:9
Model -> LOOPredict
Edit -> Metacode -> L90:9
Model -> LOOPredict
Edit -> Metacode -> L10:9
Model -> LOOPredict
LOOPredict goes through the entire
data set and makes LOO predictions for each point. At the
bottom of the page it shows the summary statistics including Mean
LOO error, RMS LOO error, and information about the data
point with the largest error. The mean absolute LOO-XVEs for the
three metacodes given above (the same three used to generate the
graphs in fig. 25) are 2.98, 1.23, and 1.80. Those values show
that global linear regression is the best metacode of those three,
which agrees with our intuitive feeling from looking at the plots
in fig. 25. If you repeat the above operation on data set
b1.mbl you'll get the values 4.83, 4.45, and 0.39, which also
agrees with our observations.
What are cross-validation and
bootstrapping?
--------------------------------------------------------------------------------
Cross-validation and bootstrapping are
both methods for estimating
generalization error based on
"resampling" (Weiss and Kulikowski 1991; Efron
and Tibshirani 1993; Hjorth 1994;
Plutowski, Sakata, and White 1994; Shao
and Tu 1995). The resulting estimates
of generalization error are often used
for choosing among various models, such
as different network architectures.
Cross-validation
++++++++++++++++
In k-fold cross-validation, you divide
the data into k subsets of
(approximately) equal size. You train
the net k times, each time leaving
out one of the subsets from
training, but using only the omitted subset to
compute whatever error criterion
interests you. If k equals the sample
size, this is called
"leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of
cross-validation that involves
leaving out all possible subsets of v
cases.
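
The expense of leave-v-out comes from the number of subsets that have to be left out in turn; a quick illustration in Python (the sample sizes are arbitrary):

from math import comb

# One training run is needed for every possible subset of v cases left out.
for n, v in [(20, 1), (20, 2), (20, 5)]:
    print(f"n = {n}, v = {v}: {comb(n, v)} training runs")
# v = 1 gives 20 runs (leave-one-out); v = 5 already requires 15504 runs.
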
Note that cross-validation is quite
different from the "split-sample" or
"hold-out" method that is commonly used
for early stopping in NNs. In the
split-sample method, only a
single subset (the validation set) is used to
estimate the generalization error,
instead of k different subsets; i.e.,
there is no "crossing". While various
people have suggested that
cross-validation be applied to early
stopping, the proper way of doing so is
not obvious.
The distinction between
cross-validation and split-sample validation is
extremely important because
cross-validation is markedly superior for small
data sets; this fact is
demonstrated dramatically by Goutte (1997) in a
reply to Zhu and Rohwer (1996). For an
insightful discussion of the
limitations of cross-validatory choice
among several learning methods, see
Stone (1977).
Jackknifing
+++++++++++
Leave-one-out cross-validation
is also easily confused with jackknifing.
Both involve omitting each training
case in turn and retraining the network
on the remaining subset. But
cross-validation is used to estimate
generalization error, while the
jackknife is used to estimate the bias of a
statistic. In the jackknife, you
compute some statistic of interest in each
subset of the data. The average
of these subset statistics is compared with
the corresponding statistic computed
from the entire sample in order to
estimate the bias of the latter. You
can also get a jackknife estimate of
the standard error of a statistic.
Jackknifing can be used to estimate the
bias of the training error and hence to
estimate the generalization error,
but this process is more complicated
than leave-one-out cross-validation
(Efron, 1982; Ripley, 1996, p. 73).
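
A minimal sketch of the jackknife bias estimate; the statistic (the plug-in variance) and the standard (n-1) scaling are the usual textbook choices rather than anything prescribed above:

import numpy as np

def jackknife_bias(data, statistic):
    """Compare the average of the leave-one-out values of a statistic with
    its full-sample value to estimate the bias of the latter
    (standard (n-1) jackknife scaling)."""
    n = len(data)
    full = statistic(data)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return (n - 1) * (loo.mean() - full)

rng = np.random.default_rng(0)
sample = rng.normal(size=30)
# The plug-in variance (ddof=0) is biased downward; the jackknife detects it.
print("estimated bias:", jackknife_bias(sample, lambda d: np.var(d, ddof=0)))
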
Choice of cross-validation method
+++++++++++++++++++++++++++++++++
Cross-validation can be used simply to
estimate the generalization error of a given model, or it can be
used for model selection by choosing one of several models
that has the smallest estimated generalization error. For example,
you might use cross-validation to choose the number of hidden
units, or you could use cross-validation to choose a subset of the
inputs (subset selection). A subset that contains all relevant
inputs will be called a "good" subset, while the
subset that contains all relevant inputs but no others will be
called the "best" subset. Note that subsets are "good" and "best"
in an asymptotic sense (as the number of training cases goes to
infinity). With a small training set, it is possible that a subset
that is smaller than the "best" subset may provide better
generalization error.
Leave-one-out cross-validation
often works well for estimating generalization error for continuous
error functions such as the mean squared error, but it may perform
poorly for discontinuous error functions such as the number of
misclassified cases. In the latter case, k-fold cross-validation is
preferred. But if k gets too small, the error estimate is
pessimistically biased because of the difference in training-set
size between the full-sample analysis and the cross-validation
analyses. (For model-selection purposes, this bias can actually
help; see the discussion below of Shao, 1993.) A value of 10 for k
is popular for estimating generalization error.
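
As an illustration of model selection by cross-validation, here is a sketch that uses the popular k = 10 to choose the model with the smallest estimated generalization error; polynomial degree stands in for model complexity (e.g., number of hidden units), and the data and squared-error criterion are assumptions of the example:

import numpy as np

def cv_mse(x, y, degree, k=10, seed=0):
    """10-fold cross-validation estimate of mean squared error for a
    polynomial model of the given degree."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], deg=degree)
        errs.append(np.mean((y[test] - np.polyval(coefs, x[test])) ** 2))
    return float(np.mean(errs))

# Pick the candidate model with the smallest estimated generalization error.
rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=200)
y = np.sin(x) + rng.normal(0.0, 0.3, size=200)
best_degree = min(range(1, 8), key=lambda d: cv_mse(x, y, d))
print("degree chosen by 10-fold cross-validation:", best_degree)
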
Leave-one-out cross-validation
can also run into trouble with various
model-selection methods. Again,
one problem is lack of continuity--a small
change in the data can cause a
large change in the model selected (Breiman,
1996). For choosing subsets of inputs
in linear regression, Breiman and
Spector (1992) found 10-fold and 5-fold
cross-validation to work better than
leave-one-out. Kohavi (1995)
also obtained good results for 10-fold
cross-validation with empirical
decision trees (C4.5). Values of k as small
as 5 or even 2 may work even better if
you analyze several different random
k-way splits of the data to
reduce the variability of the cross-validation
estimate.
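
The suggestion above, analyzing several different random k-way splits and averaging, can be sketched as follows; the 2-fold splits, the number of repeats, and the toy data are assumptions of the example:

import numpy as np

def one_random_kway_split_error(x, y, k, seed):
    """Mean absolute error from a single random k-way split of the data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        a, b = np.polyfit(x[train], y[train], deg=1)
        errs.append(np.mean(np.abs(y[test] - (a * x[test] + b))))
    return float(np.mean(errs))

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=60)
y = 2.0 * x + rng.normal(0.0, 1.0, size=60)

# Averaging the 2-fold estimate over several different random splits
# reduces the variability of the cross-validation estimate.
single = [one_random_kway_split_error(x, y, 2, s) for s in range(50)]
averaged = [np.mean([one_random_kway_split_error(x, y, 2, 10 * s + j)
                     for j in range(10)]) for s in range(50)]
print("std of single 2-fold estimates:", np.std(single))
print("std after averaging 10 splits: ", np.std(averaged))
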
Leave-one-out cross-validation
also has more subtle deficiencies for model
selection. Shao (1995) showed that in
linear models, leave-one-out
cross-validation is asymptotically
equivalent to AIC (and Mallows' C_p), but
leave-v-out cross-validation is
asymptotically equivalent to Schwarz's
Bayesian criterion (called SBC or BIC)
when v =
n[1-1/(log(n)-1)], where n is the
number of training cases. SBC
provides consistent subset-selection,
while AIC does not. That is, SBC will
choose the "best" subset with
probability approaching one as the size of the
training set goes to infinity. AIC has
an asymptotic probability of one of
choosing a "good" subset, but less than
one of choosing the "best" subset
(Stone, 1979). Many simulation studies
have also found that AIC overfits
badly in small samples, and that SBC
works well (e.g., Hurvich and Tsai,
1989; Shao and Tu, 1995). Hence, these
results suggest that leave-one-out
cross-validation should overfit in
small samples, but leave-v-out
cross-validation with appropriate v
should do better. However, when true
models have an infinite number of
parameters, SBC is not efficient, and
other criteria that are asymptotically
efficient but not consistent for
model selection may produce better
generalization (Hurvich and Tsai, 1989).
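
To see how large v must be for Shao's equivalence, the formula above can be evaluated numerically (natural logarithm assumed):

from math import log

# v = n * (1 - 1/(log(n) - 1)): the leave-v-out size at which leave-v-out
# cross-validation is asymptotically equivalent to SBC/BIC (Shao, 1995).
for n in (50, 500, 5000, 50000):
    v = n * (1.0 - 1.0 / (log(n) - 1.0))
    print(f"n = {n:6d}: v ~= {v:7.0f}, v/n ~= {v / n:.2f}")
# Note how v/n creeps toward 1 as n grows, in line with Shao (1993) below.
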
Shao (1993) obtained the surprising
result that for selecting subsets of
inputs in a linear regression, the
probability of selecting the "best" does
not converge to 1 (as the sample size n
goes to infinity) for leave-v-out
cross-validation unless the proportion
v/n approaches 1. At first glance,
Shao's result seems inconsistent with
the analysis by Kearns (1997) of
split-sample validation, which shows
that the best generalization is
obtained with v/n strictly between 0
and 1, with little sensitivity to the
precise value of v/n for large
data sets. But the apparent conflict is due
to the fundamentally different
properties of cross-validation and
split-sample validation.
To obtain an intuitive understanding of
Shao (1993), let's review some
background material on generalization
error. Generalization error can be
broken down into three additive parts,
noise variance + estimation variance
+ squared estimation bias. Noise
variance is the same for all subsets of
inputs. Bias is nonzero for subsets
that are not "good", but it's zero for
all "good" subsets, since we are
assuming that the function to be learned is
linear. Hence the generalization error
of "good" subsets will differ only in
the estimation variance. The estimation
variance is (2p/t)s^2 where p
is the number of inputs in the subset,
t is the training set size, and s^2
is the noise variance. The "best"
subset is better than other "good" subsets
only because the "best" subset
has (by definition) the smallest value of p.
But the t in the denominator means that
differences in generalization error
among the "good" subsets will all go to
zero as t goes to infinity.
Therefore it is difficult to guess
which subset is "best" based on the
generalization error even when t is
very large. It is well known that
unbiased estimates of the
generalization error, such as those based on AIC,
FPE, and C_p, do not produce consistent
estimates of the "best" subset
(e.g., see Stone, 1979).
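
A quick numerical illustration of why "good" subsets become hard to tell apart, using the estimation-variance formula quoted above with arbitrary values of p and s^2:

# Estimation variance (2p/t)*s^2 for the "best" subset (say p = 3 inputs)
# versus a larger "good" subset (p = 5), at noise variance s^2 = 1.
p_best, p_good, s2 = 3, 5, 1.0
for t in (50, 500, 5000, 50000):
    gap = (2 * p_good / t) * s2 - (2 * p_best / t) * s2
    print(f"t = {t:6d}: gap in estimation variance = {gap:.6f}")
# The gap shrinks like 1/t, so for large t the "best" subset is only
# negligibly better than the other "good" subsets.
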
In leave-v-out cross-validation, t=n-v.
The differences of the
cross-validation estimates of
generalization error among the "good" subsets
contain a factor 1/t, not 1/n.
Therefore by making t small enough (and
thereby making each regression based on
t cases bad enough), we can make
the differences of the cross-validation
estimates large enough to detect. It
turns out that to make t small enough
to guess the "best" subset
consistently, we have to have t/n go to
0 as n goes to infinity.
The crucial distinction between
cross-validation and split-sample validation
is that with cross-validation, after
guessing the "best" subset, we train
the linear regression model for that
subset using all n cases, but with
split-sample validation, only t
cases are ever used for training. If our
main purpose were really to choose the
"best" subset, I suspect we would
still have to have t/n go to 0 even for
split-sample validation. But
choosing the "best" subset is not the
same thing as getting the best
generalization. If we are more
interested in getting good generalization
than in choosing the "best" subset, we
do not want to make our regression
estimate based on only t cases
as bad as we do in cross-validation, because
in split-sample validation that bad
regression estimate is what we're stuck
with. So there is no conflict between
Shao and Kearns, but there is a
conflict between the two goals of
choosing the "best" subset and getting the
best generalization in split-sample
validation.
Bootstrapping
+++++++++++++
Bootstrapping seems to work better than
cross-validation in many cases
(Efron, 1983). In the simplest form of
bootstrapping, instead of repeatedly
analyzing subsets of the data,
you repeatedly analyze subsamples of the
data. Each subsample is a random
sample with replacement from the full
sample. Depending on what you want to
do, anywhere from 50 to 2000
subsamples might be used. There are
many more sophisticated bootstrap
methods that can be used not
only for estimating generalization error but
also for estimating confidence bounds
for network outputs (Efron and
Tibshirani 1993). For estimating
generalization error in classification
problems, the .632+ bootstrap (an
improvement on the popular .632 bootstrap)
is one of the currently favored
methods that has the advantage of performing
well even when there is severe
overfitting. Use of bootstrapping for NNs is
described in Baxt and White (1995),
Tibshirani (1996), and Masters (1995).
However, the results obtained so far
are not very thorough, and it is known
that bootstrapping does not work well
for some other methodologies such as
empirical decision trees (Breiman,
Friedman, Olshen, and Stone, 1984;
Kohavi, 1995), for which it can be
excessively optimistic.
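
A minimal sketch of the simplest form of bootstrapping described above; the statistic being resampled and the choice of 200 subsamples are illustrative assumptions:

import numpy as np

def bootstrap(data, statistic, n_subsamples=200, seed=0):
    """Repeatedly draw subsamples of the same size as the full sample,
    with replacement, and recompute the statistic on each one; the spread
    of the recomputed values estimates the variability of the statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    values = np.array([statistic(rng.choice(data, size=n, replace=True))
                       for _ in range(n_subsamples)])
    return values.mean(), values.std()

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=2.0, size=100)
boot_mean, boot_se = bootstrap(sample, np.mean)
print(f"bootstrap mean = {boot_mean:.2f}, bootstrap standard error = {boot_se:.2f}")
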
For further information
+++++++++++++++++++++++
Cross-validation and bootstrapping
become considerably more complicated for
time series data; see Hjorth
(1994) and Snijders (1988).
More information on jackknife and
bootstrap confidence intervals is
available at ftp://ftp.sas.com/pub/neural/jackboot.sas
(this is a plain-text
file).
References:

Baxt, W.G. and White, H. (1995), "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction," Neural Computation, 7, 624-638.

Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation in regression: The X-random case," International Statistical Review, 60, 291-319.

Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical Implications, Proceedings of a workshop held in Groningen, The Netherlands, September 25-26, 1986, Berlin: Springer-Verlag.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: SIAM.

Efron, B. (1983), "Estimating the error rate of a prediction rule: Improvement on cross-validation," J. of the American Statistical Association, 78, 316-331.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.

Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. of the American Statistical Association, 92, 548-560.

Goutte, C. (1997), "Note on free lunches and cross-validation," Neural Computation, 9, 1211-1215, ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz.

Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.

Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model selection in small samples," Biometrika, 76, 297-307.

Kearns, M. (1997), "A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split," Neural Computation, 9, 1143-1161.

Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI), pp. ?, http://robotics.stanford.edu/users/ronnyk/

Masters, T. (1995), Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0.

Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.), Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann, pp. 391-398.

Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

Shao, J. (1993), "Linear model selection by cross-validation," J. of the American Statistical Association, 88, 486-494.

Shao, J. (1995), "An asymptotic theory for linear model selection," Statistica Sinica ?.

Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag.

Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in time series," in Dijkstra (1988), pp. 56-69.

Stone, M. (1977), "Asymptotics for and against cross-validation," Biometrika, 64, 29-35.

Stone, M. (1979), "Comments on model selection criteria of Akaike and Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278.

Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.

Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.

Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation," Neural Computation, 8, 1421-1426.