
K-fold cross-validation

(2010-07-06 11:28:53)
Tags: Miscellaneous

   K-fold cross-validation splits the sample set into k parts: k-1 parts are used as the training set and the remaining part as the validation set. The validation set is used to measure the error rate of the resulting classifier or regression model. The procedure is usually repeated k times, so that each of the k parts serves as the validation set exactly once.

Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen. One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed before training begins. Then when training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross validation.

The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to compute. However, its evaluation can have a high variance. The evaluation may depend heavily on which data points end up in the training set and which end up in the test set, and thus the evaluation may be significantly different depending on how the division is made.
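As a concrete illustration, here is a minimal sketch of the holdout method in plain Python/NumPy. Everything concrete in it (the noisy sine data, the 70/30 split, the cubic polynomial as the "function approximator") is an assumption made for the example, not something taken from the text above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: a noisy sine curve (illustrative only).
    X = rng.uniform(0, 2 * np.pi, size=100)
    y = np.sin(X) + rng.normal(scale=0.2, size=X.shape)

    # Holdout split: 70% training, 30% testing (arbitrary ratio).
    idx = rng.permutation(len(X))
    n_train = int(0.7 * len(X))
    train, test = idx[:n_train], idx[n_train:]

    # Fit on the training set only, then score on the held-out set.
    coefs = np.polyfit(X[train], y[train], deg=3)
    pred = np.polyval(coefs, X[test])
    print("holdout mean absolute test error:", np.mean(np.abs(pred - y[test])))

Rerunning this with a different permutation can give a noticeably different error, which is exactly the high-variance problem described above.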

K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.
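The same idea written out by hand, so the fold bookkeeping is visible (a sketch only; k=5 and the cubic model are illustrative assumptions, and in practice a library routine such as scikit-learn's KFold would typically handle the splitting):

    import numpy as np

    def k_fold_mae(X, y, k=5, deg=3, seed=0):
        """Mean absolute error averaged over k folds."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), k)   # k roughly equal subsets
        errors = []
        for i in range(k):
            test = folds[i]                                   # one fold held out
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            coefs = np.polyfit(X[train], y[train], deg)
            pred = np.polyval(coefs, X[test])
            errors.append(np.mean(np.abs(pred - y[test])))
        return np.mean(errors)                                # each point tested exactly once

Called on the toy data from the holdout sketch above, k_fold_mae(X, y) averages five holdout-style evaluations instead of relying on a single split.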

Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute. Fortunately, locally weighted learners can make LOO predictions just as easily as they make regular predictions. That means computing the LOO-XVE takes no more time than computing the residual error and it is a much better way to evaluate models. We will see shortly that Vizier relies heavily on LOO-XVE to choose its metacodes.
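Leave-one-out is just the k-fold sketch with k equal to the number of data points; written directly (and naively, refitting N separate times, which is exactly the cost the text says locally weighted learners can avoid):

    import numpy as np

    def loo_mae(X, y, deg=3):
        """Leave-one-out cross-validation: refit N times, each time
        predicting the single held-out point."""
        errors = []
        for i in range(len(X)):
            train = np.delete(np.arange(len(X)), i)   # every point except i
            coefs = np.polyfit(X[train], y[train], deg)
            pred = np.polyval(coefs, X[i:i+1])
            errors.append(abs(pred[0] - y[i]))
        return np.mean(errors)                        # mean absolute LOO error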

Figure 26: Cross validation checks how well a model generalizes to new data
(image: http://www.amuhouse.com/blog/attachments/month_0712/j20071212201230.gif)

Fig. 26 shows an example of cross validation performing better than residual error. The data set in the top two graphs is a simple underlying function with significant noise. Cross validation tells us that broad smoothing is best. The data set in the bottom two graphs is a complex underlying function with no noise. Cross validation tells us that very little smoothing is best for this data set.

Now we return to the question of choosing a good metacode for data set a1.mbl:

File -> Open -> a1.mbl

Edit -> Metacode -> A90:9

Model -> LOOPredict

Edit -> Metacode -> L90:9

Model -> LOOPredict

Edit -> Metacode -> L10:9

Model -> LOOPredict

LOOPredict goes through the entire data set and makes LOO predictions for each point. At the bottom of the page it shows the summary statistics, including Mean LOO error, RMS LOO error, and information about the data point with the largest error. The mean absolute LOO-XVEs for the three metacodes given above (the same three used to generate the graphs in fig. 25) are 2.98, 1.23, and 1.80. Those values show that global linear regression is the best metacode of those three, which agrees with our intuitive feeling from looking at the plots in fig. 25. If you repeat the above operation on data set b1.mbl you'll get the values 4.83, 4.45, and 0.39, which also agrees with our observations.

 

What are cross-validation and bootstrapping?

--------------------------------------------------------------------------------

Cross-validation and bootstrapping are both methods for estimating generalization error based on "resampling" (Weiss and Kulikowski 1991; Efron and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shao and Tu 1995). The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.

Cross-validation

++++++++++++++++

In k-fold cross-validation, you divide the data into k subsets of (approximately) equal size. You train the net k times, each time leaving out one of the subsets from training, but using only the omitted subset to compute whatever error criterion interests you. If k equals the sample size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases.
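To see why exhaustive leave-v-out is called expensive: it requires one training run per subset of v cases, and the number of such subsets, C(n, v), grows combinatorially. A quick check (sample sizes chosen arbitrarily for illustration):

    from math import comb

    # Number of training runs required by exhaustive leave-v-out.
    for n, v in [(20, 1), (20, 2), (20, 5), (100, 2), (100, 5)]:
        print(f"n={n:3d}, v={v}: {comb(n, v):,} fits")

Already at n=20, v=5 this is 15,504 fits, and n=100, v=5 needs roughly 75 million.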

Note that cross-validation is quite different from the "split-sample" or "hold-out" method that is commonly used for early stopping in NNs. In the split-sample method, only a single subset (the validation set) is used to estimate the generalization error, instead of k different subsets; i.e., there is no "crossing". While various people have suggested that cross-validation be applied to early stopping, the proper way of doing so is not obvious.

The distinction between cross-validation and split-sample validation is extremely important because cross-validation is markedly superior for small data sets; this fact is demonstrated dramatically by Goutte (1997) in a reply to Zhu and Rohwer (1996). For an insightful discussion of the limitations of cross-validatory choice among several learning methods, see Stone (1977).

Jackknifing

+++++++++++

Leave-one-out cross-validation is also easily confused with jackknifing. Both involve omitting each training case in turn and retraining the network on the remaining subset. But cross-validation is used to estimate generalization error, while the jackknife is used to estimate the bias of a statistic. In the jackknife, you compute some statistic of interest in each subset of the data. The average of these subset statistics is compared with the corresponding statistic computed from the entire sample in order to estimate the bias of the latter. You can also get a jackknife estimate of the standard error of a statistic. Jackknifing can be used to estimate the bias of the training error and hence to estimate the generalization error, but this process is more complicated than leave-one-out cross-validation (Efron, 1982; Ripley, 1996, p. 73).

Choice of cross-validation method

+++++++++++++++++++++++++++++++++

Cross-validation can be used simply to estimate the generalization error of a given model, or it can be used for model selection by choosing one of several models that has the smallest estimated generalization error. For example, you might use cross-validation to choose the number of hidden units, or you could use cross-validation to choose a subset of the inputs (subset selection). A subset that contains all relevant inputs will be called a "good" subset, while the subset that contains all relevant inputs but no others will be called the "best" subset. Note that subsets are "good" and "best" in an asymptotic sense (as the number of training cases goes to infinity). With a small training set, it is possible that a subset that is smaller than the "best" subset may provide better generalization error.

Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as the number of misclassified cases. In the latter case, k-fold cross-validation is preferred. But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analyses. (For model-selection purposes, this bias can actually help; see the discussion below of Shao, 1993.) A value of 10 for k is popular for estimating generalization error.

Leave-one-out cross-validation can also run into trouble with various model-selection methods. Again, one problem is lack of continuity--a small change in the data can cause a large change in the model selected (Breiman, 1996). For choosing subsets of inputs in linear regression, Breiman and Spector (1992) found 10-fold and 5-fold cross-validation to work better than leave-one-out. Kohavi (1995) also obtained good results for 10-fold cross-validation with empirical decision trees (C4.5). Values of k as small as 5 or even 2 may work even better if you analyze several different random k-way splits of the data to reduce the variability of the cross-validation estimate.
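The "several different random k-way splits" suggestion in the last sentence amounts to repeated k-fold cross-validation: rerun the split with different seeds and average the estimates. A sketch, reusing the hypothetical k_fold_mae helper defined earlier in this post:

    import numpy as np

    def repeated_k_fold_mae(X, y, k=5, repeats=10):
        """Average the k-fold estimate over several random k-way splits."""
        estimates = [k_fold_mae(X, y, k=k, seed=r) for r in range(repeats)]
        return np.mean(estimates), np.std(estimates)   # estimate and its spread

Averaging over repeats damps the variability that comes from any single random division of the data.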

Leave-one-out cross-validation also has more subtle deficiencies for model selection. Shao (1995) showed that in linear models, leave-one-out cross-validation is asymptotically equivalent to AIC (and Mallows' C_p), but leave-v-out cross-validation is asymptotically equivalent to Schwarz's Bayesian criterion (called SBC or BIC) when v = n[1-1/(log(n)-1)], where n is the number of training cases. SBC provides consistent subset-selection, while AIC does not. That is, SBC will choose the "best" subset with probability approaching one as the size of the training set goes to infinity. AIC has an asymptotic probability of one of choosing a "good" subset, but less than one of choosing the "best" subset (Stone, 1979). Many simulation studies have also found that AIC overfits badly in small samples, and that SBC works well (e.g., Hurvich and Tsai, 1989; Shao and Tu, 1995). Hence, these results suggest that leave-one-out cross-validation should overfit in small samples, but leave-v-out cross-validation with appropriate v should do better. However, when true models have an infinite number of parameters, SBC is not efficient, and other criteria that are asymptotically efficient but not consistent for model selection may produce better generalization (Hurvich and Tsai, 1989).
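To get a feel for the v = n[1-1/(log(n)-1)] prescription, one can tabulate it for a few sample sizes; the snippet below assumes log is the natural logarithm, which is my reading rather than something stated in the text. Note how v/n creeps toward 1 as n grows, which connects to Shao's (1993) result discussed next.

    import math

    for n in [50, 200, 1000, 10000, 100000]:
        v = n * (1 - 1 / (math.log(n) - 1))   # natural log assumed
        print(f"n={n:6d}: v = {v:8.0f}, v/n = {v/n:.3f}")

For n = 1000 this gives v of roughly 831, i.e. about 83% of the cases are left out of each training set.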

Shao (1993) obtained the surprising result that for selecting subsets of inputs in a linear regression, the probability of selecting the "best" does not converge to 1 (as the sample size n goes to infinity) for leave-v-out cross-validation unless the proportion v/n approaches 1. At first glance, Shao's result seems inconsistent with the analysis by Kearns (1997) of split-sample validation, which shows that the best generalization is obtained with v/n strictly between 0 and 1, with little sensitivity to the precise value of v/n for large data sets. But the apparent conflict is due to the fundamentally different properties of cross-validation and split-sample validation.

To obtain an intuitive understanding of Shao (1993), let's review some background material on generalization error. Generalization error can be broken down into three additive parts: noise variance + estimation variance + squared estimation bias. Noise variance is the same for all subsets of inputs. Bias is nonzero for subsets that are not "good", but it's zero for all "good" subsets, since we are assuming that the function to be learned is linear. Hence the generalization error of "good" subsets will differ only in the estimation variance. The estimation variance is (2p/t)s^2, where p is the number of inputs in the subset, t is the training set size, and s^2 is the noise variance. The "best" subset is better than other "good" subsets only because the "best" subset has (by definition) the smallest value of p. But the t in the denominator means that differences in generalization error among the "good" subsets will all go to zero as t goes to infinity. Therefore it is difficult to guess which subset is "best" based on the generalization error even when t is very large. It is well known that unbiased estimates of the generalization error, such as those based on AIC, FPE, and C_p, do not produce consistent estimates of the "best" subset (e.g., see Stone, 1979).
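A tiny numeric illustration of why the (2p/t)s^2 term makes the "best" subset hard to identify for large t; the particular values p=3 versus p=5 and s^2=1 are made up for the example:

    # Estimation variance (2p/t)*s^2 for the "best" subset (p=3) versus a
    # larger "good" subset (p=5), at increasing training-set sizes t.
    s2 = 1.0
    for t in [20, 200, 2000, 20000]:
        best, good = 2 * 3 / t * s2, 2 * 5 / t * s2
        print(f"t={t:6d}: best={best:.5f}  good={good:.5f}  gap={good - best:.5f}")

The gap shrinks like 1/t, so the two subsets become practically indistinguishable from their generalization errors alone.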

In leave-v-out cross-validation, t=n-v. The differences of the cross-validation estimates of generalization error among the "good" subsets contain a factor 1/t, not 1/n. Therefore by making t small enough (and thereby making each regression based on t cases bad enough), we can make the differences of the cross-validation estimates large enough to detect. It turns out that to make t small enough to guess the "best" subset consistently, we have to have t/n go to 0 as n goes to infinity.

The crucial distinction between cross-validation and split-sample validation is that with cross-validation, after guessing the "best" subset, we train the linear regression model for that subset using all n cases, but with split-sample validation, only t cases are ever used for training. If our main purpose were really to choose the "best" subset, I suspect we would still have to have t/n go to 0 even for split-sample validation. But choosing the "best" subset is not the same thing as getting the best generalization. If we are more interested in getting good generalization than in choosing the "best" subset, we do not want to make our regression estimate based on only t cases as bad as we do in cross-validation, because in split-sample validation that bad regression estimate is what we're stuck with. So there is no conflict between Shao and Kearns, but there is a conflict between the two goals of choosing the "best" subset and getting the best generalization in split-sample validation.

Bootstrapping

+++++++++++++

Bootstrapping seems to work better than cross-validation in many cases (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, you repeatedly analyze subsamples of the data. Each subsample is a random sample with replacement from the full sample. Depending on what you want to do, anywhere from 50 to 2000 subsamples might be used. There are many more sophisticated bootstrap methods that can be used not only for estimating generalization error but also for estimating confidence bounds for network outputs (Efron and Tibshirani 1993). For estimating generalization error in classification problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap) is one of the currently favored methods that has the advantage of performing well even when there is severe overfitting. Use of bootstrapping for NNs is described in Baxt and White (1995), Tibshirani (1996), and Masters (1995). However, the results obtained so far are not very thorough, and it is known that bootstrapping does not work well for some other methodologies such as empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984; Kohavi, 1995), for which it can be excessively optimistic.
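A minimal sketch of the simplest bootstrap described above: draw resamples with replacement, fit on each, and score each fit on the cases that the resample happened to leave out. This is only the basic idea, not the .632 or .632+ weighting mentioned in the text, and the polynomial model and 200 resamples are arbitrary choices for the example.

    import numpy as np

    def bootstrap_error(X, y, n_boot=200, deg=3, seed=0):
        """Out-of-bag error averaged over bootstrap resamples."""
        rng = np.random.default_rng(seed)
        n, errs = len(X), []
        for _ in range(n_boot):
            boot = rng.integers(0, n, size=n)          # sample with replacement
            oob = np.setdiff1d(np.arange(n), boot)     # cases missing from the resample
            if len(oob) == 0:
                continue
            coefs = np.polyfit(X[boot], y[boot], deg)
            pred = np.polyval(coefs, X[oob])
            errs.append(np.mean(np.abs(pred - y[oob])))
        return np.mean(errs)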

For further information

+++++++++++++++++++++++

Cross-validation and bootstrapping become considerably more complicated for time series data; see Hjorth (1994) and Snijders (1988).

More information on jackknife and bootstrap confidence intervals is available at ftp://ftp.sas.com/pub/neural/jackboot.sas (this is a plain-text file).

References: 

   Baxt, W.G. and White, H. (1995), "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction," Neural Computation, 7, 624-638.

   Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation in regression: The X-random case," International Statistical Review, 60, 291-319.

   Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical Implications, Proceedings of a workshop held in Groningen, The Netherlands, September 25-26, 1986, Berlin: Springer-Verlag.

   Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: SIAM.

   Efron, B. (1983), "Estimating the error rate of a prediction rule: Improvement on cross-validation," J. of the American Statistical Association, 78, 316-331.

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.

   Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. of the American Statistical Association, 92, 548-560.

   Goutte, C. (1997), "Note on free lunches and cross-validation," Neural Computation, 9, 1211-1215, ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz.

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.

   Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model selection in small samples," Biometrika, 76, 297-307.

   Kearns, M. (1997), "A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split," Neural Computation, 9, 1143-1161.

   Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI), pp. ?, http://robotics.stanford.edu/users/ronnyk/

   Masters, T. (1995), Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0.

   Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.), Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann, pp. 391-398.

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

   Shao, J. (1993), "Linear model selection by cross-validation," J. of the American Statistical Association, 88, 486-494.

   Shao, J. (1995), "An asymptotic theory for linear model selection," Statistica Sinica ?.

   Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag.

   Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in time series," in Dijkstra (1988), pp. 56-69.

   Stone, M. (1977), "Asymptotics for and against cross-validation," Biometrika, 64, 29-35.

   Stone, M. (1979), "Comments on model selection criteria of Akaike and Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278.

   Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.

   Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.

   Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation," Neural Computation, 8, 1421-1426.
