Machine Learning in Python (Scikit-learn)_清南听稿

http://blog.sina.com.cn/u/2104861151

Machine Learning in Python (Scikit-learn)

(2014-07-29 15:54:41)


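1. Small Talk

Machine learning (ML), natural language processing (NLP), and the like have been extremely hot lately. I don't know whether the field will stay this hot a few years from now, when everyone is doing ML; someone will have to keep adding fuel to the fire. I am just a lost little page boy with limited knowledge, so I hope the ML experts out there will give me plenty of pointers :).
Recently I wanted to sort through the existing ML tools systematically, and the best one I found appears to be this: http://scikit-learn.org/stable/index.html.

It is a good choice for ML researchers at the beginner and intermediate stages. The one shortcoming is that it lacks support for large-scale ML, since it runs on a single machine. Later I may write something about ADMM (there were a great many related papers at ICML this year); I once gave a tutorial on it in the Learning Group at MSRA.

Its user guide in particular wastes no words and gets straight to the key points: http://scikit-learn.org/stable/user_guide.html

Don't expect anything new from this toolkit; it only contains the classics. But if you truly master them, you are basically God Like! :) Especially when you start a company around ML, you may really end up using one or two of these ideas. In other words, the way of thinking you were trained in is probably what you keep from university; the rest goes to the dogs.

Let's take a broad tour of what this ML toolkit offers. There is a lot of material, so I will update gradually; if you want to know about a specific part, leave a comment, since I really can't cover everything in detail at once. (I will basically guarantee one small section per week. First I work out and explain the principle behind a model, then try it on suitable data, post the code and plots, and finally use the test results to compare the models.) I will post all the code to CSDN or GitHub later.

--------------------------------------------------- Divider ---------------------------------------------------------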

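2. Setup

Recommended learning setup: Python 2.7, the PyCharm IDE (a good Python IDE that I recommend trying; if you have written Java in Eclipse, you will pick it up quickly), NumPy, and SciPy. A few other packages also need to be downloaded; leave a comment if you run into problems while setting up. I suggest just doing this on Windows; I basically never use Linux.

Some readers suggested that I also explain the Windows setup in detail. Indeed, this chain of configuration is really not that simple, so I found a bare machine running Windows 7 Ultimate SP1 x64 to reproduce the whole process.

First comes Python 2.7. (Remember that Python 3.x and 2.x are entirely different, and 3.x is not backward compatible with 2.x. So if you installed Python 3.x just to have the higher version number, all I can say is... well... you've been had. The simplest way to see it: treat the two versions as two different programming languages; even the print syntax differs.)

1. The Python 2.7.x interpreter for x64 Windows. Download it from https://www.python.org/download/releases/2.7.8/ and note that the 64-bit build is the one labeled Windows X86-64 MSI Installer (2.7.8).

To test whether Python is set up in your environment, type python at the command line. If that gives an error, you need to configure the environment manually, which a quick web search will solve (in short, add your Python installation folder to the PATH environment variable).

2. Next, install PyCharm. I used this IDE during my internship at Hulu, on Brother Tao's recommendation, and it is quite good. Because the full version is paid, I recommend downloading the Community edition: http://www.jetbrains.com/pycharm/download/. After installation it should ask you to select the Python interpreter you just installed, and then you can do some simple Python programming. Anyone who has used Eclipse will get up to speed very quickly.

3. Next, set up the Python extension packages that sklearn depends on. An unofficial site hosted at a university in California posts prebuilt Windows installers for all of them: http://www.lfd.uci.edu/~gohlke/pythonlibs/. It is extremely handy; download the installers built for Python 2.7 amd64 and run them. The packages to download, in dependency order: Numpy-MKL, SciPy, Scikit-learn. With these you can start learning the Scikit-learn toolkit.

4. In addition, if you want to draw plots as I do, you need matplotlib, which also has an online manual at http://matplotlib.org/contents.html and likewise depends on several packages. matplotlib requires the following libraries: numpy, dateutil, pytz, pyparsing, six. All of them are available from the download site I just recommended.

Once all of the above is done, you can test the setup with my first linear regression code (the code in bold), which displays a plot directly and finally saves it as a PNG image.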
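Before running a full example, it can help to confirm that every package in the lists above is importable. The following is only a minimal sketch under my own assumptions: the helper name check_packages and the REQUIRED list are hypothetical, and the import names (sklearn for Scikit-learn, dateutil for python-dateutil) are the usual ones rather than anything stated in this post.

```python
import importlib

# Import names of the packages listed above (assumed: "sklearn" is the
# import name of Scikit-learn, "dateutil" that of python-dateutil).
REQUIRED = ["numpy", "scipy", "sklearn", "matplotlib",
            "dateutil", "pytz", "pyparsing", "six"]


def check_packages(names):
    """Return a dict mapping each name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # Any failure while importing counts as "not usable".
            status[name] = False
    return status
```

Calling check_packages(REQUIRED) and printing the result shows at a glance which installer is still missing.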

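------------------------------ Divider ------------------------------------------

3. Data

Before using the tools, let me introduce a few datasets that I will use.

These datasets all live in sklearn.datasets. (If you are new to Python you need a little Python knowledge: as in Java, you must import a ready-made module before using it. The full name of this Python machine learning toolkit is sklearn, and since we are looking at data here, the next level down is datasets.) The Python code: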

import sklearn.datasets

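Dataset 1: Boston house prices (suited to regression), referred to below as boston
This one line of code loads it:

boston = sklearn.datasets.load_boston()

To see the dataset description, use this code:

print(boston.DESCR)

Output: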

Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's

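There are 506 samples in total, each with 13 features.

For example, the first feature is the crime rate, the sixth is the average number of rooms per dwelling, and so on.

boston.data gives the 506 * 13 feature matrix

boston.target gives the corresponding 506 * 1 prices

Dataset 2: Iris flowers (suited to simple classification), referred to as iris

import sklearn.datasets

iris = sklearn.datasets.load_iris()

iris.data gives the features

iris.target gives the corresponding classes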

Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

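Practically every ML beginner knows this dataset: three species of iris in total. The features and the corresponding class labels are obtained the same way as above.

150 samples in total, 3 classes, 4 feature dimensions.

Dataset 3: diabetes (a regression problem), diabetes

Dataset 4: handwritten digit recognition (multiclass classification with 10 classes, 0 through 9), digits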

import sklearn.datasets

digits = sklearn.datasets.load_digits()

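Total sample count: 1797, about 180 samples per class. Each handwritten digit is an 8*8 image, and each pixel is an integer value from 0 to 16.

To sum up, you can load these datasets and play with them; they are fairly representative. Later I will show how to use sklearn to download larger datasets, such as the large-scale MNIST handwritten digit database.

In short, the features are in *.data, and the corresponding classes or regression values are in *.target.

Talk without practice won't do, so for every method I introduce I will run a real test on one of the datasets above and, where appropriate, give results and plots.

------------------------------ Divider ------------------------------------------

4. Hands-on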

1. Supervised learning

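Supervised learning is the most commonly used kind: classification, and regression for prediction (predicting stocks and the like, although that does not work too well in our great Celestial Empire).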

1.1. Generalized Linear Models

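The most general linear model

http://scikit-learn.org/stable/_images/math/dfdf17e3ecd9ca5506b2fbf5a7ebd70412326e81.png

Multiply each feature x by its corresponding weight w and add up the results, so that the total comes as close as possible to your target y; what the machine learns is w.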
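The weighted sum can be written in a few lines of plain Python. This is only an illustrative sketch: the function name predict, the explicit bias term, and all the numbers are made up, and no weights are learned here.

```python
def predict(weights, bias, features):
    """Return bias + sum_i weights[i] * features[i] for one sample."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical weights and one made-up sample, for illustration only.
weights = [2.0, -1.0, 0.5]
bias = 1.0
sample = [3.0, 4.0, 2.0]

y_hat = predict(weights, bias, sample)
print(y_hat)  # 1.0 + 6.0 - 4.0 + 1.0 = 4.0
```

Training a linear model means searching for the weights (and bias) that make such predictions land close to the known targets.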

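This model is the most widely applied. It is really just how people weigh all sorts of factors and then give an overall score.

1.1.1. Ordinary Least Squares (the least-squares criterion)

The objective function is this: http://scikit-learn.org/stable/_images/math/32028e85feb455d07503a027ba607eafc7909976.png

The goal is to minimize the total sum of squared errors.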
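To make "minimize the sum of squares" concrete, here is a minimal sketch for the simplest case of a single feature, where the minimizer has a well-known closed form. The function names and the data points are hypothetical, and this is not how LinearRegression is implemented internally.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w0 + w1 * x (one feature)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def squared_error(w0, w1, xs, ys):
    """Sum of squared differences between predictions and targets."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys))


# Made-up, slightly noisy points near the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

w0, w1 = fit_line(xs, ys)
best = squared_error(w0, w1, xs, ys)
print("w0 = %.3f, w1 = %.3f, squared error = %.4f" % (w0, w1, best))
```

Nudging either w0 or w1 away from the fitted values can only increase the squared error, which is exactly what the objective asks for.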

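For the code, import sklearn.linear_model; sklearn.linear_model.LinearRegression() is the module for this. It is fine for something simple like house price estimation (don't call it prediction, that would be inaccurate; you can only estimate, say, a rental price. Grab some data from SouFun, which already has ready-made features such as location, floor area, orientation and so on, and regress a rough price for fun).

Let's use the Boston house prices to make predictions (pay attention to the indentation in all the Python code that follows, since Python depends on it):

# Draw (X, predict_targets and test_targets are built the same way as in the Lasso listing of section 1.1.3)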
matplotlib.pyplot.plot(X, predict_targets, 'r--', label = 'Predict Price')
matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
legend = matplotlib.pyplot.legend()
matplotlib.pyplot.title("Ordinary Least Squares (Boston)")
matplotlib.pyplot.ylabel("Price")
matplotlib.pyplot.savefig("Ordinary Least Squares (Boston).png", format='png')
matplotlib.pyplot.show()

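Result:

Ordinary Least Squares (Boston) Error: 3.35. On average, each prediction is off from the true price by about 3,350 US dollars; the unit of this value is 1000 U.S.D. (see the dataset description).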
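The error reported in these experiments is the L1 norm of the difference between predictions and true values, divided by the number of test samples, that is, the mean absolute error. A small stdlib-only sketch with made-up prices (not values from the Boston data) shows the arithmetic:

```python
def mean_absolute_error(predicted, actual):
    """L1 norm of (predicted - actual) divided by the number of samples."""
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total / float(len(actual))


# Made-up prices in units of $1000, for illustration only.
predicted = [24.0, 30.5, 18.0, 21.5]
actual = [22.0, 33.0, 17.0, 25.0]

error = mean_absolute_error(predicted, actual)
print("Error: %.2f (about %d US dollars)" % (error, round(error * 1000)))
```

Because the target is expressed in thousands of dollars, an error of 3.35 reads as roughly 3,350 dollars per prediction.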

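The figure below compares the predicted and actual price curves. Here a random 50% of the samples was used for training and the other 50% for prediction. The result is decent, so this linear model looks acceptable.

http://www.changweibo.com/ueditor/php/upload/20140729/14066218042544.jpg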

1.1.2. Ridge Regression

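In Chinese this is usually called 岭回归 (ridge regression). It adds a regularization term to the objective function above; ridge regression uses the 2-norm (L2 norm). http://scikit-learn.org/stable/_images/math/11f0787a645f4b5f2b810c0d00618785b58ff574.png

The purpose of this norm is to keep all of the learned weights fairly balanced. We cannot guarantee that our data is well behaved; sometimes nearly collinear noisy samples aggravate the unbalanced learning of the weight coefficients, and you end up with something like this:

http://www.changweibo.com/ueditor/php/upload/20140729/1406622077861.png

Once a feature is fairly noisy and its weight happens to be large as well, the regression result is ruined.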
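How the L2 penalty reins in a weight is easiest to see with a single feature and no intercept, where the ridge solution has a closed form. This is a simplified sketch with made-up data and a hypothetical function name, not the solver that sklearn uses.

```python
def ridge_weight(xs, ys, alpha):
    """Minimizer of sum((w * x - y)**2) + alpha * w**2 for one feature.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

alphas = [0.0, 1.0, 10.0, 100.0]
ws = [ridge_weight(xs, ys, a) for a in alphas]
for a, w in zip(alphas, ws):
    print("alpha = %6.1f  w = %.3f" % (a, w))
```

With alpha = 0 the plain least-squares weight 2.0 is recovered; as alpha grows, the weight is pulled steadily toward zero, so no single feature can dominate the prediction.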

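OK, let's try ridge regression on the Boston house prices.

Here I use RidgeCV to cross-validate directly over the several penalty factors I want to try; it picks for me the one that performs best when validated on the training set. The output below shows that it chose 0.1.

# ridgeRegression is the RidgeCV model described above
ridgeRegression.fit(train_features, train_targets)
print("Alpha = %s" % ridgeRegression.alpha_)
# Predict
predict_targets = ridgeRegression.predict(test_features)

# Evaluation: mean absolute error on the test set
n_test_samples = len(test_targets)
X = range(n_test_samples)
error = numpy.linalg.norm(predict_targets - test_targets, ord = 1) / n_test_samples
print("Ridge Regression (Boston) Error: %.2f" % error)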
# Draw

matplotlib.pyplot.plot(X, predict_targets, 'r--', label = 'Predict Price')
matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
legend = matplotlib.pyplot.legend()
matplotlib.pyplot.title("Ridge Regression (Boston)")
matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
matplotlib.pyplot.savefig("Ridge Regression (Boston).png", format='png')
matplotlib.pyplot.show()

Output:

Alpha = 0.1
Ridge Regression (Boston) Error: 3.21

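With this result the error is around 3,210 US dollars, slightly better than the plain linear model earlier. Moreover, the predicted curve now has smaller variance and slightly smaller amplitude, because the ridge penalty term ensures that a change in any single feature cannot have too large an effect on the overall prediction.

http://www.changweibo.com/ueditor/php/upload/20140729/14066221628912.jpg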

1.1.3. Lasso

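I keep hearing senior colleagues at MSRA talk about this; it seems to be quite a hot research topic. Here the 2-norm (L2) is simply replaced by the 1-norm (L1).

The absolute-value constraint pushes the learned weights to be sparser; compressed sensing and the like are related to this.

http://scikit-learn.org/stable/_images/math/5ff15825a85204658e3e5aa6e3b5952b8f709c27.png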
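Why an absolute-value penalty yields sparse weights can be seen in a one-dimensional toy problem, where both penalties have closed-form minimizers. The sketch below is an illustration under simplifying assumptions (one weight at a time, made-up unpenalized values); it is not the coordinate-descent code inside sklearn.

```python
def soft_threshold(z, alpha):
    """Minimizer of 0.5 * (w - z)**2 + alpha * abs(w)  (L1 penalty)."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0


def l2_shrink(z, alpha):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * alpha * w**2  (L2 penalty)."""
    return z / (1.0 + alpha)


# Made-up unpenalized weights, for illustration only.
unpenalized = [3.0, -0.2, 0.05, -1.5, 0.4]
alpha = 0.5

l1_weights = [soft_threshold(z, alpha) for z in unpenalized]
l2_weights = [l2_shrink(z, alpha) for z in unpenalized]
print(l1_weights)  # [2.5, 0.0, 0.0, -1.0, 0.0]
print(l2_weights)
```

The L1 penalty sets every weight whose magnitude is below alpha to exactly zero, while the L2 penalty only scales weights down and never makes them exactly zero.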

This probably won't bring much of a performance gain on the Boston data, because the features are not sparse to begin with. Later we can try 20 newsgroups, which is sparse enough.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''

import sklearn.datasets
import sklearn.linear_model
import numpy.random
import numpy.linalg
import matplotlib.pyplot

if __name__ == "__main__":
    # Load boston dataset
    boston = sklearn.datasets.load_boston()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(boston.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = boston.data[shuffleIdx[:sampleBoundary]]
    train_targets = boston.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = boston.data[shuffleIdx[sampleBoundary:]]
    test_targets = boston.target[shuffleIdx[sampleBoundary:]]

    # Train
    lasso = sklearn.linear_model.LassoCV(alphas=[0.01, 0.05, 0.1, 0.5, 1.0, 10.0])

    lasso.fit(train_features, train_targets)
    print("Alpha = %s" % lasso.alpha_)
    # Predict
    predict_targets = lasso.predict(test_features)

    # Evaluation: mean absolute error on the test set
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    error = numpy.linalg.norm(predict_targets - test_targets, ord = 1) / n_test_samples
    print("Lasso (Boston) Error: %.2f" % error)
    # Draw

    matplotlib.pyplot.plot(X, predict_targets, 'r--', label = 'Predict Price')
    matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.title("Lasso (Boston)")
    matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
    matplotlib.pyplot.savefig("Lasso (Boston).png", format='png')
    matplotlib.pyplot.show()

Output:

Alpha = 0.01
Lasso (Boston) Error: 3.39

The amplitude of this result is still fairly large, especially in the lower price range.

http://www.changweibo.com/ueditor/php/upload/20140729/14066224216435.jpg

1.1.4. Elastic Net

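I am not sure what the best Chinese name for this is. It simply takes both of the regularization terms above into account (the two priors, L1 and L2): it can train a fairly sparse model (Lasso's contribution) while also keeping the benefits of ridge regression's L2. I haven't tried it and don't know what kind of data suits it best; later I will try a few datasets and compare the performance of ordinary linear regression with this model.

http://fmn.rrimg.com/fmn057/20140725/1715/original_Nfrv_53b50000bfa7125d.jpg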
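The combined penalty can be sketched directly. The form below follows the way the scikit-learn documentation writes the Elastic Net penalty, with an overall strength alpha and a mixing ratio l1_ratio; the function name and the weight vector are made up for illustration.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


weights = [1.0, -2.0, 0.0]  # made-up weight vector

pure_l1 = elastic_net_penalty(weights, alpha=1.0, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.0)
mixed = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5)
print("%.2f %.2f %.2f" % (pure_l1, pure_l2, mixed))  # 3.00 2.50 2.75
```

Setting l1_ratio to 1 leaves only the Lasso-style term and setting it to 0 leaves only the ridge-style term; values in between blend the two.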

Naturally, an extra parameter is needed to balance these two prior constraints. One parameter is the penalty factor alpha, which we already had before; the other is the mixing ratio between L1 and L2 (l1_ratio in the code below). These parameters can all be handled by cross-validation (CV). (Every linear model has a corresponding CV variant; ElasticNetCV, for example, does exactly this. Such CV methods really belong to model selection, because these extra parameters, such as the penalty factor and the balance factor, are not the W you want to learn. They define different mathematical models, and the goal of CV is to select a suitable model and then learn W.) This time we throw everything into one pot and use both norms:

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''

import sklearn.datasets
import sklearn.linear_model
import numpy.random
import numpy.linalg
import matplotlib.pyplot

if __name__ == "__main__":
    # Load boston dataset
    boston = sklearn.datasets.load_boston()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(boston.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = boston.data[shuffleIdx[:sampleBoundary]]
    train_targets = boston.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = boston.data[shuffleIdx[sampleBoundary:]]
    test_targets = boston.target[shuffleIdx[sampleBoundary:]]

    # Train
    elasticNet = sklearn.linear_model.ElasticNetCV(alphas=[0.01, 0.05, 0.1, 0.5, 1.0, 10.0], l1_ratio=[0.1,0.3,0.5,0.7,0.9])

    elasticNet.fit(train_features, train_targets)
    print("Alpha = %s" % elasticNet.alpha_)
    print("L1 Ratio = %s" % elasticNet.l1_ratio_)
    # Predict
    predict_targets = elasticNet.predict(test_features)

    # Evaluation: mean absolute error on the test set
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    error = numpy.linalg.norm(predict_targets - test_targets, ord = 1) / n_test_samples
    print("Elastic Net (Boston) Error: %.2f" % error)
    # Draw

    matplotlib.pyplot.plot(X, predict_targets, 'r--', label = 'Predict Price')
    matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.title("Elastic Net (Boston)")
    matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
    matplotlib.pyplot.savefig("Elastic Net (Boston).png", format='png')
    matplotlib.pyplot.show()

Output:

Alpha = 0.01
L1 Ratio = 0.9
Elastic Net (Boston) Error: 3.14

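It seems the "mixed ownership" approach really is the most impressive! Do you know which word reviewers most dread seeing in a paper title these days? Hybrid... with that word in the title, it would be a disgrace not to show a performance gain...

http://www.changweibo.com/ueditor/php/upload/20140729/14066233818048.jpg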

1.1.10. Logistic regression

Here I add the quite practical logistic regression. Despite its name, it is generally used for classification.

This function expresses which class the weighted combination of a sample's features gets assigned to. (Note: the figures below come from the blog http://blog.csdn.net/marvin521/article/details/9263483)

The sigmoid function below is very sensitive to the value of z, but its advantage is that it is continuous and differentiable. This is very important, because it lets us compute W with gradient methods.

http://img.blog.csdn.net/20130707152157937?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvY3VvcXU=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast

http://www.changweibo.com/ueditor/php/upload/20140729/1406623918826.jpg
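For readers who cannot see the figures, the sigmoid and its derivative are short enough to write out. This is a plain-Python sketch with function names of my own choosing; it only evaluates the function and does not train anything.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print("z = %5.1f  sigmoid = %.4f  slope = %.4f"
          % (z, sigmoid(z), sigmoid_derivative(z)))
```

The slope is largest around z = 0, which is where the output is most sensitive to z, and it is defined everywhere, which is what makes gradient-based fitting of W possible.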

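Practice shows that logistic regression is both effective and easy to use for classification. Google is said to use this model for click-through rate (CTR) prediction as well; I have not verified this, it is only hearsay. Still, the classification results on Iris from the code below also suggest that it works well for classification. (Let me stress: we usually see logistic regression used for binary classification, but it can in fact be extended to multiclass classification. I won't go into detail here; look up Softmax Regression for reference.)

Let's test it on the Iris data:

A quick recap of the Iris data (described in detail in the Data section): 150 samples, 3 classes, about 50 records per class, and 4 features per record, all continuous values. Our goal is to predict the class (0, 1, or 2) for a randomly drawn 50% of the data. (Be sure to shuffle the data randomly: the original order is not shuffled, and the first 50 records all belong to one class, so don't get this wrong.)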
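The warning about shuffling is easy to demonstrate without loading any data. The sketch below uses a stand-in label list with the same ordering as Iris (50 of each class in a row); the helper name shuffled_split and the seed are assumptions for illustration.

```python
import random


def shuffled_split(n_samples, sample_ratio, seed=None):
    """Shuffle the indices 0..n_samples-1 and cut them into train/test parts."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    boundary = int(n_samples * sample_ratio)
    return indices[:boundary], indices[boundary:]


# Stand-in for the Iris labels in their original order: 50 of each class.
labels = [0] * 50 + [1] * 50 + [2] * 50

# Without shuffling, the first half contains no sample of class 2 at all.
unshuffled_train = labels[:75]
print(sorted(set(unshuffled_train)))  # [0, 1]

# With shuffling, every sample lands in exactly one of the two parts.
train_idx, test_idx = shuffled_split(len(labels), 0.5, seed=0)
print(len(train_idx), len(test_idx))
```

A model trained on the unshuffled first half would never see the third class, so its test accuracy would say little about the method itself.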

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''

import sklearn.datasets
import sklearn.linear_model
import numpy.random
import matplotlib.pyplot

if __name__ == "__main__":
    # Load iris dataset
    iris = sklearn.datasets.load_iris()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(iris.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = iris.data[shuffleIdx[:sampleBoundary]]
    train_targets = iris.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = iris.data[shuffleIdx[sampleBoundary:]]
    test_targets = iris.target[shuffleIdx[sampleBoundary:]]

    # Train
    logisticRegression = sklearn.linear_model.LogisticRegression()
    logisticRegression.fit(train_features, train_targets)
    # Predict
    predict_targets = logisticRegression.predict(test_features)

    # Evaluation: fraction of test samples whose predicted class is correct
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    correctNum = 0
    for i in X:
        if predict_targets[i] == test_targets[i]:
            correctNum += 1
    accuracy = correctNum * 1.0 / n_test_samples
    print("Logistic Regression (Iris) Accuracy: %.2f" % accuracy)
    # Draw

    matplotlib.pyplot.subplot(2, 1, 1)
    matplotlib.pyplot.title("Logistic Regression (Iris)")
    matplotlib.pyplot.plot(X, predict_targets, 'ro-', label = 'Predict Labels')
    matplotlib.pyplot.ylabel("Predict Class")
    legend = matplotlib.pyplot.legend()

    matplotlib.pyplot.subplot(2, 1, 2)
    matplotlib.pyplot.plot(X, test_targets, 'g+-', label='True Labels')
    legend = matplotlib.pyplot.legend()

    matplotlib.pyplot.ylabel("True Class")
    matplotlib.pyplot.savefig("Logistic Regression (Iris).png", format='png')
    matplotlib.pyplot.show()

Output:

Logistic Regression (Iris) Accuracy: 0.95

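Using 50% for training and 50% for testing, the classification accuracy reaches 95%.

The figure below serves as a visual aid: because the classification accuracy is fairly high, the predicted classes and the true classes follow almost the same pattern:

http://www.changweibo.com/ueditor/php/upload/20140729/14066240061887.jpg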

1.2. Support Vector Machines

1.3. Stochastic Gradient Descent

1.4. Nearest Neighbors

1.4.2. Nearest Neighbors Classification

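Riding on the momentum of the freshly added logistic regression classification of Iris, let's see how classification works with the nearest neighbor method (nearest neighbors can do regression as well as classification; I introduce classification first because it is easier to understand). This is an instance-based classification method, unlike the regression and classification methods introduced earlier, all of which train a concrete mathematical function. The nearest neighbor method does not need to train any formula in advance. The idea is simple: "birds of a feather flock together", so samples with similar features most likely share a class. KNN (K Nearest Neighbor) means finding, around a sample to be classified, the K samples of known class that are closest according to a distance measured on the features; whichever class occurs most often among these K samples is the class assigned to the sample. In other words, find the companions most similar to you, then let the majority rule.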
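The majority-vote idea fits in a dozen lines of plain Python. This is a brute-force sketch with hypothetical function names and made-up two-dimensional points; KNeighborsClassifier offers more efficient neighbor searches, which are not shown here.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points nearest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two made-up clusters with known classes.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # a
print(knn_predict(train_points, train_labels, (4.9, 5.0), k=3))  # b
```

Note that there is no training step at all: the "model" is simply the stored samples, and all the work happens at prediction time.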

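Of course the toolkit is not quite that simple: besides KNN (KNeighborsClassifier) there is also RNN (RadiusNeighborsClassifier). Put plainly, KNN does not care how far away the K nearest points actually are; there are always K relatively nearest ones. RNN, however, takes a radius into account: draw a ball of the given radius around the test sample (a circle for two-dimensional features; for three or more dimensions, think of it as a hypersphere), and everything inside the ball is counted. As a result there is no guarantee that every test sample considers the same number of neighbors.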
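A small sketch makes the difference from KNN visible: with a fixed radius, the number of neighbors depends on how crowded the region around the query is. The function name and the points are made up for illustration.

```python
import math


def neighbors_within_radius(train_points, query, radius):
    """Indices of training points whose distance to query is at most radius."""
    result = []
    for i, p in enumerate(train_points):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, query)))
        if d <= radius:
            result.append(i)
    return result


# Three points close together and one isolated point.
train_points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.6), (3.0, 3.0)]

dense = neighbors_within_radius(train_points, (0.1, 0.1), 1.0)
sparse = neighbors_within_radius(train_points, (2.5, 2.5), 1.0)
print(len(dense), len(sparse))  # 3 1
```

A query in an empty region may find no neighbors at all, a case that any radius-based classifier has to handle in some way.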

We can also weight the votes of these known-class samples by how near or far they are, which is a very natural idea. The code below reflects this.
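The effect of distance weighting can be sketched with three made-up neighbors. The 'distance' option is documented as weighting each neighbor by the inverse of its distance; the function below is my own simplified version of that idea and assumes all distances are greater than zero.

```python
from collections import defaultdict


def weighted_vote(neighbor_labels, neighbor_distances, weights="uniform"):
    """Pick a class from the neighbors using uniform or inverse-distance votes."""
    scores = defaultdict(float)
    for label, dist in zip(neighbor_labels, neighbor_distances):
        if weights == "uniform":
            scores[label] += 1.0
        else:
            # Inverse-distance weighting; assumes dist > 0.
            scores[label] += 1.0 / dist
    return max(scores, key=scores.get)


# One very close neighbor of class "b", two distant neighbors of class "a".
labels = ["b", "a", "a"]
distances = [0.1, 2.0, 2.5]

print(weighted_vote(labels, distances, weights="uniform"))   # a
print(weighted_vote(labels, distances, weights="distance"))  # b
```

With uniform votes the two distant neighbors win by count; with inverse-distance votes the single very close neighbor outweighs them.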

We again use Iris for testing. This time the sampling ratio is harsher: 20% for training and 80% for prediction, in order to tell apart the two distance weighting schemes [uniform, distance].

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''

import sklearn.datasets
import sklearn.neighbors
import numpy.random
import matplotlib.pyplot
import matplotlib.colors

if __name__ == "__main__":
    # Load iris dataset
    iris = sklearn.datasets.load_iris()

    # Split the dataset with sampleRatio
    sampleRatio = 0.2
    n_samples = len(iris.target)

    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = iris.data[shuffleIdx[:sampleBoundary]]
    train_targets = iris.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = iris.data[shuffleIdx[sampleBoundary:]]
    test_targets = iris.target[shuffleIdx[sampleBoundary:]]

    # Train
    n_neighbors = 5  # use the 5 nearest neighbors

    for weights in ['uniform', 'distance']:  # try both weighting schemes
        kNeighborsClassifier = sklearn.neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
        kNeighborsClassifier.fit(train_features, train_targets)

        # Test
        predict_targets = kNeighborsClassifier.predict(test_features)

        # Evaluation: fraction of test samples whose predicted class is correct
        n_test_samples = len(test_targets)
        X = range(n_test_samples)
        correctNum = 0
        for i in X:
            if predict_targets[i] == test_targets[i]:
                correctNum += 1
        accuracy = correctNum * 1.0 / n_test_samples
        print("K Neighbors Classifier (Iris) Accuracy [weight = '%s']: %.2f" % (weights, accuracy))

        # Draw (only the two petal features, columns 2 and 3, are plotted)
        cmap_bold = matplotlib.colors.ListedColormap(['red', 'blue', 'green'])
        X_test = test_features[:, 2:4]
        X_train = train_features[:, 2:4]
        matplotlib.pyplot.scatter(X_train[:, 0], X_train[:, 1], label = 'train samples', marker='o', c = train_targets, cmap=cmap_bold)
        matplotlib.pyplot.scatter(X_test[:,0], X_test[:, 1], label = 'test samples', marker='+', c = predict_targets, cmap=cmap_bold)
        legend = matplotlib.pyplot.legend()
        matplotlib.pyplot.title("K Neighbors Classifier (Iris) [weight = %s]" %(weights))
        matplotlib.pyplot.savefig("K Neighbors Classifier (Iris) [weight = %s].png" %(weights), format='png')
        matplotlib.pyplot.show()

Output:

K Neighbors Classifier (Iris) Accuracy [weight = 'uniform']: 0.91
K Neighbors Classifier (Iris) Accuracy [weight = 'distance']: 0.93

The weighted method is slightly better, improving accuracy by about 2% (note that for these two figures I only used two of the feature dimensions for the plot; there are actually 4 dimensions):

http://www.changweibo.com/ueditor/php/upload/20140729/14066241989630.jpg

1.5. Gaussian Processes

1.6. Cross decomposition

1.7. Naive Bayes

1.8. Decision Trees

1.9. Ensemble methods

1.10. Multiclass and multilabel algorithms

1.11. Feature selection

1.12. Semi-Supervised

1.13. Linear and quadratic discriminant analysis

1.14. Isotonic regression

2. Unsupervised learning

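Now let's move on to unsupervised learning: clustering, probability density estimation (outlier detection), dimensionality reduction, and so on. Relatively speaking, this part of the toolkit is much richer than in many other ML packages; it even has things like manifold learning.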

2.1. Gaussian mixture models

2.2. Manifold learning

2.3. Clustering

2.4. Biclustering

2.5. Decomposing signals in components (matrix factorization problems)

2.6. Covariance estimation

2.7. Novelty and Outlier Detection

2.8. Density Estimation

2.9. Neural network models (unsupervised)

3. Model selection and evaluation

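Model selection sometimes needs particular care, especially when you build a startup on ML. In fact, for many problems different models all reach roughly 80% accuracy; how to improve beyond that is what matters. More than one friend wants to use deep learning as a hook for a PhD or master's thesis proposal in September. To do that well you really need the patience to tune parameters. I think back to the big shots who sat in my row at MSRA, all publishing at NIPS!!! They would scream over a 1% improvement :). What I really want to say is: come on, the parameters are just different... This is Black Magic. As more people work on deep learning, I suspect it will be the parameters, not the models, that are worth money.

Then there is feature selection, which also takes skill. If you really build a startup on ML, the models are still the same old models; the choice of features and parameters often shows a person's level much better. Don't try things blindly, whatever you do...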

3.1. Cross-validation: evaluating estimator performance

3.2. Grid Search: Searching for estimator parameters

3.3. Pipeline: chaining estimators

3.4. FeatureUnion: Combining feature extractors

3.5. Model evaluation: quantifying the quality of predictions

3.6. Model persistence

3.7. Validation curves: plotting scores to evaluate models

4. Dataset transformations

4.1. Feature extraction

4.2. Preprocessing data

4.3. Kernel Approximation

4.4. Random Projection

4.5. Pairwise metrics, Affinities and Kernels

5. Dataset loading utilities

5.1. General dataset API

5.2. Toy datasets

5.3. Sample images

5.4. Sample generators

5.5. Datasets in svmlight / libsvm format

5.6. The Olivetti faces dataset

5.7. The 20 newsgroups text dataset

5.8. Downloading datasets from the mldata.org repository

5.9. The Labeled Faces in the Wild face recognition dataset

5.10. Forest covertypes

6. Scaling Strategies

6.1. Scaling with instances using out-of-core learning

7. Computational Performance

7.1. Prediction Latency

7.2. Prediction Throughput

7.3. Tips and Tricks

Source: the Internet; author: Fan Miao (范淼)

http://www.changweibo.com/ueditor/php/upload/20140729/14066215626743.jpg
