How to Choose the Number of Dimensions in PCA
Tags: pca dimensions | Category: machine learning |
How many dimensions should PCA reduce the data to, and how do you tune this parameter?
Reference: https://www.kaggle.com/arthurtok/interactive-intro-to-dimensionality-reduction

The general idea:
1. Standardize the sample data.
2. Compute the covariance matrix.
3. Compute the matrix's eigenvalues and eigenvectors.
4. Sort the eigenvalues in descending order and take their cumulative sum.
5. Look for the elbow in the cumulative curve; the dimension at the elbow is the value you want.

Sample curve: (figure omitted)

Code:

#! /usr/bin/python
# coding=utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split  # cross_validation was removed in newer sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


def main():
    df = pd.read_csv("data/train.csv")
    # Separate the image pixels from the labels
    img_datas = df.drop('label', axis=1)
    label = df['label']
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(img_datas, label, test_size=0.2)
    # Standardize the image data
    X = X_train.values
    X_std = StandardScaler().fit_transform(X)
    # Compute the eigenvalues and eigenvectors of the covariance matrix
    cov_mat = np.cov(X_std.T)
    eig_vals, eig_vecs = np.linalg.eig(cov_mat)
    # Pair each eigenvalue with its eigenvector
    eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
    # Sort by eigenvalue, largest first
    eig_pairs.sort(key=lambda x: x[0], reverse=True)
    # How much of the variance each eigenvalue explains
    tot = sum(eig_vals)
    var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
    cum_var_exp = np.cumsum(var_exp)
    # Plot the curves to find the best number of dimensions
    plt.plot(var_exp)
    plt.plot(cum_var_exp)
    plt.show()
    # Reduce the dimensionality and train a model
    pca = PCA(n_components=220)
    X_train_reduced = pca.fit_transform(X_train)
    X_test_reduced = pca.transform(X_test)
    classifier = LogisticRegression(penalty='l2', C=1, solver='lbfgs',
                                    multi_class='multinomial')
    classifier.fit(X_train_reduced, y_train)
    # Predict and evaluate the results
    predictions = classifier.predict(X_test_reduced)
    print(classification_report(y_test, predictions))


if __name__ == '__main__':
    main()
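The five-step outline above can be sketched without the full dataset. The snippet below runs the same pipeline on small synthetic data and, instead of reading the elbow off a plot, picks the smallest dimension whose cumulative explained variance crosses a threshold; the 95% cutoff is an illustrative assumption, not a value from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features, only 3 directions carry real variance
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# 1. Standardize the sample data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix
cov_mat = np.cov(X_std.T)
# 3. Eigenvalues (eigvalsh: covariance matrices are symmetric, results come back real)
eig_vals = np.linalg.eigvalsh(cov_mat)
# 4. Sort descending and accumulate the explained-variance percentages
var_exp = 100 * np.sort(eig_vals)[::-1] / eig_vals.sum()
cum_var_exp = np.cumsum(var_exp)
# 5. Smallest dimension whose cumulative explained variance reaches 95% (assumed cutoff)
n_components = int(np.searchsorted(cum_var_exp, 95.0) + 1)
print(n_components)
```

Since only three latent directions generate the data, the chosen dimension comes out at 3 or less, mirroring what the elbow of the plotted curve would show.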

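In scikit-learn, the same cumulative-variance decision can also be made without the manual eigendecomposition: a fitted PCA exposes explained_variance_ratio_, and passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of variance. A minimal sketch on synthetic data, where the 0.95 threshold is again an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 20 features, 5 dominant directions plus light noise
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(200, 20))
X_std = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components
# whose cumulative explained variance reaches 95%
pca = PCA(n_components=0.95).fit(X_std)
print(pca.n_components_)                       # the chosen dimension
print(pca.explained_variance_ratio_.cumsum())  # same curve as cum_var_exp / 100
```

This replaces the plot-and-eyeball step with an explicit threshold, which is convenient once the elbow has been confirmed visually.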