Classification with scikit-learn's Random Forest

(2017-03-23 06:54:13)
Tags: scikit, randomforesttree
Category: Machine Learning
1. Data source: http://blog.csdn.net/wiking__acm/article/details/50971461 (Zhou Zhihua's watermelon data set)
2. Code:
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer

train_data = pd.read_csv('D:\\workspace\\kaggle\\data\\zhouzhihua-gua\\train_data.csv')
test_data = pd.read_csv('D:\\workspace\\kaggle\\data\\zhouzhihua-gua\\test_data.csv')

# Encode the categorical features as integer labels (0 to N-1);
# each encoder is fit on train and test combined so both share the same mapping
def encode_features(df_train, df_test):
    features = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感']
    df_combined = pd.concat([df_train[features], df_test[features]])
    
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    
    return df_train, df_test
 
# Discretize the continuous columns 密度 (density) and 含糖率 (sugar content) into ordinal bins
def simplify_interval_info(df):
    bins_density = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
    bins_sugar = (0, 0.1, 0.2, 0.3, 0.4, 0.5)
    
    group_name_density = [0, 1, 2, 3, 4, 5, 6, 7]
    group_name_sugar = [0, 1, 2, 3, 4]
    
    category_density = pd.cut(df['密度'], bins_density, labels=group_name_density)
    category_sugar = pd.cut(df['含糖率'], bins_sugar, labels=group_name_sugar)
    
    df['密度'] = category_density
    df['含糖率'] = category_sugar
    
    return df
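# Note: pd.cut uses right-closed bins by default, so e.g. a density of 0.65 lands in
# (0.6, 0.7] and gets label 6; values outside (0, 0.8] for density or (0, 0.5] for
# sugar content would become NaN.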
    
    

train_data, test_data = encode_features(train_data, test_data)
train_data = simplify_interval_info(train_data)
test_data = simplify_interval_info(test_data)

X_all = train_data.drop(['好瓜'], axis=1)
y_all = train_data['好瓜']
# Ground-truth labels for the rows in test_data.csv, used to score the final predictions
y_result = [1, 0, 0]


# Hold out 50% of the labelled data as a validation split
num_test = 0.50
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=3)

# Choose some parameter combinations to try
parameters = {'n_estimators':[5,6,7],
              'criterion':['entropy', 'gini']
              }
              
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)
        
clf = RandomForestClassifier()

# Run the grid search
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
clf = grid_obj.best_estimator_

clf = clf.fit(X_train, y_train)
test_predictions = clf.predict(X_test)
print("测试集准确率:  %s " % accuracy_score(y_test, test_predictions))

predictions = clf.predict(test_data)
print("最终准确率:  %s " % accuracy_score(y_result, predictions))
