Handling Categorical Variables in Python - bicloud

http://blog.sina.com.cn/u/1640260361



(2014-04-12 15:31:07)

Tags: python, one-hot-encode, it

Category: Data Mining

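In real-world modeling we often need to encode discrete (categorical) variables such as gender, product category, or tags. One-hot encoding is a common way to do this: each distinct value of a variable becomes its own 0/1 indicator column. Experience shows that this preprocessing helps improve the performance of predictive models.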

http://en.wikipedia.org/wiki/One-hot
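To make the idea concrete before the full script, here is a minimal pure-Python sketch of one-hot encoding. The helper name `one_hot` and the sample values are made up for illustration and are not part of the script below.

```python
def one_hot(values):
    """Map each distinct value to its own 0/1 indicator column."""
    # Sort the distinct values so the column order is deterministic
    categories = sorted(set(values))
    # One row per input value, with a single 1 in the matching column
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(['male', 'female', 'female'])
print(categories)  # ['female', 'male']
print(rows)        # [[0, 1], [1, 0], [1, 0]]
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.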

# -*- coding: utf-8 -*-
""" Small script that shows how to do one-hot encoding
of categorical columns in a pandas DataFrame.

See:
http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.DictVectorizer.html
"""
import random

import numpy
import pandas
from sklearn.feature_extraction import DictVectorizer

def one_hot_dataframe(data, cols, replace=False):
    """ Takes a DataFrame and a list of columns that need to be encoded.

    Returns a 3-tuple comprising the data, the vectorized data,
    and the fitted vectorizer.
    """
    vec = DictVectorizer()
    # Turn each row of the selected columns into a {column: value} dict
    records = data[cols].to_dict(orient='records')
    # String values become one indicator column each, e.g. 'e=Boston'
    vecData = pandas.DataFrame(vec.fit_transform(records).toarray())
    # Get column names (scikit-learn >= 1.0; older releases used get_feature_names())
    vecData.columns = vec.get_feature_names_out()
    vecData.index = data.index
    if replace:
        # Drop the original categorical columns ...
        data = data.drop(cols, axis=1)
        # ... and join the encoded columns based on the index
        data = data.join(vecData)
    return (data, vecData, vec)

def main():
    # Get a random DataFrame
    df = pandas.DataFrame(numpy.random.randn(25, 3), columns=['a', 'b', 'c'])
    # Make some random categorical columns
    df['e'] = [random.choice(('Chicago', 'Boston', 'New York')) for i in range(df.shape[0])]
    df['f'] = [random.choice(('Chrome', 'Firefox', 'Opera', 'Safari')) for i in range(df.shape[0])]
    print(df)
    # Vectorize the categorical columns: e & f
    df, _, _ = one_hot_dataframe(df, ['e', 'f'], replace=True)
    print(df)


if __name__ == '__main__':
    main()

Original data

http://s8/mw690/001N0mEhgy6I30UKESza7&690

One-hot encoded data

http://s3/mw690/001N0mEhgy6I30UNNUS42&690
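The script returns the fitted vectorizer as the third element of the tuple, which matters when new data has to be encoded with the same columns. The sketch below illustrates that reuse under assumed conditions: the training and new rows are invented, and `sparse=False` is used only to keep the printed output readable. The value `'Denver'` was not seen during fitting, and `DictVectorizer` silently ignores such unseen features.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training rows, shaped like the 'e' and 'f' columns in the script
train = [{'e': 'Boston', 'f': 'Chrome'}, {'e': 'Chicago', 'f': 'Opera'}]
vec = DictVectorizer(sparse=False)
vec.fit(train)

# Hypothetical new rows; 'Denver' did not appear in the training data
new_rows = [{'e': 'Chicago', 'f': 'Chrome'}, {'e': 'Denver', 'f': 'Opera'}]
encoded = vec.transform(new_rows)

print(vec.feature_names_)  # ['e=Boston', 'e=Chicago', 'f=Chrome', 'f=Opera']
print(encoded.tolist())    # [[0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
```

Because the column layout is fixed at fit time, the second row gets all zeros in the `e=` columns rather than a new column, so downstream models always see the same feature width.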
