http://blog.sina.com.cn/u/1920634280

首页博文目录关于我

个人资料

无有-散人

微博

加好友发纸条

写留言加关注

博客等级：
博客积分：29

博客访问：2,239
关注人气：143
获赠金笔：34
赠出金笔：2
荣誉徽章：

正文字体大小：大中小

Python-Pandas-Dataframe数据分析之SQL功能实现（二）

(2018-03-22 10:57:00)

标签：

python

pandas

dataframe

sql

数据分析

分类：博君一笑

1. 引言

本文来自：http://www.cnblogs.com/en-heng/p/5686062.html

前一篇介绍了Pandas实现简单的SQL操作，本篇中将主要介绍一些相对复杂一点的操作。为了方便后面实操，先给出一份简化版的设备统计数据：

0,android,NLL,387546520,2099457911
0,ios,NLL,52877990,916421755
1,android,魅族,8995958,120369597
1,android,酷派,9915906,200818172
1,android,三星,16500493,718969514
1,android,小米,23933856,290787590
1,android,华为,26706736,641907761
1,ios,苹果,52877990,916421755
2,android,小米-小米4,2786675,55376581
2,android,魅族-m2-note,4642112,130984205
2,android,OPPO-A31,4893428,62976997
2,ios,苹果-iPhone-6s,5728609,99948716

其中，第一列表示维度组合编号，第二列表示操作系统类型，第三列为维度值（NLL表示缺失，即第一行、第二行表示操作系统的统计，其余表示厂商或机型），第三列、第四列分别表示UV、PV；且字段之间为\t分隔。读取该文件为DataFrame：

In [19]:

import pandas as pd

df = pd.read_csv('data.txt',names=['id', 'os', 'dim', 'uv', 'pv'],sep=',')

df

Out[19]:

	id	os	dim	uv	pv
0	0	android	NLL	387546520	2099457911
1	0	ios	NLL	52877990	916421755
2	1	android	魅族	8995958	120369597
3	1	android	酷派	9915906	200818172
4	1	android	三星	16500493	718969514
5	1	android	小米	23933856	290787590
6	1	android	华为	26706736	641907761
7	1	ios	苹果	52877990	916421755
8	2	android	小米-小米4	2786675	55376581
9	2	android	魅族-m2-note	4642112	130984205
10	2	android	OPPO-A31	4893428	62976997
11	2	ios	苹果-iPhone-6s	5728609	99948716

2. 实战

Add

增加一行数据；可通过dataframe的append函数来追加：

In [20]:

import numpy as np

row_df = pd.DataFrame(np.array([['2', 'ios', '苹果-iPad 4', 3287509, 32891811]]), columns=['id', 'os', 'dim', 'uv', 'pv'])
df = df.append(row_df, ignore_index=True)

df

Out[20]:

	id	os	dim	uv	pv
0	0	android	NLL	387546520	2099457911
1	0	ios	NLL	52877990	916421755
2	1	android	魅族	8995958	120369597
3	1	android	酷派	9915906	200818172
4	1	android	三星	16500493	718969514
5	1	android	小米	23933856	290787590
6	1	android	华为	26706736	641907761
7	1	ios	苹果	52877990	916421755
8	2	android	小米-小米4	2786675	55376581
9	2	android	魅族-m2-note	4642112	130984205
10	2	android	OPPO-A31	4893428	62976997
11	2	ios	苹果-iPhone-6s	5728609	99948716
12	2	ios	苹果-iPad 4	3287509	32891811

增加一列数据，则比较简单：

In [21]:

df['time'] = '2016-07-19'
df

Out[21]:

	id	os	dim	uv	pv	time
0	0	android	NLL	387546520	2099457911	2016-07-19
1	0	ios	NLL	52877990	916421755	2016-07-19
2	1	android	魅族	8995958	120369597	2016-07-19
3	1	android	酷派	9915906	200818172	2016-07-19
4	1	android	三星	16500493	718969514	2016-07-19
5	1	android	小米	23933856	290787590	2016-07-19
6	1	android	华为	26706736	641907761	2016-07-19
7	1	ios	苹果	52877990	916421755	2016-07-19
8	2	android	小米-小米4	2786675	55376581	2016-07-19
9	2	android	魅族-m2-note	4642112	130984205	2016-07-19
10	2	android	OPPO-A31	4893428	62976997	2016-07-19
11	2	ios	苹果-iPhone-6s	5728609	99948716	2016-07-19
12	2	ios	苹果-iPad 4	3287509	32891811	2016-07-19

To Dict

关于android、ios的PV、UV的dict：

In [22]:

def where(df, column_name, id_value):
    df = df[df[column_name] == id_value]
    return df


def to_dict(df):
    """
    {"pv" or "uv" -> {"os": os_value}}
    :return: dict
    """
    df = where(df, 'id', 0)
    df_dict = df.set_index('os')[['uv', 'pv']].to_dict()
    return df_dict

Top

group某列后的top值，比如，android、ios的UV top 2的厂商：
全局top值加上group某列后的top值，并有去重：

In [23]:

def group_top(df, group_col, sort_col, top_n):
    """
    get top(`sort_col`) after group by `group_col`
    :param df: dataframe
    :param group_col: string, column name
    :param sort_col: string, column name
    :param top_n: int
    :return: dataframe
    """
    return df.assign(rn=df.sort_values([sort_col], ascending=False)
                     .groupby(group_col)
                     .cumcount() + 1) \
        .query('rn < ' + str(top_n + 1)) \
        .sort_values([group_col, 'rn'])
        
def top(df, group_col, sort_col, top_n):
    """overall top and group top"""
    all_top_df = df.nlargest(top_n, columns=sort_col)
    grouped_top_df = group_top(df, group_col, sort_col, top_n)
    grouped_top_df = grouped_top_df.ix[:, 0:-1]
    result_df = pd.concat([all_top_df, grouped_top_df]).drop_duplicates()
    return result_df

排序编号

对某列排序后并编号，相当于给出排序名次。比如，对UV的排序编号：

In [24]:

df['rank'] = df['uv'].rank(method='first', ascending=False).apply(lambda x: int(x))
df

Out[24]:

	id	os	dim	uv	pv	time	rank
0	0	android	NLL	387546520	2099457911	2016-07-19	NaN
1	0	ios	NLL	52877990	916421755	2016-07-19	NaN
2	1	android	魅族	8995958	120369597	2016-07-19	NaN
3	1	android	酷派	9915906	200818172	2016-07-19	NaN
4	1	android	三星	16500493	718969514	2016-07-19	NaN
5	1	android	小米	23933856	290787590	2016-07-19	NaN
6	1	android	华为	26706736	641907761	2016-07-19	NaN
7	1	ios	苹果	52877990	916421755	2016-07-19	NaN
8	2	android	小米-小米4	2786675	55376581	2016-07-19	NaN
9	2	android	魅族-m2-note	4642112	130984205	2016-07-19	NaN
10	2	android	OPPO-A31	4893428	62976997	2016-07-19	NaN
11	2	ios	苹果-iPhone-6s	5728609	99948716	2016-07-19	NaN
12	2	ios	苹果-iPad 4	3287509	32891811	2016-07-19	NaN

Left Join

Pandas的left join对NULL的列没有指定默认值，下面给出简单的实现：

In [25]:

def left_join(left, right, on, right_col, default_value):
    df = pd.merge(left, right, how='left', on=on)
    df[right_col] = df[right_col].map(lambda x: default_value if pd.isnull(x) else x)
    return df

自定义

对某一列做较为复杂的自定义操作，比如，厂商的UV占比：

In [26]:

def percentage(part, whole):
    return round(100*float(part)/float(whole), 2)


os_dict = to_dict(df)
all_uv = sum(os_dict['uv'].values())
df = where(df, 'id', 1)
df['per'] = df.apply(lambda r: percentage(r['uv'], all_uv), axis=1)
df

D:\Anaconda2\envs\py3\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[26]:

	id	os	dim	uv	pv	time	rank	per
2	1	android	魅族	8995958	120369597	2016-07-19	NaN	2.04
3	1	android	酷派	9915906	200818172	2016-07-19	NaN	2.25
4	1	android	三星	16500493	718969514	2016-07-19	NaN	3.75
5	1	android	小米	23933856	290787590	2016-07-19	NaN	5.43
6	1	android	华为	26706736	641907761	2016-07-19	NaN	6.06
7	1	ios	苹果	52877990	916421755	2016-07-19	NaN	12.01

重复值

某列的重复值的行： duplicate = df.duplicated(subset=columns, keep=False)

In [30]:

duplicate = df.duplicated(subset=['os','rank'], keep=False)
duplicate

Out[30]:

2     True
3     True
4     True
5     True
6     True
7    False
dtype: bool

写MySQL

Pandas的to_sql函数支持Dataframe直接写MySQL数据库。在公司开发时，常常会有办公网与研发网是不通的，Python的sshtunnel模块提供ssh通道，便于入库debug。 import MySQLdb from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(('porxy host', port),
                        ssh_password='os passwd',
                        ssh_username='os user name',
                        remote_bind_address=('mysql host', 3306)) as server:
    conn = MySQLdb.connect(host="127.0.0.1", user="mysql user name", passwd="mysql passwd",
                           db="db name", port=server.local_bind_port, charset='utf8')
    df.to_sql(name='tb name', con=conn, flavor='mysql', if_exists='append', index=False)

本文来源：http://www.cnblogs.com/en-heng/p/5686062.html

阅读(900)┊ 收藏(0) ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：Python-Pandas-Dataframe数据分析之SQL功能实现

后一篇：Python数据分析从入门到放弃（一）数据分析一般流程

新浪BLOG意见反馈留言板　欢迎批评指正

	赠金笔关闭

	金笔兑换今日土豪榜今日提名榜兑换服务：金笔道具赠送数量：1支10支20支50支100支其他现有金笔：0支还需兑换：1支 x 1=1(积分) 确认点击了解更多规则