Python-Pandas-Dataframe数据分析之SQL功能实现(二)
(2018-03-22 10:57:00)
标签:
pythonpandasdataframesql数据分析 |
分类: 博君一笑 |
In [19]:
import pandas as pd df = pd.read_csv('data.txt',names=['id', 'os', 'dim', 'uv', 'pv'],sep=',') df
Out[19]:
id | os | dim | uv | pv | |
---|---|---|---|---|---|
0 | 0 | android | NLL | 387546520 | 2099457911 |
1 | 0 | ios | NLL | 52877990 | 916421755 |
2 | 1 | android | 魅族 | 8995958 | 120369597 |
3 | 1 | android | 酷派 | 9915906 | 200818172 |
4 | 1 | android | 三星 | 16500493 | 718969514 |
5 | 1 | android | 小米 | 23933856 | 290787590 |
6 | 1 | android | 华为 | 26706736 | 641907761 |
7 | 1 | ios | 苹果 | 52877990 | 916421755 |
8 | 2 | android | 小米-小米4 | 2786675 | 55376581 |
9 | 2 | android | 魅族-m2-note | 4642112 | 130984205 |
10 | 2 | android | OPPO-A31 | 4893428 | 62976997 |
11 | 2 | ios | 苹果-iPhone-6s | 5728609 | 99948716 |
In [20]:
import numpy as np row_df = pd.DataFrame(np.array([['2', 'ios', '苹果-iPad 4', 3287509, 32891811]]), columns=['id', 'os', 'dim', 'uv', 'pv']) df = df.append(row_df, ignore_index=True) df
Out[20]:
id | os | dim | uv | pv | |
---|---|---|---|---|---|
0 | 0 | android | NLL | 387546520 | 2099457911 |
1 | 0 | ios | NLL | 52877990 | 916421755 |
2 | 1 | android | 魅族 | 8995958 | 120369597 |
3 | 1 | android | 酷派 | 9915906 | 200818172 |
4 | 1 | android | 三星 | 16500493 | 718969514 |
5 | 1 | android | 小米 | 23933856 | 290787590 |
6 | 1 | android | 华为 | 26706736 | 641907761 |
7 | 1 | ios | 苹果 | 52877990 | 916421755 |
8 | 2 | android | 小米-小米4 | 2786675 | 55376581 |
9 | 2 | android | 魅族-m2-note | 4642112 | 130984205 |
10 | 2 | android | OPPO-A31 | 4893428 | 62976997 |
11 | 2 | ios | 苹果-iPhone-6s | 5728609 | 99948716 |
12 | 2 | ios | 苹果-iPad 4 | 3287509 | 32891811 |
In [21]:
df['time'] = '2016-07-19' df
Out[21]:
id | os | dim | uv | pv | time | |
---|---|---|---|---|---|---|
0 | 0 | android | NLL | 387546520 | 2099457911 | 2016-07-19 |
1 | 0 | ios | NLL | 52877990 | 916421755 | 2016-07-19 |
2 | 1 | android | 魅族 | 8995958 | 120369597 | 2016-07-19 |
3 | 1 | android | 酷派 | 9915906 | 200818172 | 2016-07-19 |
4 | 1 | android | 三星 | 16500493 | 718969514 | 2016-07-19 |
5 | 1 | android | 小米 | 23933856 | 290787590 | 2016-07-19 |
6 | 1 | android | 华为 | 26706736 | 641907761 | 2016-07-19 |
7 | 1 | ios | 苹果 | 52877990 | 916421755 | 2016-07-19 |
8 | 2 | android | 小米-小米4 | 2786675 | 55376581 | 2016-07-19 |
9 | 2 | android | 魅族-m2-note | 4642112 | 130984205 | 2016-07-19 |
10 | 2 | android | OPPO-A31 | 4893428 | 62976997 | 2016-07-19 |
11 | 2 | ios | 苹果-iPhone-6s | 5728609 | 99948716 | 2016-07-19 |
12 | 2 | ios | 苹果-iPad 4 | 3287509 | 32891811 | 2016-07-19 |
In [22]:
def where(df, column_name, id_value): df = df[df[column_name] == id_value] return df def to_dict(df): """ {"pv" or "uv" -> {"os": os_value}} :return: dict """ df = where(df, 'id', 0) df_dict = df.set_index('os')[['uv', 'pv']].to_dict() return df_dict
In [23]:
def group_top(df, group_col, sort_col, top_n): """ get top(`sort_col`) after group by `group_col` :param df: dataframe :param group_col: string, column name :param sort_col: string, column name :param top_n: int :return: dataframe """ return df.assign(rn=df.sort_values([sort_col], ascending=False) .groupby(group_col) .cumcount() + 1) \ .query('rn < ' + str(top_n + 1)) \ .sort_values([group_col, 'rn']) def top(df, group_col, sort_col, top_n): """overall top and group top""" all_top_df = df.nlargest(top_n, columns=sort_col) grouped_top_df = group_top(df, group_col, sort_col, top_n) grouped_top_df = grouped_top_df.ix[:, 0:-1] result_df = pd.concat([all_top_df, grouped_top_df]).drop_duplicates() return result_df
In [24]:
df['rank'] = df['uv'].rank(method='first', ascending=False).apply(lambda x: int(x)) df
Out[24]:
id | os | dim | uv | pv | time | rank | |
---|---|---|---|---|---|---|---|
0 | 0 | android | NLL | 387546520 | 2099457911 | 2016-07-19 | NaN |
1 | 0 | ios | NLL | 52877990 | 916421755 | 2016-07-19 | NaN |
2 | 1 | android | 魅族 | 8995958 | 120369597 | 2016-07-19 | NaN |
3 | 1 | android | 酷派 | 9915906 | 200818172 | 2016-07-19 | NaN |
4 | 1 | android | 三星 | 16500493 | 718969514 | 2016-07-19 | NaN |
5 | 1 | android | 小米 | 23933856 | 290787590 | 2016-07-19 | NaN |
6 | 1 | android | 华为 | 26706736 | 641907761 | 2016-07-19 | NaN |
7 | 1 | ios | 苹果 | 52877990 | 916421755 | 2016-07-19 | NaN |
8 | 2 | android | 小米-小米4 | 2786675 | 55376581 | 2016-07-19 | NaN |
9 | 2 | android | 魅族-m2-note | 4642112 | 130984205 | 2016-07-19 | NaN |
10 | 2 | android | OPPO-A31 | 4893428 | 62976997 | 2016-07-19 | NaN |
11 | 2 | ios | 苹果-iPhone-6s | 5728609 | 99948716 | 2016-07-19 | NaN |
12 | 2 | ios | 苹果-iPad 4 | 3287509 | 32891811 | 2016-07-19 | NaN |
In [25]:
def left_join(left, right, on, right_col, default_value): df = pd.merge(left, right, how='left', on=on) df[right_col] = df[right_col].map(lambda x: default_value if pd.isnull(x) else x) return df
In [26]:
def percentage(part, whole): return round(100*float(part)/float(whole), 2) os_dict = to_dict(df) all_uv = sum(os_dict['uv'].values()) df = where(df, 'id', 1) df['per'] = df.apply(lambda r: percentage(r['uv'], all_uv), axis=1) df
D:\Anaconda2\envs\py3\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[26]:
id | os | dim | uv | pv | time | rank | per | |
---|---|---|---|---|---|---|---|---|
2 | 1 | android | 魅族 | 8995958 | 120369597 | 2016-07-19 | NaN | 2.04 |
3 | 1 | android | 酷派 | 9915906 | 200818172 | 2016-07-19 | NaN | 2.25 |
4 | 1 | android | 三星 | 16500493 | 718969514 | 2016-07-19 | NaN | 3.75 |
5 | 1 | android | 小米 | 23933856 | 290787590 | 2016-07-19 | NaN | 5.43 |
6 | 1 | android | 华为 | 26706736 | 641907761 | 2016-07-19 | NaN | 6.06 |
7 | 1 | ios | 苹果 | 52877990 | 916421755 | 2016-07-19 | NaN | 12.01 |
In [30]:
duplicate = df.duplicated(subset=['os','rank'], keep=False) duplicate
Out[30]:
2 True 3 True 4 True 5 True 6 True 7 False dtype: bool