Getting started with pandas and seaborn (3): generating regression curves and importing CSV data

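For looking at axis plots, scatter plots with fitted curves and the like, seaborn is quite convenient. You can work with the sample datasets that seaborn provides, or import external data. The external data here is something I made up myself, so there are not many points, but it is enough to walk through the whole process of pandas reading data from a file and seaborn displaying it, which should give some ideas for data-processing code in general. Here is the result:
OK, here is the data file: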
tips.csv
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Lunch,3
23.68,3.31,Male,Yes,Sat,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
30,5,Female,No,Sat,Dinner,6
29,3,Male,Yes,Sat,Dinner,5
28,3.5,Male,No,Sun,Lunch,6
10,3.31,Male,No,Sun,Dinner,1
9,3.61,Female,Yes,Sat,Dinner,1
64,10,Female,Yes,Sun,Dinner,8
30,4,Male,Yes,Sat,Lunch,5
20,5,Male,Yes,Sun,Dinner,3
24,4,Male,No,Sat,Lunch,4
18,3,Female,No,Sun,Dinner,2
17,1,Female,No,Sat,Dinner,3
16,2,Male,No,Sat,Dinner,2
19,2,Male,No,Sun,Lunch,2
23,4,Male,No,Sun,Dinner,3
22,2,Female,Yes,Sat,Dinner,3
30,1.01,Female,No,Sun,Dinner,5
40,1.66,Male,No,Sun,Dinner,5
33,3.5,Male,No,Sun,Lunch,5
35,18,Male,Yes,Sat,Dinner,6
36,3.61,Female,No,Sun,Dinner,5
38,5,Female,No,Sat,Dinner,5
42,15,Male,Yes,Sat,Dinner,6
44,3.5,Male,No,Sun,Lunch,6
49,3.31,Male,No,Sun,Dinner,5
46,10,Female,Yes,Sat,Dinner,6
47,20,Female,Yes,Sun,Dinner,6
43,11,Male,Yes,Sat,Lunch,7
61,12,Male,Yes,Sun,Dinner,5
51,6,Male,No,Sat,Lunch,8
52,8,Female,No,Sun,Dinner,7
58,8,Female,No,Sat,Dinner,7
55,8,Male,No,Sat,Dinner,8
57,5,Male,No,Sun,Lunch,7
53,6,Male,No,Sun,Dinner,8
62,14,Female,Yes,Sat,Dinner,10
Copy it into Notepad and save it with a .csv extension; tips.csv goes in the same directory as the Python file.
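Before plotting, it can help to confirm that pandas parsed the file the way you expect. The following is only a minimal sketch: it uses the first few rows of the listing above, and `io.StringIO` stands in for the file on disk so that it runs without tips.csv being present. The variable names `csv_text` and `sample` are made up for this illustration.

```python
import io

import pandas as pd

# The first few rows of the tips.csv listing; io.StringIO plays the role
# of the file so the sketch runs without creating tips.csv.
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Lunch,3
23.68,3.31,Male,Yes,Sat,Dinner,2
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the numeric columns should not show up as text
print(sample.head())
```

With the real file, `pd.read_csv("tips.csv")` replaces the `io.StringIO` call; the checks stay the same.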
Next comes the program itself:
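# encoding: utf-8
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Read tips.csv from the same directory as this script
tips = pd.read_csv("tips.csv")
print(tips)

# Scatter plot with a fitted regression line and marginal distributions
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")
plt.show()
plt.close()

# One regression panel per value of the smoker column
sns.lmplot(x="total_bill", y="tip", data=tips, col="smoker")
plt.show()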
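Program analysis:
We load tips.csv with pandas, then use total_bill (the total bill) as the x axis and tip (the tip) as the y axis, with the tips object as the pandas DataFrame that supplies the data. The first plot uses kind='reg', a scatter plot with a fitted regression line. The second splits the data by the smoker column and produces side-by-side comparison panels.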
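To get a feel for what the regression line in these plots represents, here is a rough sketch that fits a straight line by least squares with `numpy.polyfit`, once for all rows and once per smoker group. It uses ten rows taken from the listing above; the helper `fit_line` and the other variable names are hypothetical. As far as I understand seaborn's defaults, the line it draws is also a straight-line least-squares fit, but this sketch does not reproduce the confidence band.

```python
import io

import numpy as np
import pandas as pd

# Ten rows from the tips.csv listing, reduced to the three columns used here.
csv_text = """total_bill,tip,smoker
16.99,1.01,No
10.34,1.66,No
21.01,3.5,No
24.59,3.61,No
30,5,No
23.68,3.31,Yes
29,3,Yes
9,3.61,Yes
64,10,Yes
30,4,Yes
"""

rows = pd.read_csv(io.StringIO(csv_text))


def fit_line(frame):
    """Return (slope, intercept) of the least-squares line tip ~ total_bill."""
    slope, intercept = np.polyfit(frame["total_bill"], frame["tip"], deg=1)
    return float(slope), float(intercept)


# One line for all rows, roughly what the single 'reg' panel shows.
overall = fit_line(rows)

# One line per smoker value, roughly what col="smoker" splits into panels.
per_group = {label: fit_line(group) for label, group in rows.groupby("smoker")}

print("all rows:", overall)
for label, (slope, intercept) in per_group.items():
    print(label, slope, intercept)
```

Comparing the two slopes is a quick numeric counterpart to eyeballing the two panels.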
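Along the way I also ran into some baffling problems. At first I was looking for a file called tips with no extension... Well, I was floored: it turns out to be a built-in test dataset that seaborn provides itself. I had overlooked the comment (shown in red in the example I was following):
# Load one of the data sets that come with seaborn
tips = sns.load_dataset("tips")
print(tips)
Out of habit I had assumed this statement must be loading an external test file, and then foolishly spent ages searching the web for it. Back to the point: printing it shows 244 rows of data (index 0 to 243):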
     total_bill    tip     sex  smoker   day    time  size
0         16.99   1.01  Female      No   Sun  Dinner     2
1         10.34   1.66    Male      No   Sun  Dinner     3
2         21.01   3.50    Male      No   Sun  Dinner     3
3         23.68   3.31    Male      No   Sun  Dinner     2
4         24.59   3.61  Female      No   Sun  Dinner     4
5         25.29   4.71    Male      No   Sun  Dinner     4
6          8.77   2.00    Male      No   Sun  Dinner     2
7         26.88   3.12    Male      No   Sun  Dinner     4
8         15.04   1.96    Male      No   Sun  Dinner     2
9         14.78   3.23    Male      No   Sun  Dinner     2
10        10.27   1.71    Male      No   Sun  Dinner     2
11        35.26   5.00  Female      No   Sun  Dinner     4
12        15.42   1.57    Male      No   Sun  Dinner     2
13        18.43   3.00    Male      No   Sun  Dinner     4
14        14.83   3.02  Female      No   Sun  Dinner     2
15        21.58   3.92    Male      No   Sun  Dinner     2
16        10.33   1.67  Female      No   Sun  Dinner     3
17        16.29   3.71    Male      No   Sun  Dinner     3
18        16.97   3.50  Female      No   Sun  Dinner     3
19        20.65   3.35    Male      No   Sat  Dinner     3
20        17.92   4.08    Male      No   Sat  Dinner     2
21        20.29   2.75  Female      No   Sat  Dinner     2
22        15.77   2.23  Female      No   Sat  Dinner     2
23        39.42   7.58    Male      No   Sat  Dinner     4
24        19.82   3.18    Male      No   Sat  Dinner     2
25        17.81   2.34    Male      No   Sat  Dinner     4
26        13.37   2.00    Male      No   Sat  Dinner     2
27        12.69   2.00    Male      No   Sat  Dinner     2
28        21.70   4.30    Male      No   Sat  Dinner     2
29        19.65   3.00  Female      No   Sat  Dinner     2
..          ...    ...     ...     ...   ...     ...   ...
214       28.17   6.50  Female     Yes   Sat  Dinner     3
215       12.90   1.10  Female     Yes   Sat  Dinner     2
216       28.15   3.00    Male     Yes   Sat  Dinner     5
217       11.59   1.50    Male     Yes   Sat  Dinner     2
218        7.74   1.44    Male     Yes   Sat  Dinner     2
219       30.14   3.09  Female     Yes   Sat  Dinner     4
220       12.16   2.20    Male     Yes   Fri   Lunch     2
221       13.42   3.48  Female     Yes   Fri   Lunch     2
222        8.58   1.92    Male     Yes   Fri   Lunch     1
223       15.98   3.00  Female      No   Fri   Lunch     3
224       13.42   1.58    Male     Yes   Fri   Lunch     2
225       16.27   2.50  Female     Yes   Fri   Lunch     2
226       10.09   2.00  Female     Yes   Fri   Lunch     2
227       20.45   3.00    Male      No   Sat  Dinner     4
228       13.28   2.72    Male      No   Sat  Dinner     2
229       22.12   2.88  Female     Yes   Sat  Dinner     2
230       24.01   2.00    Male     Yes   Sat  Dinner     4
231       15.69   3.00    Male     Yes   Sat  Dinner     3
232       11.61   3.39    Male      No   Sat  Dinner     2
233       10.77   1.47    Male      No   Sat  Dinner     2
234       15.53   3.00    Male     Yes   Sat  Dinner     2
235       10.07   1.25    Male      No   Sat  Dinner     2
236       12.60   1.00    Male     Yes   Sat  Dinner     2
237       32.83   1.17    Male     Yes   Sat  Dinner     2
238       35.83   4.67  Female      No   Sat  Dinner     3
239       29.03   5.92    Male      No   Sat  Dinner     3
240       27.18   2.00  Female     Yes   Sat  Dinner     2
241       22.67   2.00    Male     Yes   Sat  Dinner     2
242       17.82   1.75    Male      No   Sat  Dinner     2
243       18.78   3.00  Female      No  Thur  Dinner     2
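When reading a printout like this, keep in mind that the default pandas index starts at 0, so the last label is one less than the row count. A tiny sketch, using a placeholder frame (the name `frame` and its `value` column are made up) with as many rows as the printout's index implies:

```python
import pandas as pd

# Placeholder frame whose default index runs from 0 to 243, like the
# printout; the values themselves are irrelevant here.
frame = pd.DataFrame({"value": range(244)})

print(frame.index[-1])  # last index label: 243
print(len(frame))       # number of rows: 244
print(frame.shape)      # (244, 1)
```

On the real data, `len(tips)` or `tips.shape` gives the row count directly, without reading it off the last line.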
So the problem was actually solved. If you need to import a CSV file you wrote yourself, be sure to use pandas. Calling seaborn's load_dataset function on it directly does not work and raises an error:
tips = sns.load_dataset("tips.csv")
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 2
After adding error_bad_lines=False, as suggested on Stack Overflow,
tips = sns.load_dataset("tips.csv", error_bad_lines=False)
it fails again:
TypeError: ufunc 'multiply' did not contain a loop with signature matching types dtype('S3') dtype('S3') dtype('S3')
The final fix was to go back to the read_csv method from pandas:
tips = pd.read_csv("tips.csv")
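The "Expected 1 fields in line N, saw 2" message is what the pandas parser reports when the first line of its input has a single field and a later line contains a comma. One plausible reading, and it is only an assumption, is that load_dataset treated "tips.csv" as the name of an online example dataset and handed pandas something that was not CSV at all. The sketch below reproduces the same kind of message with made-up non-CSV text; `not_csv` is purely hypothetical input.

```python
import io

import pandas as pd

# Hypothetical non-CSV text: the first line has no comma, so the parser
# settles on one column; the comma on line 3 then looks like an extra field.
not_csv = "<html>\n<head>\n<title>Not Found, sorry</title>\n"

message = None
try:
    pd.read_csv(io.StringIO(not_csv))
except pd.errors.ParserError as exc:
    message = str(exc)

print(message)
```

Skipping the offending lines only hides the symptom, which fits the second error above: whatever was left after skipping was evidently not numeric data. Note also that newer pandas releases replaced error_bad_lines with the on_bad_lines option.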