Python Data Analysis from Getting Started to Giving Up (12): Web Scraping, Part 4: Scraping Taobao
(2018-04-26 09:30:35)
Tags: python, web scraping, getting started, data analysis, Taobao
Category: Python
In [9]:
import requests
import re

lst = []  # IDs of items already saved, used to skip duplicates
In [10]:
class TBSpider:  # Taobao spider
    url = 'https://s.taobao.com/search'
    payload = {'q': 'python', 's': '1', 'ie': 'utf8'}  # URL query parameters, passed as a dict
    file_name = 'tbdata.txt'  # output file; the content is CSV, the extension does not matter
    fl = None

    # Open the output file and write the header row; q is the search keyword
    def __init__(self, q=''):
        self.url = 'https://s.taobao.com/search'
        self.payload = dict(self.payload, q=q)  # per-instance copy, so the class-level dict is not mutated
        self.fl = open(self.file_name, 'w', encoding='utf-8')
        # Columns: ID, title, price, origin, sales, shop
        self.fl.write('"编号","标题","价格","产地","销售量","店铺"\n')

    # Close the file when the object is destroyed
    def __del__(self):
        if self.fl:
            self.fl.close()

    # Run the crawl
    def start_spider(self):
        page_count = self.get_page_count('')
        for i in range(0, page_count):
            self.payload['s'] = 44 * i  # paging trick: the offset advances by 44 per page; check the URL to confirm
            resp = requests.get(self.url, params=self.payload)
            resp.encoding = 'utf-8'
            self.get_content(html=resp.text)
            print('Getting content ', i)

    # Pull the item fields out of the page source and append new items to the file
    def get_content(self, html):
        try:
            title = re.findall(r'"raw_title":"([^"]+)"', html, re.I)
            loc = re.findall(r'"item_loc":"([^"]+)"', html, re.I)
            price = re.findall(r'"view_price":"([^"]+)"', html, re.I)
            nid = re.findall(r'"nid":"([^"]+)"', html, re.I)
            sales = re.findall(r'"view_sales":"([^"]+)"', html, re.I)
            nick = re.findall(r'"nick":"([^"]+)"', html, re.I)
            k = len(title)
            for i in range(0, k):
                if nid[i] in lst:
                    continue
                lst.append(nid[i])
                p = sales[i][:-3]  # drop the 3-character suffix that follows the number
                row = '"' + nid[i] + '",'  # renamed from str, which shadowed the built-in
                row += '"' + title[i] + '",'
                row += '"' + price[i] + '",'
                row += '"' + loc[i] + '",'
                row += '"' + p + '",'
                row += '"' + nick[i] + '"\n'
                self.fl.write(row)
        except Exception as e:  # a bare except would also swallow KeyboardInterrupt
            print('Error ', e)

    def get_page_count(self, html):
        return 100  # fixed page count; the html argument is not used
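The field extraction in `get_content` can be tried offline against a small sample. The sketch below is not the original code. The sample string is a made-up fragment in the shape the regexes expect, the `人付款` suffix on `view_sales` is an assumption, and `extract_items` and `to_csv_text` are hypothetical helper names. It also swaps in the standard `csv` module for the manual string concatenation, and a `set` for the global list, as one possible way to handle quoting and de-duplication.

```python
import csv
import io
import re

FIELDS = ["nid", "raw_title", "view_price", "item_loc", "view_sales", "nick"]
SALES_SUFFIX = "人付款"  # assumed suffix on view_sales values

# Made-up sample in the shape the regexes expect; not real Taobao data.
SAMPLE = (
    '{"nid":"1001","raw_title":"stopper A","view_price":"19.90",'
    '"item_loc":"广东 佛山","view_sales":"651人付款","nick":"shop-a"},'
    '{"nid":"1002","raw_title":"stopper B","view_price":"8.80",'
    '"item_loc":"浙江 金华","view_sales":"2513人付款","nick":"shop-b"},'
    '{"nid":"1001","raw_title":"stopper A","view_price":"19.90",'
    '"item_loc":"广东 佛山","view_sales":"651人付款","nick":"shop-a"}'
)


def extract_items(html, seen):
    """Return one row per item whose nid is not yet in `seen`."""
    # One list of matches per field, in FIELDS order.
    columns = [re.findall(r'"%s":"([^"]+)"' % name, html) for name in FIELDS]
    rows = []
    # zip stops at the shortest list, so a missing field cannot raise IndexError.
    for nid, title, price, loc, sales, nick in zip(*columns):
        if nid in seen:
            continue
        seen.add(nid)
        if sales.endswith(SALES_SUFFIX):
            sales = sales[:-len(SALES_SUFFIX)]
        rows.append([nid, title, price, loc, sales, nick])
    return rows


def to_csv_text(rows):
    """Serialize rows as CSV with every field quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


seen = set()
rows = extract_items(SAMPLE, seen)
print(to_csv_text(rows))
```

The third record in the sample repeats the first `nid`, so only two rows are written. Note that `zip` pairs the fields by position, so if one field is missing for a single item the later rows would be misaligned; parsing the embedded JSON would be more robust, but that goes beyond what the post does.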
In [11]:
def main():
    q = '红酒真空瓶塞'  # search keyword: vacuum wine bottle stopper
    spider = TBSpider(q=q)
    spider.start_spider()
    print('Total : ', len(lst))
In [12]:
main()
Getting content 0
Getting content 1
Getting content 2
Getting content 3
Getting content 4
Getting content 5
Getting content 6
Getting content 7
Getting content 8
Getting content 9
Getting content 10
Getting content 11
Getting content 12
Getting content 13
Getting content 14
Getting content 15
Getting content 16
Getting content 17
Getting content 18
Getting content 19
Getting content 20
Getting content 21
Getting content 22
Getting content 23
Getting content 24
Getting content 25
Getting content 26
Getting content 27
Getting content 28
Getting content 29
Getting content 30
Getting content 31
Getting content 32
Getting content 33
Getting content 34
Getting content 35
Getting content 36
Getting content 37
Getting content 38
Getting content 39
Getting content 40
Getting content 41
Getting content 42
Getting content 43
Getting content 44
Getting content 45
Getting content 46
Getting content 47
Getting content 48
Getting content 49
Getting content 50
Getting content 51
Getting content 52
Getting content 53
Getting content 54
Getting content 55
Getting content 56
Getting content 57
Getting content 58
Getting content 59
Getting content 60
Getting content 61
Getting content 62
Getting content 63
Getting content 64
Getting content 65
Getting content 66
Getting content 67
Getting content 68
Getting content 69
Getting content 70
Getting content 71
Getting content 72
Getting content 73
Getting content 74
Getting content 75
Getting content 76
Getting content 77
Getting content 78
Getting content 79
Getting content 80
Getting content 81
Getting content 82
Getting content 83
Getting content 84
Getting content 85
Getting content 86
Getting content 87
Getting content 88
Getting content 89
Getting content 90
Getting content 91
Getting content 92
Getting content 93
Getting content 94
Getting content 95
Getting content 96
Getting content 97
Getting content 98
Getting content 99
Total : 4391
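The log above reflects one request per page, each with the `s` offset set to `44 * i`. Here is a minimal sketch of how those offsets and query strings come out, using only the standard library. `page_urls` is a hypothetical helper, and the page size of 44 is taken from the spider code rather than from any documented API.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"
PAGE_SIZE = 44  # items per result page, as assumed by the spider


def page_urls(query, page_count):
    """Yield the search URL for each result page."""
    for page in range(page_count):
        params = {"q": query, "s": PAGE_SIZE * page, "ie": "utf8"}
        yield BASE_URL + "?" + urlencode(params)


urls = list(page_urls("python", 3))
for url in urls:
    print(url)
```

With 100 pages the largest offset is 44 * 99 = 4356. If every page returned a full 44 items, that would give 4400 items in total, against the 4391 unique IDs kept in this run.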
In [13]:
import pandas as pd

df = pd.read_csv('tbdata.txt')
df.head()
Out[13]:
| | 编号 | 标题 | 价格 | 产地 | 销售量 | 店铺 |
|---|---|---|---|---|---|---|
| 0 | 559107214139 | 红酒塞抽真空瓶塞红酒瓶塞不锈钢塞葡萄酒瓶 | 19.9 | 广东 佛山 | 651 | 优腾家居旗舰店 |
| 1 | 16265739823 | cheer启尔红酒塞 抽真空瓶塞不锈钢葡萄酒红酒塞子瓶盖红酒瓶塞子 | 28.0 | 广东 广州 | 2707 | cheer启尔旗舰店 |
| 2 | 44938761025 | 红酒塞葡萄酒瓶塞红酒瓶塞抽真空瓶塞抽气软木塞子家用创意密封塞 | 8.8 | 浙江 金华 | 2513 | 觉美家居专营店 |
| 3 | 564934681437 | AnchorChef多功能抽真空机收纳袋真空保鲜袋盒罐酒瓶塞电动充气泵 | 298.0 | 广东 深圳 | 4 | kwantsui |
| 4 | 560575298738 | 天然玛瑙红酒塞子 居家用品红酒酒瓶塞真空瓶塞抽气器红酒塞子 | 99.0 | 广东 深圳 | 1 | 凯丝58 |
