Python Data Analysis from Getting Started to Giving Up (12): Crawlers, Part 4: Scraping Taobao

(2018-04-26 09:30:35)

Tags: python, crawler, beginner, data analysis, Taobao

Category: Python


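Task, difficulties, and solution

Scrape information about a given kind of product from Taobao.
Some sites, Taobao among them today, cannot be parsed with XPath: the product data is not in the HTML elements but is delivered as text inside a script block.
So consider using regular expressions to extract the data.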
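As a minimal sketch of that idea, the snippet below runs `re.findall` over a made-up page fragment. The `sample_html` string, the `page_data` variable inside it, the helper name `extract_field`, and all the values are invented for illustration; only the field names `raw_title` and `view_price` come from the spider code in this post.

```python
import re

# Hypothetical fragment imitating a page whose data sits inside a <script> tag.
sample_html = '''
<script>
page_data = {"items":[{"nid":"1001","raw_title":"demo item A","view_price":"19.90"},
{"nid":"1002","raw_title":"demo item B","view_price":"8.80"}]};
</script>
'''

def extract_field(html, field):
    """Return every string value stored as "field":"value" in the text."""
    pattern = r'"%s":"([^"]+)"' % re.escape(field)
    return re.findall(pattern, html)

titles = extract_field(sample_html, 'raw_title')
prices = extract_field(sample_html, 'view_price')
print(list(zip(titles, prices)))
```

Because each field is collected into its own list, the lists only line up if every item carries every field. That is worth keeping in mind when indexing them in parallel.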

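Implementation steps

You may want to do this exercise in PyCharm.
The tools are the ones used before, lxml and requests; the code below ends up needing only requests plus the standard re module.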

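Points to watch

Use a global list to store product IDs, so that the same item is not recorded twice during the crawl.
Pagination trick: work out the paging number from the URL.
The params argument of requests.get: it is a dict holding the parameters to append to the URL, so there is no need to glue strings together with + every time.
How to use the regular-expression module re.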
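The pagination and params points can be sketched together. `requests.get(url, params=payload)` builds the query string by itself; to show what ends up in the URL without sending any request, this sketch uses `urllib.parse.urlencode` from the standard library instead. The names `build_page_url` and `PAGE_SIZE` are hypothetical; the step of 44 and the parameter names `q`, `s`, and `ie` are taken from the spider code in this post.

```python
from urllib.parse import urlencode

BASE_URL = 'https://s.taobao.com/search'
PAGE_SIZE = 44  # step used for the 's' parameter in this post's spider

def build_page_url(keyword, page_index):
    """Build the search URL for a zero-based page index."""
    params = {'q': keyword, 's': PAGE_SIZE * page_index, 'ie': 'utf8'}
    return BASE_URL + '?' + urlencode(params)

for page in range(3):
    print(build_page_url('python', page))
```

Only the value of `s` changes from page to page, which is why the spider can reuse one dict and overwrite a single key.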

In [9]:


import requests
import re

lst = []  # product IDs already recorded, to avoid duplicates
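A side note on the data structure: testing `nid in lst` on a list scans the whole list, whereas a set answers the same question in constant time on average. The sketch below shows the same de-duplication idea with a set. The names `seen_ids` and `is_new` are hypothetical, and this is an alternative, not what the code in this post does.

```python
seen_ids = set()  # hypothetical name; plays the role of the global lst

def is_new(nid):
    """Return True the first time an ID is seen, False afterwards."""
    if nid in seen_ids:
        return False
    seen_ids.add(nid)
    return True

ids = ['1001', '1002', '1001', '1003', '1002']
unique = [nid for nid in ids if is_new(nid)]
print(unique)  # ['1001', '1002', '1003']
```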

In [10]:


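class TBSpider:  # Taobao spider class
    url = 'https://s.taobao.com/search'
    payload = {'q': 'python', 's': '1', 'ie': 'utf8'}  # URL parameters passed as a dict
    file_name = 'tbdata.txt'  # output file; the content is CSV, the extension does not matter
    fl = None

    # Initialise and write the header row; q is the search keyword
    def __init__(self, q=''):
        self.url = 'https://s.taobao.com/search'
        self.payload = dict(self.payload, q=q)  # copy, so instances do not share the class-level dict
        self.fl = open(self.file_name, 'w', encoding='utf-8')
        # Columns: ID, title, price, location, sales, shop
        self.fl.write('"编号","标题","价格","产地","销售量","店铺"\n')

    # Close the file when the object is destroyed
    def __del__(self):
        if self.fl:
            self.fl.close()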

    # Run the crawl
    def start_spider(self):
        page_count = self.get_page_count('')
        for i in range(page_count):
            self.payload['s'] = 44 * i  # pagination: the 's' offset grows by 44 per page, as read off the URL
            resp = requests.get(self.url, params=self.payload)
            resp.encoding = 'utf-8'
            self.get_content(html=resp.text)
            print('Getting content ', i)


    # Extract the fields of one page with regular expressions and append them to the file
    def get_content(self, html):
        try:
            title = re.findall(r'"raw_title":"([^"]+)"', html, re.I)
            loc = re.findall(r'"item_loc":"([^"]+)"', html, re.I)
            price = re.findall(r'"view_price":"([^"]+)"', html, re.I)
            nid = re.findall(r'"nid":"([^"]+)"', html, re.I)
            sales = re.findall(r'"view_sales":"([^"]+)"', html, re.I)
            nick = re.findall(r'"nick":"([^"]+)"', html, re.I)
            for i in range(len(title)):
                if nid[i] in lst:  # skip products already recorded
                    continue
                lst.append(nid[i])
                sold = sales[i][:-3]  # drop the 3-character suffix after the number
                fields = [nid[i], title[i], price[i], loc[i], sold, nick[i]]
                # One CSV row with every field wrapped in double quotes
                self.fl.write(','.join('"' + f + '"' for f in fields) + '\n')
        except Exception as e:  # e.g. IndexError when the lists differ in length
            print('Error ', e)

    # Placeholder: the page count is hard-coded to 100 and the html argument is not used
    def get_page_count(self, html):
        return 100
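`get_content` builds each output line by hand with string concatenation. As an alternative sketch (not the method used in this post), the standard `csv` module can do the quoting, which also copes with a field that itself contains a double quote. The function name `write_rows` and the demo row are hypothetical; the header strings are the ones the spider writes.

```python
import csv
import io

def write_rows(fileobj, rows):
    """Write the header plus data rows, quoting every field."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerow(['编号', '标题', '价格', '产地', '销售量', '店铺'])
    writer.writerows(rows)

buffer = io.StringIO()  # stands in for the real output file
demo_row = ('1001', 'demo "quoted" title', '19.90', 'demo city', '12', 'demo shop')
write_rows(buffer, [demo_row])
print(buffer.getvalue())
```

With a real file, open it with `newline=''` as the `csv` documentation recommends, and pass the file object in place of `buffer`.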

In [11]:


def main():
    q = '红酒真空瓶塞'  # search keyword: "vacuum wine bottle stopper"
    spider = TBSpider(q=q)
    spider.start_spider()
    print('Total : ', len(lst))

In [12]:


main()


Getting content  0
Getting content  1
Getting content  2
Getting content  3
Getting content  4
Getting content  5
Getting content  6
Getting content  7
Getting content  8
Getting content  9
Getting content  10
Getting content  11
Getting content  12
Getting content  13
Getting content  14
Getting content  15
Getting content  16
Getting content  17
Getting content  18
Getting content  19
Getting content  20
Getting content  21
Getting content  22
Getting content  23
Getting content  24
Getting content  25
Getting content  26
Getting content  27
Getting content  28
Getting content  29
Getting content  30
Getting content  31
Getting content  32
Getting content  33
Getting content  34
Getting content  35
Getting content  36
Getting content  37
Getting content  38
Getting content  39
Getting content  40
Getting content  41
Getting content  42
Getting content  43
Getting content  44
Getting content  45
Getting content  46
Getting content  47
Getting content  48
Getting content  49
Getting content  50
Getting content  51
Getting content  52
Getting content  53
Getting content  54
Getting content  55
Getting content  56
Getting content  57
Getting content  58
Getting content  59
Getting content  60
Getting content  61
Getting content  62
Getting content  63
Getting content  64
Getting content  65
Getting content  66
Getting content  67
Getting content  68
Getting content  69
Getting content  70
Getting content  71
Getting content  72
Getting content  73
Getting content  74
Getting content  75
Getting content  76
Getting content  77
Getting content  78
Getting content  79
Getting content  80
Getting content  81
Getting content  82
Getting content  83
Getting content  84
Getting content  85
Getting content  86
Getting content  87
Getting content  88
Getting content  89
Getting content  90
Getting content  91
Getting content  92
Getting content  93
Getting content  94
Getting content  95
Getting content  96
Getting content  97
Getting content  98
Getting content  99
Total :  4391

In [13]:


import pandas as pd
df = pd.read_csv('tbdata.txt')
df.head()

Out[13]:

	编号	标题	价格	产地	销售量	店铺
0	559107214139	红酒塞抽真空瓶塞红酒瓶塞不锈钢塞葡萄酒瓶	19.9	广东佛山	651	优腾家居旗舰店
1	16265739823	cheer启尔红酒塞抽真空瓶塞不锈钢葡萄酒红酒塞子瓶盖红酒瓶塞子	28.0	广东广州	2707	cheer启尔旗舰店
2	44938761025	红酒塞葡萄酒瓶塞红酒瓶塞抽真空瓶塞抽气软木塞子家用创意密封塞	8.8	浙江金华	2513	觉美家居专营店
3	564934681437	AnchorChef多功能抽真空机收纳袋真空保鲜袋盒罐酒瓶塞电动充气泵	298.0	广东深圳	4	kwantsui
4	560575298738	天然玛瑙红酒塞子居家用品红酒酒瓶塞真空瓶塞抽气器红酒塞子	99.0	广东深圳	1	凯丝58
