Python Data Analysis from Beginner to Giving Up (12), Crawler Part 4: Scraping Taobao

(2018-04-26 09:30:35)
Tags:

python

crawler

beginner

data analysis

Taobao

Category: Python

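Task, difficulties, and solution

  • Scrape information about a given kind of product from Taobao
  • Some sites, today's example Taobao among them, cannot be parsed with XPath because the data is delivered inside a script block
  • So we use regular expressions to extract the data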
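To make the idea concrete, here is a minimal sketch of pulling fields out of script-embedded data with re.findall. The page fragment, the variable name page_data, and the helper extract_field are made up for illustration; only the field names (nid, raw_title, view_price) come from the spider later in this post.

```python
import re

# Hypothetical page fragment: the item data sits inside a <script> block
# as JSON-like text, so there are no HTML elements for XPath to select.
sample_html = '''
<script>
page_data = {"items":[
  {"nid":"1001","raw_title":"Sample stopper A","view_price":"19.90"},
  {"nid":"1002","raw_title":"Sample stopper B","view_price":"8.80"}
]};
</script>
'''

def extract_field(html, key):
    """Return every string value stored under `key` in the page text."""
    pattern = r'"' + re.escape(key) + r'":"([^"]+)"'
    return re.findall(pattern, html)

ids = extract_field(sample_html, 'nid')
titles = extract_field(sample_html, 'raw_title')
prices = extract_field(sample_html, 'view_price')

# Each field comes back as its own list; zip lines them up per item.
for row in zip(ids, titles, prices):
    print(row)
```

This only works while every item carries every field. If one field is missing for some item, the lists fall out of step, which is worth keeping in mind when reading the spider code.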

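Implementation steps

  • Consider doing this exercise in PyCharm
  • For tools, reuse the familiar lxml and requests from earlier posts; the code below ends up needing only requests plus the standard re module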

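Points to note

  • Keep a global list of item IDs so that the same item is not written twice
  • Paging trick: work out the page offset from the number in the URL
  • The params argument of requests.get: it is a dict holding the URL query parameters, so there is no need to build the URL with repeated + concatenation
  • How to use the regular-expression module re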
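The paging and params points can be sketched without touching the network. The example below uses urllib.parse.urlencode from the standard library to show the kind of query string that requests.get(url, params=...) builds from a dict. The helper page_url and the constant names are assumptions made for this illustration; the step of 44 is the one used in the spider code in this post.

```python
from urllib.parse import urlencode

BASE_URL = 'https://s.taobao.com/search'
PAGE_STEP = 44  # the 's' offset grows by 44 per page in the spider code

def page_url(keyword, page_index):
    """Build the search URL for a zero-based page index."""
    params = {'q': keyword, 's': PAGE_STEP * page_index, 'ie': 'utf8'}
    # urlencode joins the dict into key=value pairs separated by '&'
    return BASE_URL + '?' + urlencode(params)

for i in range(3):
    print(page_url('python', i))
```

With requests the same dict is simply passed as params, and only the 's' entry needs to change between pages.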
In [9]:

import requests
import re

lst = []  # IDs of items already written, used to skip duplicates
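A list works for this, but every `in` check scans the whole list, which gets slower as thousands of IDs pile up. A possible alternative, shown here only as a sketch, is a set. The names seen_ids and is_new are hypothetical and are not used by the spider below.

```python
seen_ids = set()  # hypothetical stand-in for the global list

def is_new(item_id):
    """Return True the first time an ID is seen, False on repeats."""
    if item_id in seen_ids:
        return False
    seen_ids.add(item_id)
    return True

# The first ID appears twice; only its first occurrence is kept.
incoming = ['559107214139', '16265739823', '559107214139']
fresh = [item_id for item_id in incoming if is_new(item_id)]
print(fresh)
```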
In [10]:

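class TBSpider:  # Taobao spider class
    url = 'https://s.taobao.com/search'
    payload = {'q': 'python', 's': '1', 'ie': 'utf8'}  # URL parameters, passed as a dict
    file_name = 'tbdata.txt'  # output file; the content is CSV, the extension does not matter
    fl = None

    # Initialise: open the output file and write the header row.
    # q is the search keyword for the product to look up.
    def __init__(self, q=''):
        self.url = 'https://s.taobao.com/search'
        self.payload = dict(self.payload, q=q)  # per-instance copy, so the class-level dict is not mutated
        self.fl = open(self.file_name, 'w', encoding='utf-8')
        # Columns: ID, title, price, origin, sales, shop
        self.fl.write('"编号","标题","价格","产地","销售量","店铺"\n')

    # Close the file when the object is destroyed
    def __del__(self):
        if self.fl is not None:
            self.fl.close()

    # Run the spider
    def start_spider(self):
        page_count = self.get_page_count('')
        for i in range(0, page_count):
            self.payload['s'] = 44 * i  # paging trick: the 's' offset grows by 44 per page (read it off the URL)
            resp = requests.get(self.url, params=self.payload)
            resp.encoding = 'utf-8'
            self.get_content(html=resp.text)
            print('Getting content ', i)

    # Extract the item fields from the page text and append them to the file
    def get_content(self, html):
        try:
            title = re.findall(r'"raw_title":"([^"]+)"', html, re.I)
            loc = re.findall(r'"item_loc":"([^"]+)"', html, re.I)
            price = re.findall(r'"view_price":"([^"]+)"', html, re.I)
            nid = re.findall(r'"nid":"([^"]+)"', html, re.I)
            sales = re.findall(r'"view_sales":"([^"]+)"', html, re.I)
            nick = re.findall(r'"nick":"([^"]+)"', html, re.I)
            k = len(title)
            for i in range(0, k):
                if nid[i] in lst:
                    continue
                lst.append(nid[i])
                p = sales[i][:-3]  # drop the last 3 characters, leaving just the sales number
                row = '"' + nid[i] + '",'  # named row, not str, to avoid shadowing the built-in
                row += '"' + title[i] + '",'
                row += '"' + price[i] + '",'
                row += '"' + loc[i] + '",'
                row += '"' + p + '",'
                row += '"' + nick[i] + '"\n'
                self.fl.write(row)
        except Exception as e:
            # Typically an IndexError when the field lists differ in length
            print('Error ', e)

    # Placeholder: the page count is fixed at 100 and html is ignored
    def get_page_count(self, html):
        return 100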
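Building each CSV line by hand works as long as no field contains a double quote. As a hedged alternative, the standard csv module can take care of the quoting. The sketch below writes to an in-memory buffer instead of tbdata.txt, and the helper write_rows and the sample row are invented for illustration.

```python
import csv
import io

def write_rows(stream, rows):
    """Write the header row plus data rows, quoting every field."""
    writer = csv.writer(stream, quoting=csv.QUOTE_ALL, lineterminator='\n')
    # Same column names as the spider's header row
    writer.writerow(['编号', '标题', '价格', '产地', '销售量', '店铺'])
    writer.writerows(rows)

# Made-up row; the title contains a double quote and a comma on purpose.
rows = [('1001', 'Stopper 2" steel, boxed', '19.90', 'Guangdong', '651', 'demo-shop')]

buffer = io.StringIO()  # stands in for the real output file
write_rows(buffer, rows)
print(buffer.getvalue())

# Reading the text back returns the original title unchanged.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed[1][1])
```

When writing to a real file with csv.writer, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.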
In [11]:

def main():
    q = '红酒真空瓶塞'  # search keyword: vacuum wine bottle stopper
    spider = TBSpider(q=q)
    spider.start_spider()
    print('Total : ',len(lst))
In [12]:

main()

Getting content  0
Getting content  1
Getting content  2
Getting content  3
Getting content  4
Getting content  5
Getting content  6
Getting content  7
Getting content  8
Getting content  9
Getting content  10
Getting content  11
Getting content  12
Getting content  13
Getting content  14
Getting content  15
Getting content  16
Getting content  17
Getting content  18
Getting content  19
Getting content  20
Getting content  21
Getting content  22
Getting content  23
Getting content  24
Getting content  25
Getting content  26
Getting content  27
Getting content  28
Getting content  29
Getting content  30
Getting content  31
Getting content  32
Getting content  33
Getting content  34
Getting content  35
Getting content  36
Getting content  37
Getting content  38
Getting content  39
Getting content  40
Getting content  41
Getting content  42
Getting content  43
Getting content  44
Getting content  45
Getting content  46
Getting content  47
Getting content  48
Getting content  49
Getting content  50
Getting content  51
Getting content  52
Getting content  53
Getting content  54
Getting content  55
Getting content  56
Getting content  57
Getting content  58
Getting content  59
Getting content  60
Getting content  61
Getting content  62
Getting content  63
Getting content  64
Getting content  65
Getting content  66
Getting content  67
Getting content  68
Getting content  69
Getting content  70
Getting content  71
Getting content  72
Getting content  73
Getting content  74
Getting content  75
Getting content  76
Getting content  77
Getting content  78
Getting content  79
Getting content  80
Getting content  81
Getting content  82
Getting content  83
Getting content  84
Getting content  85
Getting content  86
Getting content  87
Getting content  88
Getting content  89
Getting content  90
Getting content  91
Getting content  92
Getting content  93
Getting content  94
Getting content  95
Getting content  96
Getting content  97
Getting content  98
Getting content  99
Total :  4391
In [13]:

import pandas as pd
df = pd.read_csv('tbdata.txt')
df.head()
Out[13]:
   编号  标题  价格  产地  销售量  店铺
0  559107214139  红酒塞抽真空瓶塞红酒瓶塞不锈钢塞葡萄酒瓶  19.9  广东 佛山  651  优腾家居旗舰店
1  16265739823  cheer启尔红酒塞 抽真空瓶塞不锈钢葡萄酒红酒塞子瓶盖红酒瓶塞子  28.0  广东 广州  2707  cheer启尔旗舰店
2  44938761025  红酒塞葡萄酒瓶塞红酒瓶塞抽真空瓶塞抽气软木塞子家用创意密封塞  8.8  浙江 金华  2513  觉美家居专营店
3  564934681437  AnchorChef多功能抽真空机收纳袋真空保鲜袋盒罐酒瓶塞电动充气泵  298.0  广东 深圳  4  kwantsui
4  560575298738  天然玛瑙红酒塞子 居家用品红酒酒瓶塞真空瓶塞抽气器红酒塞子  99.0  广东 深圳  1  凯丝58
