
Python Machine Learning Notes (1): Scraping Web Text with a Crawler, Collecting Facebook Comments

(2017-05-10 12:50:33)
Tags: python, web crawler, data mining
Category: python crawler
Starting my learning log.




First, analyze the structure of the data that Facebook returns, such as the sample below.
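The scraper below reads specific fields out of the Graph API feed JSON. A minimal sketch of the response shape it assumes (field names are taken from the code itself; every value here is invented for illustration):

```python
import json

# Hypothetical sample of a Graph API feed response; field names match what
# the scraper reads (created_time, id, from, message, shares, likes,
# comments, paging), but all values are made up.
sample = json.loads("""
{
  "data": [
    {
      "created_time": "2017-05-09T08:00:00+0000",
      "id": "1234567890_111",
      "from": {"id": "42", "name": "Alice"},
      "message": "example post text",
      "shares": {"count": 3},
      "likes": {"data": [{"id": "7"}, {"id": "8"}]},
      "comments": {
        "data": [
          {"created_time": "2017-05-09T09:00:00+0000",
           "from": {"id": "43", "name": "Bob"},
           "message": "example comment",
           "like_count": 1}
        ],
        "paging": {"next": "https://graph.facebook.com/next-comments-page"}
      }
    }
  ],
  "paging": {"next": "https://graph.facebook.com/next-feed-page"}
}
""")

post = sample["data"][0]
print(post["from"]["name"])                # poster's name
print(len(post["likes"]["data"]))          # likes counted via the likes list
print(post["comments"]["paging"]["next"])  # URL for the next page of comments
```

The scraper walks `data` for the posts, nests into `comments.data` for first-page comments, and follows `comments.paging.next` until it disappears.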



# -*- coding: utf-8 -*-
"""
Created on Mon May  15:29:33 2017

"""

#import PyQuery as pq
import requests
import pandas as pd
import time
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

s = requests.session()
# s.proxies = proxies
login_data = {'email': 'your email', 'pass': 'your password'}
# Log in by POSTing the credentials
s.post('https://www.facebook.com/', headers=headers, data=login_data)

token = 'obtain your own token by logging in at https://developers.facebook.com/tools/explorer/'
# Fetch the hotlink feed: reverse chronological order (order=reverse_chronological),
# multi-level comments, with an offset and a limit on the number of posts per request
url = 'https://graph.facebook.com/hotlink/feed?order=reverse_chronological&filter=stream&offset='
limitstr = '&limit='
tokenstr = '&access_token='
totalnum = 100000
limit = 20
Posts_list = []
Comments_list = []
for a in range(0, totalnum):
    fullurl = url + str(a*limit) + limitstr + str(limit) + tokenstr + token
    res = s.get(fullurl)
    b = 0

    print(res.text)
    if "error" in res.json().keys():
        print(res.text)
        break
    if len(res.json()["data"]) == 0:
        break
    for ele in res.json()["data"]:
        c = 0
        Posts_list.append([ele['created_time'], ele['id'], ele['from']['id'], ele['from']['name'],
                           ele['message'] if 'message' in ele.keys() else "",
                           ele['shares']['count'] if 'shares' in ele.keys() else 0,
                           len(ele['likes']['data']) if 'likes' in ele.keys() else 0])
        if 'comments' in ele.keys():
            for clc in ele["comments"]["data"]:
                Comments_list.append([ele['id'], clc['created_time'], clc['from']['id'],
                                      clc['from']['name'], clc['message'], clc['like_count']])
            if 'paging' in ele['comments'].keys():
                if 'next' in ele['comments']['paging'].keys():
                    comments_nexpage = ele['comments']['paging']['next']
                else:
                    comments_nexpage = ""
                while comments_nexpage != "":
                    rcs = s.get(comments_nexpage)
                    if 'data' in rcs.json().keys():
                        if len(rcs.json()["data"]) == 0:
                            comments_nexpage = ""
                        for clc in rcs.json()["data"]:
                            Comments_list.append([ele['id'], clc['created_time'], clc['from']['id'],
                                                  clc['from']['name'], clc['message'], clc['like_count']])
                        if 'paging' in rcs.json().keys():
                            if 'next' in rcs.json()['paging'].keys():
                                comments_nexpage = rcs.json()['paging']['next']
                            else:
                                comments_nexpage = ""
                        else:
                            comments_nexpage = ""
                        print("Finished item " + str(a) + ":" + str(b) + ":" + str(c) + "\r\n")
                        print("Collected " + str(len(Comments_list)) + " comments")
                    else:
                        comments_nexpage = ""
                    c = c + 1
                    time.sleep(1)
        b = b + 1
    print("Finished page " + str(a) + "\r\n")
    time.sleep(1)

dfp = pd.DataFrame(Posts_list, columns=['time', 'PostID', 'userID', 'username', 'PostContent', 'ShareCounts', 'likeCounts'])
dfc = pd.DataFrame(Comments_list, columns=['PostID', 'time', 'userID', 'username', 'CommentsContent', 'likeCounts'])

dfp.to_csv("your_save_path/hotlinkfacebookposts.csv")
dfc.to_csv("your_save_path/hotlinkfacebookcomments.csv")
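The loop above pages through the feed by computing the offset as `a*limit`. A quick standalone check of how the first few request URLs are built (the token value is a placeholder):

```python
# Same URL fragments as in the scraper; the token is a placeholder.
url = 'https://graph.facebook.com/hotlink/feed?order=reverse_chronological&filter=stream&offset='
limitstr = '&limit='
tokenstr = '&access_token='
limit = 20
token = 'YOUR_TOKEN'

# Build the first three paged URLs exactly as the loop does: offsets 0, 20, 40.
urls = [url + str(a*limit) + limitstr + str(limit) + tokenstr + token for a in range(3)]
for u in urls:
    print(u)
```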




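The two saved tables share the `PostID` column, so comments can later be joined back to the posts they belong to. A sketch with toy in-memory rows in the same column layout the script saves:

```python
import pandas as pd

# Toy rows using the same column layout as the scraper's output tables.
dfp = pd.DataFrame([["2017-05-09", "p1", "42", "Alice", "hello", 3, 2]],
                   columns=['time', 'PostID', 'userID', 'username', 'PostContent', 'ShareCounts', 'likeCounts'])
dfc = pd.DataFrame([["p1", "2017-05-09", "43", "Bob", "nice post", 1]],
                   columns=['PostID', 'time', 'userID', 'username', 'CommentsContent', 'likeCounts'])

# Join each comment to the content of the post it was left on.
merged = dfc.merge(dfp[['PostID', 'PostContent']], on='PostID', how='left')
print(merged[['username', 'CommentsContent', 'PostContent']])
```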