
Python Machine Learning Notes (1): Scraping Web Text with a Crawler, Collecting Facebook Comments

(2017-05-10 12:50:33)
Tags: python, web crawler, data mining
Category: python crawler
Starting my learning log.




First, analyze the structure of the data that Facebook returns, such as the sample below.
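The scraper below reads specific fields out of the Graph API feed JSON. A minimal sketch of the response shape it assumes (field names are taken from the code itself; every value here is invented for illustration):

```python
import json

# Hypothetical sample of a Graph API feed response; field names match what
# the scraper reads (created_time, id, from, message, shares, likes,
# comments, paging), but all values are made up.
sample = json.loads("""
{
  "data": [
    {
      "created_time": "2017-05-09T08:00:00+0000",
      "id": "1234567890_111",
      "from": {"id": "42", "name": "Alice"},
      "message": "example post text",
      "shares": {"count": 3},
      "likes": {"data": [{"id": "7"}, {"id": "8"}]},
      "comments": {
        "data": [
          {"created_time": "2017-05-09T09:00:00+0000",
           "from": {"id": "43", "name": "Bob"},
           "message": "example comment",
           "like_count": 1}
        ],
        "paging": {"next": "https://graph.facebook.com/next-comments-page"}
      }
    }
  ],
  "paging": {"next": "https://graph.facebook.com/next-feed-page"}
}
""")

post = sample["data"][0]
print(post["from"]["name"])                # poster's name
print(len(post["likes"]["data"]))          # likes counted via the likes list
print(post["comments"]["paging"]["next"])  # URL for the next page of comments
```

The scraper walks `data` for the posts, nests into `comments.data` for first-page comments, and follows `comments.paging.next` until it disappears.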



# -*- coding: utf-8 -*-
"""
Created on Mon May  15:29:33 2017

"""

#import PyQuery as pq
import requests
import pandas as pd
import time
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

s = requests.session()
# s.proxies = proxies
login_data = {'email': 'your email', 'pass': 'your password'}
# Log in by POSTing the credentials
s.post('https://www.facebook.com/', headers=headers, data=login_data)

token = 'obtain your own token by logging in at https://developers.facebook.com/tools/explorer/'
# Fetch the hotlink feed: reverse chronological order (order=reverse_chronological),
# multi-level comments, with an offset and a limit on the number of posts per request
url = 'https://graph.facebook.com/hotlink/feed?order=reverse_chronological&filter=stream&offset='
limitstr = '&limit='
tokenstr = '&access_token='
totalnum = 100000
limit = 20
Posts_list = []
Comments_list = []
for a in range(0, totalnum):
    fullurl = url + str(a*limit) + limitstr + str(limit) + tokenstr + token
    res = s.get(fullurl)
    b = 0

    print(res.text)
    if "error" in res.json().keys():
        print(res.text)
        break
    if len(res.json()["data"]) == 0:
        break
    for ele in res.json()["data"]:
        c = 0
        Posts_list.append([ele['created_time'], ele['id'], ele['from']['id'], ele['from']['name'],
                           ele['message'] if 'message' in ele.keys() else "",
                           ele['shares']['count'] if 'shares' in ele.keys() else 0,
                           len(ele['likes']['data']) if 'likes' in ele.keys() else 0])
        if 'comments' in ele.keys():
            for clc in ele["comments"]["data"]:
                Comments_list.append([ele['id'], clc['created_time'], clc['from']['id'],
                                      clc['from']['name'], clc['message'], clc['like_count']])
            if 'paging' in ele['comments'].keys():
                if 'next' in ele['comments']['paging'].keys():
                    comments_nexpage = ele['comments']['paging']['next']
                else:
                    comments_nexpage = ""
                while comments_nexpage != "":
                    rcs = s.get(comments_nexpage)
                    if 'data' in rcs.json().keys():
                        if len(rcs.json()["data"]) == 0:
                            comments_nexpage = ""
                        for clc in rcs.json()["data"]:
                            Comments_list.append([ele['id'], clc['created_time'], clc['from']['id'],
                                                  clc['from']['name'], clc['message'], clc['like_count']])
                        if 'paging' in rcs.json().keys():
                            if 'next' in rcs.json()['paging'].keys():
                                comments_nexpage = rcs.json()['paging']['next']
                            else:
                                comments_nexpage = ""
                        else:
                            comments_nexpage = ""
                        print("Finished item " + str(a) + ":" + str(b) + ":" + str(c) + "\r\n")
                        print("Collected " + str(len(Comments_list)) + " comments")
                    else:
                        comments_nexpage = ""
                    c = c + 1
                    time.sleep(1)
        b = b + 1
    print("Finished page " + str(a) + "\r\n")
    time.sleep(1)

dfp = pd.DataFrame(Posts_list, columns=['time', 'PostID', 'userID', 'username', 'PostContent', 'ShareCounts', 'likeCounts'])
dfc = pd.DataFrame(Comments_list, columns=['PostID', 'time', 'userID', 'username', 'CommentsContent', 'likeCounts'])

dfp.to_csv("your_save_path/hotlinkfacebookposts.csv")
dfc.to_csv("your_save_path/hotlinkfacebookcomments.csv")
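The loop above pages through the feed by computing the offset as `a*limit`. A quick standalone check of how the first few request URLs are built (the token value is a placeholder):

```python
# Same URL fragments as in the scraper; the token is a placeholder.
url = 'https://graph.facebook.com/hotlink/feed?order=reverse_chronological&filter=stream&offset='
limitstr = '&limit='
tokenstr = '&access_token='
limit = 20
token = 'YOUR_TOKEN'

# Build the first three paged URLs exactly as the loop does: offsets 0, 20, 40.
urls = [url + str(a*limit) + limitstr + str(limit) + tokenstr + token for a in range(3)]
for u in urls:
    print(u)
```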




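The two saved tables share the `PostID` column, so comments can later be joined back to the posts they belong to. A sketch with toy in-memory rows in the same column layout the script saves:

```python
import pandas as pd

# Toy rows using the same column layout as the scraper's output tables.
dfp = pd.DataFrame([["2017-05-09", "p1", "42", "Alice", "hello", 3, 2]],
                   columns=['time', 'PostID', 'userID', 'username', 'PostContent', 'ShareCounts', 'likeCounts'])
dfc = pd.DataFrame([["p1", "2017-05-09", "43", "Bob", "nice post", 1]],
                   columns=['PostID', 'time', 'userID', 'username', 'CommentsContent', 'likeCounts'])

# Join each comment to the content of the post it was left on.
merged = dfc.merge(dfp[['PostID', 'PostContent']], on='PostID', how='left')
print(merged[['username', 'CommentsContent', 'PostContent']])
```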