Web crawler: scraping news content from People's Daily
(2016-10-25 07:45:15) | Category: Natural Language Processing
On a whim I wanted to pull some data from the web and analyze it. After mulling it over for a few days, I decided to scrape some data from the People's Daily site and then analyze it.
The code is as follows:
# -*- coding: utf-8 -*-
import urllib2

## Open a web page.
"""
## Method 1:
response = urllib2.urlopen('http://www.bing.com')
print type(response)
"""
## Method 2 -- recommended, because a Request object can also carry
## headers and data later on.
## Fetch the front page of People's Daily:
request = urllib2.Request('http://paper.people.com.cn/rmrb/html/2015-01/01/nbs.D110000renmrb_01.htm')
response = urllib2.urlopen(request)
print response.read()

## URL patterns of the paper:
## article:      http://paper.people.com.cn/rmrb/html/2015-01/01/nw.D110000renmrb_20150101_1-01.htm
## article:      http://paper.people.com.cn/rmrb/html/2015-01/01/nw.D110000renmrb_20150101_2-01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/01/nbs.D110000renmrb_01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/02/nbs.D110000renmrb_01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/03/nbs.D110000renmrb_01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/31/nbs.D110000renmrb_01.htm

f1 = open('1_3_yue.txt', 'w')
list3 = ['01','02','03','04','05','06','07','08','09','10','11','12','13','14','15',
         '16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31']
list3 = ['25','26','27','28','29','30','31']   # only these days were still left to crawl
for ll in list3:
    print ll
    str1 = 'http://paper.people.com.cn/rmrb/html/2015-01/' + ll + '/'
    str2 = 'nbs.D110000renmrb_01.htm'
    request = urllib2.Request(str1 + str2)
    response = urllib2.urlopen(request)
    ## Collect links to the other layout pages of this day's paper.
    ## NOTE: the match string was lost when this post was published;
    ## 'pageLink' (the id on the layout-navigation anchors) is assumed here.
    list1 = []
    for line in response.readlines():
        line = line.strip()
        if line == '':
            continue
        if line.find('pageLink') != -1:
            list1.append(line.split('=')[-1].split('>')[0])
    hang = 0
    for net in list1:
        hang += 1
        if hang == 5:   # only the first few layout pages are taken
            break
        print net
        str3 = net
        request = urllib2.Request(str1 + str3)
        response = urllib2.urlopen(request)
        for line in response.readlines():
            line = line.strip()
            if line == '':
                continue
            ## Links to the individual articles (nw....htm) on this layout page.
            ## NOTE: the original filter was also lost; matching on the article
            ## prefix 'nw.D110000renmrb' is assumed here.
            if line.find('nw.D110000renmrb') != -1:
                line = line.split('=')[1].split('>')[0]
                request1 = urllib2.Request(str1 + line)
                response1 = urllib2.urlopen(request1)
                ## Write the article page's non-empty lines to the output file.
                for each in response1.readlines():
                    each = each.strip()
                    if each != '':
                        f1.write(each + '\n')
f1.close()
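Since the post was written, `urllib2` has been merged into `urllib.request` in Python 3. The two fetch methods at the top look like this in Python 3 (a sketch; the `User-Agent` header is my own addition, not something the original code sets, and the network call itself is left commented out):

```python
from urllib.request import Request, urlopen  # Python 3 successor of urllib2

url = 'http://paper.people.com.cn/rmrb/html/2015-01/01/nbs.D110000renmrb_01.htm'
# A browser-like User-Agent header; some servers reject Python's default one.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(req.full_url)
# html = urlopen(req).read()  # the actual network call, same role as urllib2.urlopen
```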
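The hand-typed `list3` of day strings can also be generated, which avoids typos when the month changes. A small Python 3 sketch (the variable names here are mine, not from the original code):

```python
# Generate the zero-padded day strings '01'..'31' instead of typing them out.
days = ['%02d' % d for d in range(1, 32)]

# The same pieces build each day's front-page URL, as the loop above does:
base = 'http://paper.people.com.cn/rmrb/html/2015-01/'
front = 'nbs.D110000renmrb_01.htm'
urls = [base + d + '/' + front for d in days]
print(days[0], days[-1], len(urls))
```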
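The link extraction above relies on plain string splitting rather than an HTML parser. As an offline Python 3 sketch of that step (the sample line is a hypothetical stand-in for the layout-navigation markup, since the real page isn't fetched here):

```python
# Pull the href target out of an anchor line by string splitting,
# the same way the crawler does (no HTML parser involved).
def extract_href(line):
    # Take the text after the last '=', then cut at the closing '>'.
    return line.split('=')[-1].split('>')[0]

sample = '<a id=pageLink href=nbs.D110000renmrb_02.htm>'  # hypothetical sample
print(extract_href(sample))  # nbs.D110000renmrb_02.htm
```

This is fragile (it breaks if the attribute order changes or the value is quoted); a real HTML parser would be more robust, but string splitting is what the post's code uses.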
A side note: since this is code I actually ran, it is a bit messy, but it does run without errors. If you hit a problem after reusing it, leave me a comment and I will see whether I can help.
I have only recently started writing technical blog posts, so if anything is off, or you have suggestions, please leave a comment and let's discuss!