Web crawler: scraping news content from People's Daily
(2016-10-25 07:45:15) | Category: Natural Language Processing
On a whim I wanted to pull some data from the web and analyze it. After mulling it over for a few days, I decided to scrape some data from the People's Daily site and then analyze it.
The code is as follows:
# -*- coding: utf-8 -*-
import urllib2

## Open a web page.
"""
## Method 1:
response = urllib2.urlopen('http://www.bing.com')
print type(response)
"""
## Method 2 -- recommended, because a Request object can also carry
## headers and data later on.
## Fetch the front page of People's Daily:
request = urllib2.Request('http://paper.people.com.cn/rmrb/html/2015-01/01/nbs.D110000renmrb_01.htm')
response = urllib2.urlopen(request)
print response.read()

## URL patterns of the paper:
## article:      http://paper.people.com.cn/rmrb/html/2015-01/01/nw.D110000renmrb_20150101_1-01.htm
## article:      http://paper.people.com.cn/rmrb/html/2015-01/01/nw.D110000renmrb_20150101_2-01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/01/nbs.D110000renmrb_01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/02/nbs.D110000renmrb_01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/03/nbs.D110000renmrb_01.htm
## layout page:  http://paper.people.com.cn/rmrb/html/2015-01/31/nbs.D110000renmrb_01.htm

f1 = open('1_3_yue.txt', 'w')
list3 = ['01','02','03','04','05','06','07','08','09','10','11','12','13','14','15',
         '16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31']
list3 = ['25','26','27','28','29','30','31']   # only these days were still left to crawl
for ll in list3:
    print ll
    str1 = 'http://paper.people.com.cn/rmrb/html/2015-01/' + ll + '/'
    str2 = 'nbs.D110000renmrb_01.htm'
    request = urllib2.Request(str1 + str2)
    response = urllib2.urlopen(request)
    ## Collect links to the other layout pages of this day's paper.
    ## NOTE: the match string was lost when this post was published;
    ## 'pageLink' (the id on the layout-navigation anchors) is assumed here.
    list1 = []
    for line in response.readlines():
        line = line.strip()
        if line == '':
            continue
        if line.find('pageLink') != -1:
            list1.append(line.split('=')[-1].split('>')[0])
    hang = 0
    for net in list1:
        hang += 1
        if hang == 5:   # only the first few layout pages are taken
            break
        print net
        str3 = net
        request = urllib2.Request(str1 + str3)
        response = urllib2.urlopen(request)
        for line in response.readlines():
            line = line.strip()
            if line == '':
                continue
            ## Links to the individual articles (nw....htm) on this layout page.
            ## NOTE: the original filter was also lost; matching on the article
            ## prefix 'nw.D110000renmrb' is assumed here.
            if line.find('nw.D110000renmrb') != -1:
                line = line.split('=')[1].split('>')[0]
                request1 = urllib2.Request(str1 + line)
                response1 = urllib2.urlopen(request1)
                ## Write the article page's non-empty lines to the output file.
                for each in response1.readlines():
                    each = each.strip()
                    if each != '':
                        f1.write(each + '\n')
f1.close()
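Since the post was written, `urllib2` has been merged into `urllib.request` in Python 3. The two fetch methods at the top look like this in Python 3 (a sketch; the `User-Agent` header is my own addition, not something the original code sets, and the network call itself is left commented out):

```python
from urllib.request import Request, urlopen  # Python 3 successor of urllib2

url = 'http://paper.people.com.cn/rmrb/html/2015-01/01/nbs.D110000renmrb_01.htm'
# A browser-like User-Agent header; some servers reject Python's default one.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(req.full_url)
# html = urlopen(req).read()  # the actual network call, same role as urllib2.urlopen
```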
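The hand-typed `list3` of day strings can also be generated, which avoids typos when the month changes. A small Python 3 sketch (the variable names here are mine, not from the original code):

```python
# Generate the zero-padded day strings '01'..'31' instead of typing them out.
days = ['%02d' % d for d in range(1, 32)]

# The same pieces build each day's front-page URL, as the loop above does:
base = 'http://paper.people.com.cn/rmrb/html/2015-01/'
front = 'nbs.D110000renmrb_01.htm'
urls = [base + d + '/' + front for d in days]
print(days[0], days[-1], len(urls))
```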
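The link extraction above relies on plain string splitting rather than an HTML parser. As an offline Python 3 sketch of that step (the sample line is a hypothetical stand-in for the layout-navigation markup, since the real page isn't fetched here):

```python
# Pull the href target out of an anchor line by string splitting,
# the same way the crawler does (no HTML parser involved).
def extract_href(line):
    # Take the text after the last '=', then cut at the closing '>'.
    return line.split('=')[-1].split('>')[0]

sample = '<a id=pageLink href=nbs.D110000renmrb_02.htm>'  # hypothetical sample
print(extract_href(sample))  # nbs.D110000renmrb_02.htm
```

This is fragile (it breaks if the attribute order changes or the value is quoted); a real HTML parser would be more robust, but string splitting is what the post's code uses.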
A side note: since this is code I actually ran, it is a bit messy, but it does run without errors. If you hit a problem after reusing it, leave me a comment and I will see whether I can help.
I have only recently started writing technical blog posts, so if anything is off, or you have suggestions, please leave a comment and let's discuss!