MATLAB web scraping with urlread2

(2018-06-07 20:29:40)
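Tags:

matlab web scraping

Category: Matlab
Many users run a MATLAB release that predates the webread function (introduced in R2014b) and so cannot use it. The MathWorks File Exchange offers urlread2, an extension of urlread. It was written by Jim Hokanson on top of Java; see his article Expanding urlread capabilities, quoted below.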

【Introduction】
HTTP is the underlying computer networking protocol that enables us to read webpages on the Internet. It consists of a request made by the user to an Internet server (typically located via URL), and a response from that server. Importantly, the request and response consist of three main parts: a resource line (for requests) or status line (for responses), followed by headers, and optionally a message body.
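
As a rough illustration of that three-part layout, the sketch below assembles the text of a minimal GET request by hand. The host www.example.com, the path, and the header values are placeholders chosen for illustration, not taken from any real session.

```matlab
% Assemble the three parts of an HTTP request as plain text (illustration only).
CRLF = char([13 10]);                        % HTTP separates lines with CR+LF

requestLine = 'GET /index.html HTTP/1.1';    % 1) resource line
headers = {'Host: www.example.com', ...      % 2) headers, one per line
           'Accept: text/html'};
body = '';                                   % 3) optional body (empty for GET)

% A blank line (two consecutive CRLFs) separates the headers from the body.
message = [requestLine CRLF strjoin(headers, CRLF) CRLF CRLF body];
disp(message)
```

A response has the same shape, except that the first line is a status line such as "HTTP/1.1 200 OK".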

Matlab’s built-in urlread function enables Matlab users to easily read the server’s response text into a Matlab string:

text = urlread('http://www.google.com');

This is done internally using Java code that connects to the specified URL and reads the information sent by the URL’s server.

urlread accepts optional additional inputs specifying the request type (‘get’ or ‘post’) and parameter values for the request.
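
A minimal sketch of those optional inputs follows. The URL and the parameter name are placeholders assumed for illustration, and the call needs network access to succeed.

```matlab
% Placeholder URL and parameters, assumed for illustration.
url    = 'http://www.mathworks.com/matlabcentral/fileexchange/';
params = {'term', 'urlread2'};               % name/value pairs in one cell array

% Requesting the second output suppresses errors; status is 1 on success.
[text, status] = urlread(url, 'get', params);
if status
    fprintf('Received %d characters\n', numel(text));
else
    disp('Request failed');
end
```

Passing 'post' instead of 'get' sends the same parameters in the message body rather than in the URL.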

Unfortunately, urlread has the following limitations:

1. It does not allow specification of request headers
2. It makes assumptions as to the request headers needed based on the input method
3. It does not expose the response headers and status line
4. It assumes the response body contains text, and not a binary payload
5. It does not enable uploading binary contents to the server
6. It does not enable specifying a timeout in case the server is not responding

urlread2
The urlread2 function addresses all of these problems. The overall design decision for this function was to make it more general: it requires more work up front in some cases, but offers more flexibility.

【Syntax】
For reference, the following is the calling format for urlread2 (which is reminiscent of urlread's):

urlread2(url,*method,*body,*headersIn, varargin)
The * marks optional inputs whose position must be preserved: an earlier input cannot be skipped when a later one is supplied.

url – (string), URL to request
method – (string, default GET), HTTP request method
body – (string, default ''), body of the request
headersIn – (structure, default []), see the following section
varargin – extra properties, specified as property/value pairs
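
The calling format can be sketched as below. This assumes urlread2.m from the File Exchange is on the MATLAB path and that network access is available; the URL is a placeholder. Passing the property/value pair as two trailing arguments is my reading of the varargin description, so treat that form as an assumption.

```matlab
% Requires urlread2.m from the File Exchange on the MATLAB path.
url = 'http://www.mathworks.com';            % placeholder URL

% Simplest call: every optional input takes its default.
html = urlread2(url);

% To reach varargin, the earlier optional inputs are supplied explicitly,
% here with the default values listed in the calling format.
html2 = urlread2(url, 'GET', '', [], 'CAST_OUTPUT', true);
```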

Addressing Problem 1 – Request header
urlread internally uses a Java object called urlConnection that is generally an instance of the class sun.net.www.protocol.http.HttpURLConnection. The method setRequestProperty() can be used to set headers for the request. This method has two inputs, the header name and the value of that header. A simple example of this can be seen below:

urlConnection.setRequestProperty('Content-Type','application/x-www-form-urlencoded');
Here ‘Content-Type’ is the header name and the second input is the value of that property. My function requires passing in nearly all headers as a structure array, with fields for the name and value. The preceding header would be created using a helper function http_createHeader.m:

header = http_createHeader('Content-Type','application/x-www-form-urlencoded');
Multiple headers can be passed in to the function by concatenating header structures into a structure array.
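
A small sketch of that concatenation is shown here. The anonymous function makeHeader is a hypothetical stand-in for http_createHeader, built only from the description of "fields for the name and value"; the real helper may differ, and the Accept header is chosen purely for illustration.

```matlab
% Stand-in for http_createHeader: a struct with 'name' and 'value' fields.
% (Field names follow the description in the text; the real helper may differ.)
makeHeader = @(name, value) struct('name', name, 'value', value);

h1 = makeHeader('Content-Type', 'application/x-www-form-urlencoded');
h2 = makeHeader('Accept', 'text/html');      % second header, for illustration

headers = [h1 h2];                           % 1x2 structure array
for k = 1:numel(headers)
    fprintf('%s: %s\n', headers(k).name, headers(k).value);
end
```

The resulting array would then be passed as the headersIn input.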

Addressing Problem 2 – Request parameters
When making a POST request, parameters are generally specified in the message body using the following format:

[property]=[value]&[property]=[value]

The properties and values are also encoded in a particular way, generally termed urlencoded (encoding and decoding can be done using Matlab’s built-in urlencode and urldecode functions). For GET requests this string is appended to the url with the “?” symbol. Since urlencoding methods can vary, and in the spirit of reducing assumptions, I use separate functions to generate these strings outside of urlread2, and then pass the result in either as the url (for GET) or as the body input (for POST). As an example, I might search the Mathworks website using the upper right search bar on its site for “undocumented matlab” under file exchange (hmmm… pretty cute stuff there!). Doing this performs a GET request with the following property/value pairs:

params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};
These property/value pairs are somewhat obvious from looking at the URL, but could also be determined by using programs such as Fiddler, Firebug, or HttpWatch.

After urlencoding and concatenating, we would form the following string:

search_submit=fileexchange&term=undocumented+matlab&query=undocumented+matlab
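
The encoding and concatenation step can be sketched with MATLAB's built-in urlencode. This is a hand-rolled illustration of the idea, not the author's helper function, which may treat edge cases differently.

```matlab
% Build a urlencoded query string from property/value pairs.
params = {'search_submit','fileexchange', ...
          'term','undocumented matlab', ...
          'query','undocumented matlab'};

nPairs = numel(params) / 2;
pairs  = cell(1, nPairs);
for k = 1:nPairs
    name     = urlencode(params{2*k - 1});   % odd entries are property names
    value    = urlencode(params{2*k});       % even entries are their values
    pairs{k} = [name '=' value];
end

queryString = strjoin(pairs, '&');           % join the pairs with '&'
disp(queryString)
```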

This functionality is normally accomplished internally in urlread, but I use a function http_paramsToString to produce that result. That function also returns the required header for POST requests. The following is an example of both GET and POST requests:

[queryString,header] = http_paramsToString(params,1);
 
% For GET:
url = [url '?' queryString];
urlread2(url)
 
% For POST:
urlread2(url,'POST',queryString,header)

Addressing Problem 3 – Response header
According to the HTTP protocol, each server response starts with a simple header that indicates a numeric response status. The following Matlab code provides access to the status line using the urlConnection object:

status = struct('value',urlConnection.getResponseCode(), 'msg',char(urlConnection.getResponseMessage))
status = 
    value: 200
      msg: 'OK'
urlConnection's getHeaderField() and getHeaderFieldKey() methods enable reading the specific parts of the response header:

headerValue = char(urlConnection.getHeaderField(headerIndex));
headerName  = char(urlConnection.getHeaderFieldKey(headerIndex));
headerIndex starts at 0 and increases by 1 until both headerValue and headerName return empty.
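
That loop might look like the following sketch. It opens its own connection with java.net.URL so that it can run on its own; the URL is a placeholder and network access is assumed. For an HTTP connection, index 0 typically holds the status line with an empty name.

```matlab
% Collect all response headers by index until both name and value are empty.
jUrl = java.net.URL('http://www.mathworks.com');   % placeholder URL
urlConnection = jUrl.openConnection();

names  = {};
values = {};
headerIndex = 0;
while true
    headerValue = char(urlConnection.getHeaderField(headerIndex));
    headerName  = char(urlConnection.getHeaderFieldKey(headerIndex));
    if isempty(headerValue) && isempty(headerName)
        break                                % no more headers
    end
    names{end+1}  = headerName;              %#ok<SAGROW>
    values{end+1} = headerValue;             %#ok<SAGROW>
    headerIndex = headerIndex + 1;
end

for k = 1:numel(names)
    fprintf('[%s] %s\n', names{k}, values{k});
end
```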

It is important to note that header keys (names) can be repeated for different values. Sometimes this is desired, such as if there are multiple cookies being sent to the user. To generically handle this case, two header structures are returned. In both cases the header names are the field names in the structure, after replacing hyphens with underscores. In one case, allHeaders, the values are cell arrays of strings containing all values presented with the particular key. The other structure, firstHeaders, contains only the first instance of the header as a string to avoid needing to dereference a cell array.
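
The two structures can be sketched as follows, using made-up header names and values with a repeated Set-Cookie key. This illustrates the described layout and is not the code inside urlread2.

```matlab
% Example headers with a repeated key (values are made up for illustration).
names  = {'Content-Type', 'Set-Cookie', 'Set-Cookie'};
values = {'text/html',    'a=1',        'b=2'};

allHeaders   = struct();
firstHeaders = struct();
for k = 1:numel(names)
    field = strrep(names{k}, '-', '_');      % hyphens are invalid in field names
    if isfield(allHeaders, field)
        allHeaders.(field){end+1} = values{k};   % append a repeated value
    else
        allHeaders.(field)   = values(k);        % 1x1 cell holding the first value
        firstHeaders.(field) = values{k};        % plain string, first instance only
    end
end

disp(allHeaders.Set_Cookie)                  % cell array with both cookies
disp(firstHeaders.Set_Cookie)                % first cookie only
```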

Addressing Problem 4 – Response body
urlread assumes text output. This is fine for most webpages, which use HTML and are therefore text-based. However, urlread fails when trying to download any non-text resource such as an image, a ZIP file, or a PDF document. I have added a flag in urlread2 called CAST_OUTPUT, which defaults to true, i.e. text response, just as urlread assumes. Using varargin, this flag can be set to false ('CAST_OUTPUT',false) to indicate a binary response.
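
Once a binary response is in hand, it should be written to disk byte for byte rather than as text. The sketch below assumes the uncast output arrives as a uint8 array, which the text does not state. It uses the 8-byte PNG signature as a stand-in payload so that it runs without a network connection; imageUrl in the comment is a hypothetical variable.

```matlab
% With urlread2 on the path, the bytes would come from something like:
%   bytes = urlread2(imageUrl, 'GET', '', [], 'CAST_OUTPUT', false);
% Here the 8-byte PNG signature stands in for a downloaded payload.
bytes = uint8([137 80 78 71 13 10 26 10]);

fileName = [tempname '.png'];
fid = fopen(fileName, 'w');                  % 'w' opens in binary mode
fwrite(fid, bytes, 'uint8');
fclose(fid);

% Read the file back to confirm the bytes survived unchanged.
fid = fopen(fileName, 'r');
readBack = fread(fid, Inf, '*uint8')';       % transpose to a row vector
fclose(fid);
delete(fileName);

disp(isequal(readBack, bytes))
```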

Summary
urlread2's functionality has been expanded to also address other limitations of urlread: It enables binary inputs, better character-set handling of the output, redirection following, and read timeouts.

The modifications described above provide direct access to the key components of the HTTP request and response messages. Its more generic nature lets urlread2 focus on HTTP transmission, and leaves request formation and response interpretation up to the user. I think ultimately this approach is better than providing one-off modifications of the original urlread function to suit a particular need. urlread2 and supporting files can be found on the Matlab File Exchange.
