
How to avoid null character when scraping a web page with getURL() in R?

I have a problem when scraping a website with the getURL() function from RCurl. For example, with http://dogecoin.com it returns an error saying that a NULL character is in the middle of the string (a literal translation of the French error message).

> x <- getURL("http://dogecoin.com/")
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : 
  caractère nul au milieu de la chaîne : '\037\x8b\b\0\0\0\0\0\0\003\xed]\xebr\xdbF\x96\xfe\035=E\x9b\xa9\x89\xe4]\x82\xd4͗8\x92R\xb2|\x9d\x91-\x97\xa5\xac7\x95IiA\002$a\x81\0\x82\x8bhf2ﰯ\xb1\xaf\xb1\xfbb\xfb}\xa7\033WB\022E{\xa62\025\xa5*2I\xf4\xbdO\x9f\xcbw\xcei\xec\xdd{vrt\xf6\xe3\xbb\xe7j\x92N\xfd\x83\xb5\xbd\xfc\037\xd7v\016\xd6\024\xfeۻgY\xf2\xc1\xfcw~^\xf9\xb2\xf0\xf1\xdc\024=\xc7\177}\xd5\xc7_\xa5\xf8Y\xff\x91\xbf\xf2\x9fR\033\xe7\xf7\xf1\xaf.\xde\xc7\003\xa5\xe4\xef_\xe5\xef\177\xe1\xaf\xfe\x88V\xf4\xaf\xfa_)φ\xf1\177\xed/JI\177ů|\xae\xfaR\xfe\xaf\xe7\xe7\xdd\xf3>\xfe\xe21?+\xf9\\\xfe9Gs\xbabu\xa2ғ\xd4Y\x93\x9f\x93PM\xed\xf8"\x8bz\xeaҍ\xe7j\xe6\016\022/u{j\026\xcezR²\016\xd6־₩7nj\xcb\xf7\xaf\xf6R/\xf5݃g\xe1\xd8\035\x86^\xb0\xd7\xd7\xdf\xf1`\xca2É\035'n\xba\xdf\xc9ґ\xf5\xb8\xc3\n\xfa\xf70H\xdd\0\xbf\xe7\025\x95\x97(;Pa\xe4\006\030J\026\017]\025\xb9nl\xa5\xa1\xc5\177\x95㍽\xd4\xf6\xd50\x8bc7\030λjd_\x86\xb1\xeb\xa8\xc1\\\x9dN\xbc\x81\xad^\aY\x82\xd1

On some very rare occasions it returns clean HTML, but most of the time I get this error. It seems related to their website, and as you can see the output contains several weird characters such as ₩ and 4͗.

An option could be to use getURLContent() to download the raw data, but then I'm unable to convert the binary content into HTML.
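(For reference, a sketch of how the raw bytes could be decoded with base R, assuming the body is gzip-compressed UTF-8 HTML, which the verbose output below suggests:)

library(RCurl)

# fetch the body as raw bytes instead of text
raw_body <- getURLContent("http://dogecoin.com/", binary = TRUE)

# inflate the gzip payload and convert it to a character string
html_text <- rawToChar(memDecompress(raw_body, type = "gzip"))
Encoding(html_text) <- "UTF-8"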

I have tried changing the .encoding argument, but it doesn't give the expected result. How can I scrape this webpage?

EDIT: Verbose mode

> getURL("http://dogecoin.com/", verbose = TRUE)
*   Trying 192.30.252.153...
* Connected to dogecoin.com (192.30.252.153) port 80 (#0)
> GET / HTTP/1.1
Host: dogecoin.com
Accept: */*

< HTTP/1.1 200 OK
< Server: GitHub.com
< Date: Wed, 25 Oct 2017 10:12:26 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Last-Modified: Tue, 16 May 2017 01:27:52 GMT
< Access-Control-Allow-Origin: *
< Expires: Wed, 25 Oct 2017 10:05:08 GMT
< Cache-Control: max-age=600
< Content-Encoding: gzip
< X-GitHub-Request-Id: A4D0:66A8:93356A1:D740FF7:59F0638A
< 
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : 
  caractère nul au milieu de la chaîne : '\037\x8b\b\0\0\0\0\0\0\003\xed]\xebr\xdbF\x96\xfe\035=E\x9b\xa9\x89\xe4]\x82\xd4͗8\x92R\xb2|\x9d\x91-\x97\xa5\xac7\x95IiA\002$a\x81\0\x82\x8bhf2ﰯ\xb1\xaf\xb1\xfbb\xfb}\xa7\033WB\022E{\xa62\025\xa5*2I\xf4\xbdO\x9f\xcbw\xcei\xec\xdd{vrt\xf6\xe3\xbb\xe7j\x92N\xfd\x83\xb5\xbd\xfc\037\xd7v\016\xd6\024\xfeۻgY\xf2\xc1\xfcw~^\xf9\xb2\xf0\xf1\xdc\024=\xc7\177}\xd5\xc7_\xa5\xf8Y\xff\x91\xbf\xf2\x9fR\033\xe7\xf7\xf1\xaf.\xde\xc7\003\xa5\xe4\xef_\xe5\xef\177\xe1\xaf\xfe\x88V\xf4\xaf\xfa_)φ\xf1\177\xed/JI\177ů|\xae\xfaR\xfe\xaf\xe7\xe7\xdd\xf3>\xfe\xe21?+\xf9\\\xfe9Gs\xbabu\xa2ғ\xd4Y\x93\x9f\x93PM\xed\xf8"\x8bz\xeaҍ\xe7j\xe6\016\022/u{j\026\xcezR²\016\xd6־₩7nj\xcb\xf7\xaf\xf6R/\xf5݃g\xe1\xd8\035\x86^\xb0\xd7\xd7\xdf\xf1`\xca2É\035'n\xba\xdf\xc9ґ\xf5\xb8\xc3\n\xfa\xf70H\xdd\0\xbf\xe7\025\x95\x97(;Pa\xe4\006\030J\026\017]\025\xb9nl\xa5\xa1\xc5\177\x95㍽\xd4\xf6\xd50\x8bc7\030λjd_\x86\xb1\xeb\xa8\xc1\\\x9dN\xbc\x81\xad^\aY\x82\xd1
> 

RCurl::getURL() seems not to detect either the Content-Encoding: gzip header or the tell-tale two-byte "magic" code at the start of the body, both of which signal that the content is gzip-encoded.
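You can check the magic bytes yourself: the failing string begins with \037\x8b, i.e. 0x1f 0x8b, the standard gzip signature. A quick sketch, reusing the raw-download approach from the question:

raw_body <- RCurl::getURLContent("http://dogecoin.com/", binary = TRUE)

# the first two bytes of a gzip stream are always 0x1f 0x8b
identical(raw_body[1:2], as.raw(c(0x1f, 0x8b)))  # TRUE when the body is gzipped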

I would suggest, as Michael did, switching to httr for reasons I'll go into in a bit, but this would be a better httr idiom:

library(httr)

res <- GET("http://dogecoin.com/")  # encoding is negotiated and decoded automatically
content(res)                        # parsed HTML as an xml2 document

The content() function extracts the response body and returns an xml2 object, which is similar to the parsed object from the XML package that you were likely using, given the use of RCurl::getURL().
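As a quick illustration (the XPath here is just an example), the parsed document can be queried directly with xml2:

library(httr)
library(xml2)

res <- GET("http://dogecoin.com/")
doc <- content(res)  # an xml2 document

# e.g. pull the page title out of the parsed tree
xml_text(xml_find_first(doc, "//title"))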

The alternative is to add some crutches to RCurl::getURL():

html_text_res <- RCurl::getURL("http://dogecoin.com/", encoding = "gzip")

Here, we're explicitly informing getURL() that the content is gzipped, but that's fraught with peril: if the upstream server decides to use, say, brotli encoding instead, you'll get an error.
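One way to guard against a surprise encoding (a sketch, not bulletproof) is to capture the response headers with RCurl's basicHeaderGatherer() and only gunzip when the server actually advertised gzip:

library(RCurl)

h <- basicHeaderGatherer()
body <- getBinaryURL("http://dogecoin.com/", headerfunction = h$update)

# decode according to what the server said it sent
enc <- h$value()["Content-Encoding"]
if (isTRUE(tolower(enc) == "gzip")) {
  html_text <- rawToChar(memDecompress(body, type = "gzip"))
} else {
  html_text <- rawToChar(body)
}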

If you still want to use RCurl rather than switching to httr, I'd suggest the following for this site:

RCurl::getURL("http://dogecoin.com/", 
              encoding = "gzip",
              httpheader = c(`Accept-Encoding` = "gzip"))

Here we're giving getURL() the decoding crutch but also explicitly telling the upstream server that gzip is 👍 and that it should send data with that encoding.

However, httr would be the better choice, since it and the curl package it uses deal with web-server interaction and content in a more thorough way.
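A quick check illustrates this: even though the server still responds with gzip, httr hands you the inflated body with no manual decoding step.

library(httr)

res <- GET("http://dogecoin.com/")

# the server still says gzip...
res$headers[["content-encoding"]]

# ...but the body has already been decompressed for you
substr(content(res, as = "text", encoding = "UTF-8"), 1, 15)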
