How to webscrape a secured page (https link) in R, using readHTMLTable from the XML package?

Question

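There are good answers on how to use readHTMLTable from the XML package, and I have used it successfully with regular http pages. However, I cannot solve my problem with https pages.

I am trying to read the table on this website (url string):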

library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)

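But I get this error: File https://ned.nih.gov/search/Vi...does not exist.

I tried to get past the https problem (the first two lines below), using a solution found via Google (for example: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/).

This trick lets me see more of the page, but every attempt to extract the table still fails. Any advice is appreciated. I need the table fields such as Organization, Organizational Title, and Manager.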

 #attempt to get past the https problem 
 raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
 head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
...
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...

Answer 1

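The new package httr provides a wrapper around RCurl that makes it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.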

library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile)
)

# Use regex to extract the desired table
x <- content(page, as = "text")  # text_content(page) in older httr versions
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(tab)

Result:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â

Get httr: it is available from CRAN via install.packages("httr").

Edit: The FAQ of the RCurl package is a useful page on this topic.
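As a side note on the question's failing htmlParse(raw) call: htmlParse() has to guess whether a character string is a file name, a URL, or the document itself. A minimal sketch, assuming the page text is already in memory, is to pass asText = TRUE so that no guessing is needed. The HTML string below is a hypothetical stand-in for the output of getURL(), including the leading "\r\n" seen in the question; the names and values in it are made up.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL();
# like the real page, it starts with "\r\n" rather than "<".
raw <- paste0(
  "\r\n<html><body>",
  "<table id=\"demo\">",
  "<tr><td>Legal Name:</td><td>Dr Example Person</td></tr>",
  "<tr><td>Phone:</td><td>000-000-0000</td></tr>",
  "</table></body></html>"
)

# asText = TRUE states explicitly that `raw` is HTML content, not a path or URL
doc <- htmlParse(raw, asText = TRUE)

# Read the first table from the parsed document; no header row in this layout
tab <- readHTMLTable(doc, which = 1, header = FALSE, stringsAsFactors = FALSE)
print(tab)
```

Parsing once and handing the parsed document to readHTMLTable() also avoids downloading or re-parsing the page for every table.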

Answer 2

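This uses Andrie's great way of getting past the https problem.

A way to get at the data without readHTMLTable is also shown below.

A table in HTML may have an ID. In this case the table has a convenient one, and an XPath expression passed to the getNodeSet function selects it nicely.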

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile, ssl.verifypeer = FALSE)
)

h = htmlParse(content(page, as = "text"), asText = TRUE)
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
ns

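I still need to extract the IDs behind the hyperlinks.

For example, instead of the name Colleen Barros as manager, I need to get to the ID 0010080638.

Manager: Colleen Barros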
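One possible way to get at the IDs behind the hyperlinks is to select the a nodes inside the table and read their href attributes. This is only a sketch: the HTML fragment below is a hypothetical reconstruction, and the assumption that each link looks like ViewDetails.aspx?NIHID=... is taken from the URL in the question, not from the real page source.

```r
library(XML)

# Hypothetical fragment; the real page's markup may differ.
html <- paste0(
  "<html><body>",
  "<table id=\"ctl00_ContentPlaceHolder_dvPerson\">",
  "<tr><td>Manager:</td>",
  "<td><a href=\"ViewDetails.aspx?NIHID=0010080638\">Colleen Barros</a></td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# All hyperlinks inside the table with the known ID
xp <- "//table[@id='ctl00_ContentPlaceHolder_dvPerson']//a"
link_text <- xpathSApply(doc, xp, xmlValue)            # visible link text
link_href <- xpathSApply(doc, xp, xmlGetAttr, "href")  # link target

# Keep only the digits that follow "NIHID=" (assumed link format)
ids <- sub(".*NIHID=([0-9]+).*", "\\1", link_href)

result <- data.frame(name = link_text, id = ids, stringsAsFactors = FALSE)
print(result)
```

Keeping the ID as a character string preserves its leading zeros, which would be lost on conversion to a number.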

Answer 3

Here is a function I wrote to deal with this problem. It detects whether the url uses https and, if so, fetches the page with httr.

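readHTMLTable2 <- function(url, which = NULL, ...) {
  require(httr)
  require(XML)
  # Only https URLs need httr; plain http goes straight to readHTMLTable.
  # grepl() is base R, so stringr's str_detect() is not required.
  if (grepl("^https", url)) {
    page <- GET(url, user_agent("httr-soccer-ranking"))
    # content(as = "text") replaces the older text_content() helper
    doc <- htmlParse(content(page, as = "text"), asText = TRUE)
    if (is.null(which)) {
      tmp <- readHTMLTable(doc, ...)
    } else {
      # Pick the requested table by position among all <table> nodes
      tableNodes <- getNodeSet(doc, "//table")
      tab <- tableNodes[[which]]
      tmp <- readHTMLTable(tab, ...)
    }
  } else {
    tmp <- readHTMLTable(url, which = which, ...)
  }
  return(tmp)
}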
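
A small note on the https check in a function like this: a bare "https" pattern matches anywhere in the string and is case-sensitive. The sketch below compares it with an anchored, case-insensitive check in base R. The helper name is_https and the example.com URLs are made up for illustration.

```r
# Hypothetical helper: TRUE only when the URL scheme itself is https
is_https <- function(url) {
  grepl("^https://", url, ignore.case = TRUE)
}

urls <- c(
  "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048",
  "http://example.com/https-guide",  # "https" appears only in the path
  "HTTPS://example.com/"             # upper-case scheme
)

anchored   <- is_https(urls)      # checks the scheme at the start of the URL
unanchored <- grepl("https", urls) # matches "https" anywhere, case-sensitive

print(data.frame(url = urls, anchored = anchored, unanchored = unanchored))
```

The two checks disagree on the second and third URLs, which is why the anchored form is the safer choice when deciding how to fetch a page.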

Answer 1: score 26, accepted, 2012-05-22 04:58:06

Answer 2: score 4, 2012-05-22 17:57:06

Answer 3: score 0, 2018-01-21 06:09:11
