
R - How can I extract values inside a JavaScript object from the downloaded HTML code

I am using the rvest package for web scraping, but I am having trouble extracting information from a JavaScript object.

The JavaScript has the form:

... some js ...
var selectoptions = {
  "Region A": {
  "key" : "a",
  "defaultvalue" : "a",
  "values" : { //key : value
                    "(A01) A1": "a01",
                    "(A02) A2": "a02",
                    "(A03) A3": "a03",
                    "(A04) A4": "a04"
  }
 }, 
  "Region B": {
  "key" : "b",
  "defaultvalue" : "b",
  "values" : { //key : value
                    "(B01) B1": "b01",
                    "(B02) B2": "b02",
                    "(B03) B3": "b03",
                    "(B04) B4": "b04"
  }
 }
}
... some js ...

How can I extract the information (the "values" of each region)?

Here is my attempt:

library(rvest)
library(stringr)
url <- "http://www.census2011.gov.hk/en/constituency-area.html" #the url
js_code <- read_html(url) %>% html_nodes("script") %>% html_text() # read_html() replaces the deprecated html()
js_code <- js_code[[9]] # The information I wanted is in the 9th element
info_wanted1 <- str_extract(js_code, "\\{.*?\\}")
info_wanted2 <- str_extract_all(js_code, "\\{.*?\\}")

> info_wanted1
[1] NA
> info_wanted2
[[1]]
character(0)

But it returns nothing. I expected it to at least give me whatever is enclosed in { }. Have I made a mistake? Any suggestions?

Thanks!

This is the cleanest regex parsing I was able to construct:

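js2 <- strsplit(js_code, "values")  # first split on "values"
# Then drop the first item, which precedes the first instance, and work on the rest.

js3 <- lapply(js2[[1]][-1], function(tx) {
  regmatches(tx, gregexpr("\\{[^}]+\\}", tx))
})

The [^}]+\\} part of the pattern uses a negated character class. It basically says: starting at the opening curly brace, return all characters that are not a closing curly brace, up to and including the first closing curly brace after the text values.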
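The split-then-match idea can be tried without downloading the page. The following is a minimal sketch, assuming a shortened sample string modelled on the snippet in the question (not the live page); the names `chunks` and `blocks` are illustrative only, and `regexpr` is used instead of `gregexpr` so that only the first brace block of each chunk is kept.

```r
# Sample text modelled on the snippet in the question (an assumption, not the live page).
js_code <- paste(
  'var selectoptions = {',
  '  "Region A": {',
  '  "key" : "a",',
  '  "defaultvalue" : "a",',
  '  "values" : { //key : value',
  '    "(A01) A1": "a01",',
  '    "(A02) A2": "a02"',
  '  }',
  ' },',
  '  "Region B": {',
  '  "key" : "b",',
  '  "defaultvalue" : "b",',
  '  "values" : { //key : value',
  '    "(B01) B1": "b01",',
  '    "(B02) B2": "b02"',
  '  }',
  ' }',
  '}',
  sep = "\n")

# Split on the literal text "values" and drop the piece before its first occurrence.
chunks <- strsplit(js_code, "values", fixed = TRUE)[[1]][-1]

# In each chunk, keep the first "{ ... }" run that contains no closing brace inside it.
blocks <- vapply(chunks, function(tx) {
  regmatches(tx, regexpr("\\{[^}]+\\}", tx))
}, character(1), USE.NAMES = FALSE)

cat(blocks, sep = "\n---\n")
```

Each element of `blocks` should then hold one region's key/value block, still including the `//key : value` comment.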

----

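Earlier exploration:

First, I assigned that text to a variable named txt, without using any of the read operations that would have broken it up at the line breaks.

Your pattern does not match in that text: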

> regmatches(txt, gregexpr("\\{.?\\n\\}", txt) )
[[1]]
character(0)

But with a small modification it does:

> regmatches(txt, gregexpr("\\{.+\\n\\}", txt) )
[[1]]
[1] "{\n  \"Region A\": {\n  \"key\" : \"a\",\n  \"defaultvalue\" : \"a\",\n  \"values\" : { //key : value\n                    \"(A01) A1\": \"a01\",\n                    \"(A02) A2\": \"a02\",\n                    \"(A03) A3\": \"a03\",\n                    \"(A04) A4\": \"a04\"\n  }\n }, \n  \"Region B\": {\n  \"key\" : \"b\",\n  \"defaultvalue\" : \"b\",\n  \"values\" : { //key : value\n                    \"(B01) B1\": \"b01\",\n                    \"(B02) B2\": \"b02\",\n                    \"(B03) B3\": \"b03\",\n                    \"(B04) B4\": \"b04\"\n  }\n }\n}"

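Because regular expressions are usually "greedy", the algorithm finds the first match and then extends it as far as it can, all the way to the last curly brace.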
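To make the greedy behaviour concrete, here is a small sketch on a made-up one-line string (the string and the variable names are assumptions for illustration). It also shows that a lazy quantifier alone does not isolate an inner block, because the match still starts at the outermost opening brace. Note that with `perl = TRUE` the dot does not match a newline unless `(?s)` is added, and stringr's default ICU engine also stops the dot at line terminators, which may be why the pattern in the question found nothing in the multi-line script.

```r
# Made-up single-line example with nested braces.
txt <- '{ "a": { "x": 1 }, "b": { "y": 2 } }'

# Greedy: runs from the first "{" to the last "}".
greedy <- regmatches(txt, regexpr("\\{.+\\}", txt))

# Lazy (PCRE): stops at the first "}", but still starts at the outermost "{".
lazy <- regmatches(txt, regexpr("\\{.+?\\}", txt, perl = TRUE))

print(greedy)
print(lazy)
```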

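To defeat the greediness of the regex, you first need to split the text into separate character-vector elements at a suitable delimiter, here the string: values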

> js2 <- strsplit(js_code,  "values")
> js3 <- lapply( js2[[1]], function(tx) {regmatches(tx, gregexpr("\\{.+\\}", tx) ) })
> js3[[1]]
[[1]]
[1] "{\r\n\t\t //create a bubble popup for each DOM element with class attribute as \"text\", \"button\" or \"link\" and LI, P, IMG elements.\r\n\t\t $('.link-1').CreateBubblePopup({\r\n  position : 'top',\r\n  align : 'center',\r\n  innerHtml: 'Terms and Definitions',\r\n  innerHtmlStyle: {\r\n\t\t\t  color:'#FFFFFF', \r\n\t\t\t  'text-align':'center',\r\n\t\t\t  'padding':'5px'\r\n\t\t\t },\r\n  themeName: 'all-black',\r\n  themePath: 'images/jquerybubblepopup-theme'\r\n });\r\n\t\t $('.link-2').CreateBubblePopup({\r\n  position : 'top',\r\n  align\t : 'center',\r\n  innerHtml: 'Data Dissemination Events',\r\n  innerHtmlStyle: {\r\n   color:'#FFFFFF', \r\n   'text-align':'center',\r\n   'padding':'5px'\r\n  },\r\n  themeName: \t'all-black',\r\n  themePath: \t'images/jquerybubblepopup-theme'\r\n });\r\n $('.link-3').CreateBubblePopup({\r\n  position : 'top',\r\n  align\t : 'center',\r\n  innerHtml: 'Download 2011 District Council Electoral Boundaries Index Map',\r\n  innerHtmlStyle: {\r\n   color:'#FFFFFF', \r\n   'text-align':'center',\r\n   'padding':'5px'\r\n  },\r\n  themeName: \t'all-black',\r\n\t\t\t\tthemePath: \t'images/jquerybubblepopup-theme'\r\n });\r\n  });\r\n  $(document).ready(function(){\r\n\t  var options = {\r\n\t\t\t\tpreselectFirst : \"hki\",\r\n\t\t\t\tpreselectSecond : \"a01\",\r\n\t\t\t\temptyOption: false,\r\n\t\t\t\temptyValue: 'Please Select',\r\n\t\t\t\temptyKey: '-'\r\n }"

> js3[[2]]
[[1]]
[1] "{ //key : value\r\n\t\t\t\t\t\"(A01) Chung Wan\": \"a01\",\r\n\t\t\t\t\t\"(A02) Mid Levels East\": \"a02\",\r\n\t\t\t\t\t\"(A03) Castle Road\": \"a03\",\r\n\t\t\t\t\t\"(A04) Peak\": \"a04\",\r\n\t\t\t\t\t\"(A05) University\": \"a05\",\r\n\t\t\t\t\t\"(A06) Kennedy Town & Mount Davis\": \"a06\",\r\n\t\t\t\t\t\"(A07) Kwun Lung\": \"a07\",\r\n\t\t\t\t\t\"(A08) Sai Wan\": \"a08\",\r\n\t\t\t\t\t\"(A09) Belcher\": \"a09\",\r\n\t\t\t\t\t\"(A10) Shek Tong Tsui\": \"a10\",\r\n\t\t\t\t\t\"(A11) Sai Ying Pun\": \"a11\",\r\n\t\t\t\t\t\"(A12) Sheung Wan\": \"a12\",\r\n\t\t\t\t\t\"(A13) Tung Wah\": \"a13\",\r\n\t\t\t\t\t\"(A14) Centre Street\": \"a14\",\r\n\t\t\t\t\t\"(A15) Water Street\": \"a15\",\r\n\t\t\t\t\t\"(B01) Hennessy\": \"b01\",\r\n\t\t\t\t\t\"(B02) Oi Kwan\": \"b02\",\r\n\t\t\t\t\t\"(B03) Canal Road\": \"b03\",\r\n\t\t\t\t\t\"(B04) Causeway Bay\": \"b04\",\r\n\t\t\t\t\t\"(B05) Tai Hang\": \"b05\",\r\n\t\t\t\t\t\"(B06) Jardine's Lookout\": \"b06\",\r\n\t\t\t\t\t\"(B07) Broadwood\": \"b07\",\r\n\t\t\t\t\t\"(B08) Happy Valley\": \"b08\",\r\n\t\t\t\t\t\"(B09) Stubbs Road\": \"b09\",\r\n\t\t\t\t\t\"(B10) Southorn\": \"b10\",\r\n\t\t\t\t\t\"(B11) Tai Fat Hau\": \"b11\",\r\n\t\t\t\t\t\"(C01) Tai Koo Shing West\": \"c01\",\r\n\t\t\t\t\t\"(C02) Tai Koo Shing East\": \"c02\",\r\n\t\t\t\t\t\"(C03) Lei King Wan\": \"c03\",\r\n\t\t\t\t\t\"(C04) Aldrich Bay\": \"c04\",\r\n\t\t\t\t\t\"(C05) Shaukeiwan\": \"c05\",\r\n\t\t\t\t\t\"(C06) A Kung Ngam\": \"c06\",\r\n\t\t\t\t\t\"(C07) Heng Fa Chuen\": \"c07\",\r\n\t\t\t\t\t\"(C08) Tsui Wan\": \"c08\",\r\n\t\t\t\t\t\"(C09) Yan Lam\": \"c09\",\r\n\t\t\t\t\t\"(C10) Siu Sai Wan\": \"c10\",\r\n\t\t\t\t\t\"(C11) King Yee\": \"c11\",\r\n\t\t\t\t\t\"(C12) Wan Tsui\": \"c12\",\r\n\t\t\t\t\t\"(C13) Fei Tsui\": \"c13\",\r\n\t\t\t\t\t\"(C14) Mount Parker\": \"c14\",\r\n\t\t\t\t\t\"(C15) Braemar Hill\": \"c15\",\r\n\t\t\t\t\t\"(C16) Tin Hau\": \"c16\",\r\n\t\t\t\t\t\"(C17) Fortress Hill\": \"c17\",\r\n\t\t\t\t\t\"(C18) Victoria Park\": 
\"c18\",\r\n\t\t\t\t\t\"(C19) City Garden\": \"c19\",\r\n\t\t\t\t\t\"(C20) Provident\": \"c20\",\r\n\t\t\t\t\t\"(C21) Fort Street\": \"c21\",\r\n\t\t\t\t\t\"(C22) Kam Ping\": \"c22\",\r\n\t\t\t\t\t\"(C23) Tanner\": \"c23\",\r\n\t\t\t\t\t\"(C24) Healthy Village\": \"c24\",\r\n\t\t\t\t\t\"(C25) Quarry Bay\": \"c25\",\r\n\t\t\t\t\t\"(C26) Nam Fung\": \"c26\",\r\n\t\t\t\t\t\"(C27) Kornhill\": \"c27\",\r\n\t\t\t\t\t\"(C28) Kornhill Garden\": \"c28\",\r\n\t\t\t\t\t\"(C29) Hing Tung\": \"c29\",\r\n\t\t\t\t\t\"(C30) Sai Wan Ho\": \"c30\",\r\n\t\t\t\t\t\"(C31) Lower Yiu Tung\": \"c31\",\r\n\t\t\t\t\t\"(C32) Upper Yiu Tung\": \"c32\",\r\n\t\t\t\t\t\"(C33) Hing Man\": \"c33\",\r\n\t\t\t\t\t\"(C34) Lok Hong\": \"c34\",\r\n\t\t\t\t\t\"(C35) Tsui Tak\": \"c35\",\r\n\t\t\t\t\t\"(C36) Yue Wan\": \"c36\",\r\n\t\t\t\t\t\"(C37) Kai Hiu\": \"c37\",\r\n\t\t\t\t\t\"(D01) Aberdeen\": \"d01\",\r\n\t\t\t\t\t\"(D02) Ap Lei Chau Estate\": \"d02\",\r\n\t\t\t\t\t\"(D03) Ap Lei Chau North\": \"d03\",\r\n\t\t\t\t\t\"(D04) Lei Tung I\": \"d04\",\r\n\t\t\t\t\t\"(D05) Lei Tung II\": \"d05\",\r\n\t\t\t\t\t\"(D06) South Horizons East\": \"d06\",\r\n\t\t\t\t\t\"(D07) South Horizons West\": \"d07\",\r\n\t\t\t\t\t\"(D08) Wah Kwai\": \"d08\",\r\n\t\t\t\t\t\"(D09) Wah Fu I\": \"d09\",\r\n\t\t\t\t\t\"(D10) Wah Fu II\": \"d10\",\r\n\t\t\t\t\t\"(D11) Pokfulam\": \"d11\",\r\n\t\t\t\t\t\"(D12) Chi Fu\": \"d12\",\r\n\t\t\t\t\t\"(D13) Tin Wan\": \"d13\",\r\n\t\t\t\t\t\"(D14) Shek Yue\": \"d14\",\r\n\t\t\t\t\t\"(D15) Wong Chuk Hang\": \"d15\",\r\n\t\t\t\t\t\"(D16) Bays Area\": \"d16\",\r\n\t\t\t\t\t\"(D17) Stanley & Shek O\": \"d17\"\r\n  }\r\n }"

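You would then need to "clean up" by trimming the unnecessary material from the leading and trailing parts of these chunks and, as it turns out, at least drop the first item, which does not look like the table you want.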
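One possible way to do that clean-up is to pull the quoted "label": "code" pairs out of a block directly, which ignores the braces and the `//key : value` comment. This is only a sketch under assumptions: `block` is a shortened stand-in for one element of the output above, and `pair_pat`, `labels`, `codes` and `values` are hypothetical names. It also assumes that neither labels nor codes contain a double quote.

```r
# Shortened stand-in for one extracted block (an assumption, not the full output).
block <- '{ //key : value\r\n\t"(A01) Chung Wan": "a01",\r\n\t"(A02) Mid Levels East": "a02"\r\n  }'

# A quoted label, a colon, and a quoted code, with optional whitespace around the colon.
pair_pat <- '"([^"]+)"[[:space:]]*:[[:space:]]*"([^"]+)"'

# Find every pair, then pull out the two captured groups.
pairs  <- regmatches(block, gregexpr(pair_pat, block))[[1]]
labels <- sub(pair_pat, "\\1", pairs)
codes  <- sub(pair_pat, "\\2", pairs)

# Named character vector: names are the labels, values are the codes.
values <- setNames(codes, labels)
print(values)
```

Applying the same steps to each block with `lapply` would give one named vector per block.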
