Web scraping using R: I want to extract some table-like data from a website

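I am having trouble scraping data from a website, and I don't have much experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.fatf-gafi.org/countries/

More precisely, I want to extract the list of countries that are subject to some kind of sanction.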

library(XML)

# Download the raw HTML of the page and parse it
url <- "https://www.fatf-gafi.org/countries/"
page_lines <- readLines(url, encoding = "UTF-8")  # avoid masking base::source
parsed_doc <- htmlParse(page_lines, encoding = "UTF-8")

But this does not reveal the expected information, because the data is not in a table; it sits in nested divs.

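This is a tricky parsing job. The information you need is not in the HTML you get from readLines. Instead, the page loads it dynamically through an XHR request. Normally an XHR request like this returns a JSON string, but in your case it returns JavaScript in which the information is stored as a variable holding an array of JSON fragments, one per country. With some string manipulation and JSON parsing you can get to the final result: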

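library(httr)

# The country data is served as a JavaScript file, not as part of the HTML page
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js <- content(GET(url), 'text')

# Keep the text after the variable declaration, split it into one JSON
# fragment per country, and restore the braces removed by the split
vars <- strsplit(js, 'var countries = ')[[1]][2]
vars <- paste0("{", sub("^\\[\\{", "", strsplit(vars, '\\},\\{')[[1]]), "}")

# Parse the first 209 fragments and stack them into one data frame
countries <- do.call(rbind, lapply(vars[1:209], 
                      function(x) as.data.frame(jsonlite::parse_json(x))))

# Keep the name and membership columns; strip everything up to the last dot
countries <- countries[c(1, 4:13)]
names(countries) <- sub('^.*\\.', '', names(countries))

dplyr::as_tibble(countries)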
#> # A tibble: 209 x 11
#>   name     FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF MONEYVAL
#>   <chr>    <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>    <chr>   
#> 1 Afghani~ ""    "mbr" ""    "obs" ""      ""    ""      ""    ""       ""      
#> 2 Albania  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 3 Algeria  ""    ""    ""    ""    ""      ""    ""      ""    "mbr"    ""      
#> 4 Andorra  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 5 Angola   ""    ""    ""    ""    "mbr"   ""    ""      ""    ""       ""      
#> 6 Anguilla ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 7 Antigua~ ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 8 Argenti~ "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"    "non"   
#> 9 Armenia  ""    ""    ""    "obs" ""      ""    ""      ""    ""       "mbr"   
#> 10 Aruba K~ "els" ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> # ... with 199 more rows
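
The hard-coded vars[1:209] ties the code to the file's current length. As a hedged alternative, the whole array can be cut out with one regular expression and handed to jsonlite in a single call. This is a minimal sketch that runs offline: the js string below is a made-up stand-in for the downloaded script, and it assumes the array literal is valid JSON and ends with "];".

```r
library(jsonlite)

# Hypothetical stand-in for the downloaded script; the real file is much
# larger and its exact layout is an assumption here
js <- paste0(
  'var lang = "en";\n',
  'var countries = [',
  '{"name":"Albania","code":"AL","groups":{"FATF":"","MONEYVAL":"mbr"}},',
  '{"name":"Angola","code":"AO","groups":{"FATF":"","MONEYVAL":""}}',
  '];\n',
  'var other = 1;'
)

# Capture the text between "var countries = " and the first "];" after it;
# (?s) lets "." match newlines as well
array_text <- sub("(?s).*var countries = (\\[.*?\\]);.*", "\\1", js, perl = TRUE)

# flatten = TRUE turns the nested "groups" object into columns such as
# groups.FATF, which are then shortened to FATF
countries <- fromJSON(array_text, flatten = TRUE)
names(countries) <- sub("^.*\\.", "", names(countries))

countries
```

If the real file contains anything that is not strict JSON inside the array (unquoted keys, trailing commas), fromJSON will stop with an error, and the fragment-by-fragment approach above is the safer choice.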

Just to test how JavaScript evaluation works with V8, the embedded JavaScript and WebAssembly engine for R:
https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html
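
To see the mechanism in isolation, here is a minimal sketch that runs without network access. The js_sample string is a made-up stand-in for the real script, so only the eval/get round trip is being illustrated, not the actual file contents.

```r
library(V8)

# Made-up stand-in for the downloaded script (not the real file contents)
js_sample <- paste0(
  'var countries = [',
  '{"name":"Albania","code":"AL","groups":{"FATF":"","MONEYVAL":"mbr"}},',
  '{"name":"Angola","code":"AO","groups":{"FATF":"","MONEYVAL":""}}',
  '];'
)

ct <- v8()                         # create a new JavaScript context
ct$eval(js_sample)                 # run the script; defines `countries` inside V8
countries <- ct$get("countries")   # convert the JavaScript array to an R object

# The array of objects comes back as a data frame, and the nested
# "groups" object becomes a data frame column
str(countries)
```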

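Create a context, evaluate the downloaded JavaScript in it, and get the value of the countries variable back from V8. It comes back as a nested data frame, hence the unnest(), and its last row is filled with NA, hence the filter().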

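library(httr)
library(V8)
library(dplyr)
library(tidyr)

url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js_content <- content(GET(url), 'text')

# Create a JavaScript context and evaluate the downloaded script in it
ct <- v8()
ct$eval(js_content)

# Pull the countries variable back into R, flatten the nested groups column,
# keep the columns of interest, and drop the trailing all-NA row
ct$get("countries") %>% 
  unnest(cols = c(groups)) %>%
  select(c(1:2,4:14,16)) %>%
  filter(!is.na(name))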

#> # A tibble: 209 × 14
#>    name       code  FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF
#>    <chr>      <chr> <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>   
#>  1 Afghanist… AF    ""    "mbr" ""    "obs" ""      ""    ""      ""    ""      
#>  2 Albania    AL    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  3 Algeria    DZ    ""    ""    ""    ""    ""      ""    ""      ""    "mbr"   
#>  4 Andorra    AD    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  5 Angola     AO    ""    ""    ""    ""    "mbr"   ""    ""      ""    ""      
#>  6 Anguilla   AI    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  7 Antigua a… AG    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  8 Argentina  AR    "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"   
#>  9 Armenia    AM    ""    ""    ""    "obs" ""      ""    ""      ""    ""      
#> 10 Aruba Kin… AW    "els" ""    "mbr" ""    ""      ""    ""      ""    ""      
#> # … with 200 more rows, and 3 more variables: MONEYVAL <chr>,
#> #   jurisdiction <chr>, id <chr>
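
Once the table is in R, picking out countries by status is mostly a matter of reshaping. Below is a minimal sketch that uses three rows copied from the printed output above, keeping only a few of the columns. What codes such as "mbr" and "obs" stand for is not stated here, so the filter only separates empty cells from non-empty ones.

```r
library(dplyr)
library(tidyr)

# Three rows taken from the printed output above; most columns are omitted
countries <- tibble(
  name     = c("Albania", "Angola", "Armenia"),
  FATF     = c("", "", ""),
  EAG      = c("", "", "obs"),
  ESAAMLG  = c("", "mbr", ""),
  MONEYVAL = c("mbr", "", "mbr")
)

# One row per country/body pair, keeping only pairs with a non-empty status
memberships <- countries %>%
  pivot_longer(-name, names_to = "body", values_to = "status") %>%
  filter(status != "")

memberships
```

From the long format, a condition such as filter(body == "FATF") narrows the result to a single body.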
