Web-Scraping using R (I want to extract some table like data from a website)
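I am having some trouble scraping data from a website, and I do not have much experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.fatf-gafi.org/countries/
More precisely, I want to extract the list of countries that are subject to some kind of sanction.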
library(XML)
url <- paste0("https://www.fatf-gafi.org/countries/")
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")
But this does not show the expected information, because it is not held in a table but in nested divs.
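This is a tricky parsing job. The information you need is not in the html you get from readLines. Instead, the page loads it dynamically with an XHR request. Usually an XHR request like this returns a json string, but in your case it returns javascript, in which the information is stored as a variable containing an array of json fragments, one per country. Some string manipulation and json parsing turn this into the final result: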
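library(httr)
# The country data lives in a separate javascript file, not in the page's html
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js <- content(GET(url), 'text')
# Keep only the text that follows the variable assignment
vars <- strsplit(js, 'var countries = ')[[1]][2]
# Split the array into one string per country and restore the braces
vars <- paste0("{", sub("^\\[\\{", "", strsplit(vars, '\\},\\{')[[1]]), "}")
# Parse the first 209 fragments as json and stack them into one data frame
countries <- do.call(rbind, lapply(vars[1:209],
  function(x) as.data.frame(jsonlite::parse_json(x))))
# Keep the name and membership columns, then drop any prefix up to the last dot
countries <- countries[c(1, 4:13)]
names(countries) <- sub('^.*\\.', '', names(countries))
dplyr::tibble(countries)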
#> # A tibble: 209 x 11
#> name FATF APG CFATF EAG ESAAMLG GABAC GAFILAT GIABA MENAFATF MONEYVAL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghani~ "" "mbr" "" "obs" "" "" "" "" "" ""
#> 2 Albania "" "" "" "" "" "" "" "" "" "mbr"
#> 3 Algeria "" "" "" "" "" "" "" "" "mbr" ""
#> 4 Andorra "" "" "" "" "" "" "" "" "" "mbr"
#> 5 Angola "" "" "" "" "mbr" "" "" "" "" ""
#> 6 Anguilla "" "" "mbr" "" "" "" "" "" "" ""
#> 7 Antigua~ "" "" "mbr" "" "" "" "" "" "" ""
#> 8 Argenti~ "mbr" "non" "non" "non" "non" "" "mbr" "non" "non" "non"
#> 9 Armenia "" "" "" "obs" "" "" "" "" "" "mbr"
#> 10 Aruba K~ "els" "" "mbr" "" "" "" "" "" "" ""
#> # ... with 199 more rows
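The answer above splits the script on `},{` and parses each piece separately. As a rough alternative sketch, the whole array literal can be cut out with a regular expression and handed to `jsonlite::fromJSON()` in one go. This is only an illustration: the `js` string below is a made-up two-entry stand-in, and it assumes the array is valid json and ends with `];`, which may not hold for the real file.

```r
library(jsonlite)

# Hypothetical stand-in for the downloaded script; the real file is much
# larger and its exact layout is an assumption here.
js <- 'var lang = "en"; var countries = [{"name":"Albania","code":"AL"},{"name":"Algeria","code":"DZ"}];'

# Capture the assignment from "var countries = [" up to the first "];"
m <- regmatches(js, regexpr("var countries = \\[.*?\\];", js, perl = TRUE))

# Strip the trailing semicolon and the leading assignment to leave the bare array
array_text <- sub("^var countries = ", "", sub(";$", "", m))

# fromJSON() simplifies an array of flat objects into a data frame
sample_countries <- fromJSON(array_text)
print(sample_countries)
```

Parsing the array as a whole avoids hard-coding the number of entries, but it is less forgiving if the script contains anything that is not strict json.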
Just to test how JavaScript evaluation works with V8, an embedded JavaScript and WebAssembly engine.
https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html
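Create a context engine, evaluate the requested JavaScript, and get the value of the countries variable from V8. It comes back as a nested data frame, hence the unnest(), and the last row is filled with NA, hence the filter.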
library(httr)
library(V8)
library(dplyr)
library(tidyr)
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
'js/country-data-multi-lang.js')
js_content <- content(GET(url), 'text')
ct <- v8()
ct$eval(js_content)
ct$get("countries") %>%
unnest(cols = c(groups)) %>%
select(c(1:2,4:14,16)) %>%
filter(!is.na(name))
#> # A tibble: 209 × 14
#> name code FATF APG CFATF EAG ESAAMLG GABAC GAFILAT GIABA MENAFATF
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanist… AF "" "mbr" "" "obs" "" "" "" "" ""
#> 2 Albania AL "" "" "" "" "" "" "" "" ""
#> 3 Algeria DZ "" "" "" "" "" "" "" "" "mbr"
#> 4 Andorra AD "" "" "" "" "" "" "" "" ""
#> 5 Angola AO "" "" "" "" "mbr" "" "" "" ""
#> 6 Anguilla AI "" "" "mbr" "" "" "" "" "" ""
#> 7 Antigua a… AG "" "" "mbr" "" "" "" "" "" ""
#> 8 Argentina AR "mbr" "non" "non" "non" "non" "" "mbr" "non" "non"
#> 9 Armenia AM "" "" "" "obs" "" "" "" "" ""
#> 10 Aruba Kin… AW "els" "" "mbr" "" "" "" "" "" ""
#> # … with 200 more rows, and 3 more variables: MONEYVAL <chr>,
#> # jurisdiction <chr>, id <chr>
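To see what `ct$get()` hands back without touching the network, the same steps can be tried on a small inline script. This is a minimal sketch: the two-entry `countries` array is invented for illustration and is much simpler than the real file (it has no nested `groups` field, so no `unnest()` is needed).

```r
library(V8)

# Create a fresh JavaScript context
ct <- v8()

# Hypothetical two-entry stand-in for the contents of the real script
ct$eval('var countries = [{"name":"Albania","code":"AL"},{"name":"Algeria","code":"DZ"}];')

# V8 converts the JavaScript array of objects into an R data frame
sample_countries <- ct$get("countries")
print(sample_countries)
```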