R - Parsing XML from multiple URLs into a data frame with tidyverse and xml2

Question

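This question might get flagged as a duplicate, but I can't get it to work. For the record, I have read all the other Stack Overflow questions as well as the documentation.

I want to extract review data across multiple pages from iTunes (link: https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=1/xml), and I want to do it in a tidy and dynamic way, preferably using xml2 and tidyverse.

My final goal is:

A data frame with one column for each available field (such as id, author, etc.), populated with the data.

My struggle starts right at the beginning. I can read the link and get it as XML, but I can't run even a simple line of code against the extracted XML. I am clearly missing something here. I also don't know how to go through the pages. I know how many pages exist, but I want to handle that dynamically.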

library("tidyverse")
library("xml2")


# Data extraction ---------------------------------------------------------

df_xml <- read_xml('https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=1/xml')

# Here I try to extract the author field
teste <- xml_text(xml_find_all(df_xml, '//feed/entry/ author'))
teste
#> character(0)

Thanks, everyone.

Answer 1

The problem is that when you call xml_find_all(df_xml, '//feed/entry/ author'), the search doesn't find the nodes you are looking for, because they are all inside an xml namespace.

uri <- "https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=1/xml"
my_xml <- read_xml(uri)
xml_find_all(my_xml, "//feed")
#> {xml_nodeset (0)}

You can find out which namespaces are used in your document like this:

xml_ns(my_xml)
#> d1 <-> http://www.w3.org/2005/Atom
#> im <-> http://itunes.apple.com/rss

So you can specify the namespace to use in your xpath, and you will get the nodes you are looking for, like this:

xml_find_all(my_xml, "//d1:feed")
#> {xml_nodeset (1)}
#> [1] <feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom ...
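The same behaviour can be reproduced offline. The following is a minimal sketch on a small made-up document (the entries, names and ratings are invented for illustration and are not taken from the real feed); it assumes xml2 assigns the d1 prefix to the default namespace, as in the xml_ns output above:

```r
library(xml2)

# A tiny, hypothetical Atom-like document that mimics the feed's namespaces
doc <- read_xml('<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://itunes.apple.com/rss">
  <entry><author><name>alice</name></author><im:rating>5</im:rating></entry>
  <entry><author><name>bob</name></author><im:rating>3</im:rating></entry>
</feed>')

# Without a prefix nothing matches, because the nodes sit in the default namespace
no_prefix <- xml_find_all(doc, "//feed/entry/author/name")
length(no_prefix)

# With the d1: prefix on every step, the same path matches
authors <- xml_text(xml_find_all(doc, "//d1:feed/d1:entry/d1:author/d1:name"))

# Tags that already carry an explicit prefix (im:) keep that prefix
ratings <- xml_text(xml_find_all(doc, "//d1:entry/im:rating"))
```

Note that every step of the path needs a prefix, not only the first one.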

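This is obviously a bit annoying, since you have to prefix every tag in your xpath with d1:. Your document is structured such that you can do without the namespaces, so it's better to just ignore them.

I find the easiest way to do this is to use read_html instead of read_xml, since among other things it automatically strips namespaces and is more forgiving of errors. However, if you prefer, you can call the function xml_ns_strip after reading with read_xml.

So your three options for dealing with namespaces in this document are:

Use the d1: prefix
Use read_xml followed by xml_ns_strip
Use read_html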
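As a rough side-by-side comparison, the three options can be tried on the same input. This is only a sketch on a made-up one-entry document (xml_string and its contents are hypothetical), not on the real feed:

```r
library(xml2)

# Hypothetical one-entry document with a default namespace
xml_string <- '<feed xmlns="http://www.w3.org/2005/Atom"><entry><updated>2020-05-05</updated></entry></feed>'

# Option 1: keep the namespace and prefix every tag with d1:
opt1 <- xml_text(xml_find_all(read_xml(xml_string), "//d1:entry/d1:updated"))

# Option 2: read as xml, then strip the default namespace
opt2 <- xml_text(xml_find_all(xml_ns_strip(read_xml(xml_string)), "//entry/updated"))

# Option 3: parse with the html parser, which does not use namespaces
opt3 <- xml_text(xml_find_all(read_html(xml_string), "//entry/updated"))
```

All three should return the same text here. One caveat with the third option: the html parser treats some tag names specially, so it may not preserve every document as faithfully as read_xml does.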

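This code will loop through all the pages of xml and give you a character vector of all the available reviews. You will notice that each page of xml has 100 content tags, though; that is because there are two content tags inside each entry tag. One of them has the raw text of the review, and the other has the same content in the form of an html string. The loop therefore discards the html strings in favour of the raw text: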

library("tidyverse")
library("xml2")

base <- "https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page="
reviews <- author <- review_date <- character()
max_pages <- 100

for(i in seq(max_pages))
{
  cat("Trying", paste0(base, i, "/xml"), "\n")
  # Read the page and strip the default namespace so plain xpaths work
  my_xml       <- paste0(base, i, "/xml") %>% read_xml() %>% xml_ns_strip()
  # Each entry has two content tags (raw text, then html): keep the odd ones
  next_reviews <- xml_find_all(my_xml, xpath = '//feed/entry/content') %>% 
                  xml_text() %>%
                  subset(seq_along(.) %% 2 == 1)  
  # A page with no reviews means we have run out of pages
  if(length(next_reviews) == 0){
    result <- tibble(review_date, author, reviews)
    break
  }

  # Append this page's reviews, authors and dates to the running vectors
  reviews      <- c(reviews, next_reviews)
  next_author  <- xml_text(xml_find_all(my_xml, xpath = '//feed/entry/author/name'))
  author       <- c(author, next_author)
  next_date    <- xml_text(xml_find_all(my_xml, xpath = '//feed/entry/updated'))
  review_date  <- c(review_date, next_date)
}
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=1/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=2/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=3/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=4/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=5/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=6/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=7/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=8/xml 
#> Trying https://itunes.apple.com/gb/rss/customerreviews/id=1388411277/page=9/xml

Now result will contain a tibble with the three fields of interest:

result
#> # A tibble: 367 x 3
#>    review_date          author      reviews                                           
#>    <chr>                <chr>       <chr>                                             
#>  1 2020-05-05T02:38:35~ **stace**   "Really good and useful app. Nice to be able to g~
#>  2 2020-05-05T01:51:49~ fire-hazza~ "Not for Scotland or Wales cmon man"              
#>  3 2020-05-04T23:45:59~ Adz-Coco    "Unable to register due to NHS number. My number ~
#>  4 2020-05-04T23:34:50~ Matthew ba~ "Probably spent about £5 developing this applicat~
#>  5 2020-05-04T16:40:17~ Jenny19385~ "Why it is so complicated to sign up an account? ~
#>  6 2020-05-04T14:39:54~ Sienna hea~ "Thankyou NHS for this excellent app I feel a lot~
#>  7 2020-05-04T13:09:45~ Raresole    "A great app that lets me book appointments and a~
#>  8 2020-05-04T12:28:56~ chanters934 "Unable to login. App doesn’t recognise the code ~
#>  9 2020-05-04T11:26:44~ Ad_T        "Unfortunately my surgery must not be participati~
#> 10 2020-05-04T08:25:17~ tonyproctor "It’s a good app although would be better with a ~
#> # ... with 357 more rows
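The loop above builds three separate vectors and assumes they always line up. As a possible variation (a sketch only, not part of the original answer), each page could be turned into a tibble entry by entry, so that an entry with a missing field yields NA instead of shifting the columns. The helper name parse_entries and the sample document are hypothetical, and the type="text" attribute on the raw-text content tag is an assumption about the feed that should be checked against the real xml:

```r
library(xml2)
library(tibble)

# Hypothetical helper: convert one parsed, namespace-stripped page to a tibble
parse_entries <- function(doc) {
  entries <- xml_find_all(doc, "//feed/entry")
  tibble(
    # xml_find_first() returns one result per entry, missing where nothing matches
    review_date = xml_text(xml_find_first(entries, "./updated")),
    author      = xml_text(xml_find_first(entries, "./author/name")),
    # Assumption: the raw-text copy of the review is marked with type="text"
    reviews     = xml_text(xml_find_first(entries, "./content[@type='text']"))
  )
}

# Made-up page: the second entry deliberately has no author
page <- xml_ns_strip(read_xml('<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><updated>2020-05-05</updated><author><name>alice</name></author>
    <content type="text">Great</content><content type="html">&lt;p&gt;Great&lt;/p&gt;</content></entry>
  <entry><updated>2020-05-04</updated>
    <content type="text">Poor</content></entry>
</feed>'))

result_sketch <- parse_entries(page)
```

In the page loop, the per-page tibbles could then be collected in a list and combined with dplyr::bind_rows, stopping at the first page that produces zero rows.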

Answer 1: accepted, 2020-05-05 12:01:53
