R html_node challenge: applying multiple html_node calls to extract the same information, then merging the results

Question

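I'm facing a challenge with a website whose layout is inconsistent. I want to extract user names from the page.
However, some pages store the name in <a>, some store it in both <a> and <span>, and some store it only in <span>.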

library(rvest)

url <- "https://stackoverflow.com/questions/12573816/what-is-an-undefined-reference-unresolved-external-symbol-error-and-how-do-i-fix"

page <- read_html(url, encoding = "utf-8")

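So my idea was to extract the names from <a> into one vector and the names from <span> into another, then compare the two and combine them. However, it is hard to join the two vectors into a single vector that contains every name exactly once and in the correct order.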

user_answeredquestion_a <- page %>%
  html_nodes(xpath = "//div[starts-with(@id, 'answer-')]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()]") %>%
  html_text()
user_answeredquestion_a

user_answeredquestion_span <- page %>%
  html_nodes(xpath = "//div[contains(@id, 'answer-')]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%
  html_text()
user_answeredquestion_span

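The page should contain 30 records, so the final vector of user names should have length 30. However, user_answeredquestion_span returns only 29 records because it misses the record Kastaneda. Likewise, user_answeredquestion_a returns 29 records because it misses the record user4272649.
In this situation it is really hard to compare and combine the two vectors into a new vector that has the correct order and contains all 30 records.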

# Which elements of x are missing from y: x[!x %in% y]

# Missing from user_answeredquestion_span
user_answeredquestion_a[!user_answeredquestion_a %in% user_answeredquestion_span]
# "Kastaneda"

# Missing from user_answeredquestion_a
user_answeredquestion_span[!user_answeredquestion_span %in% user_answeredquestion_a]
# "user4272649"

I also tried using both XPath expressions at once, but that returns 58 records, which makes no sense to me.

# Get the name from <a> or <span>
user_answeredquestion_all <- page %>%
  html_nodes(xpath = "//div[starts-with(@id, 'answer-')]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()] | //div[contains(@id, 'answer-')]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%
  html_text()
user_answeredquestion_all
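
A likely explanation for the 58 records: the XPath union operator `|` returns every node matched by either expression, so an answer whose name appears in both <a> and <span> contributes two nodes (29 + 29 = 58 here). The sketch below reproduces the effect on a small, hypothetical snippet; the markup is an assumption made for illustration and is far simpler than the real page.

```r
library(rvest)

# Hypothetical, simplified markup (NOT the real Stack Overflow structure):
# answer-1 has the name in both <a> and <span>, answer-2 only in <a>,
# answer-3 only in <span>.
html <- '
<div id="answer-1"><a>Alice</a><span>Alice</span></div>
<div id="answer-2"><a>Kastaneda</a></div>
<div id="answer-3"><span>user4272649</span></div>
'
demo_page <- read_html(html)

# The union selects the <a> nodes AND the <span> nodes, not one node per answer.
both <- html_nodes(
  demo_page,
  xpath = "//div[starts-with(@id, 'answer-')]/a | //div[starts-with(@id, 'answer-')]/span"
)

n_answers <- length(html_nodes(demo_page, xpath = "//div[starts-with(@id, 'answer-')]"))
n_matches <- length(both)  # larger than n_answers because answer-1 matches twice
```

With three answers the union yields four nodes, so the result cannot be used directly as one name per answer.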

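What is the correct way to handle an inconsistent page structure like this?

HTML screenshots (images not reproduced here): Kastaneda; user4272649; other users stored in both <a> and <span>; the 29 elements in the vector.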

Answer 1

Try this:

library(rvest)
library(xml2)  # xml_remove() comes from xml2; load it explicitly in case rvest does not attach it

url <- "https://stackoverflow.com/questions/12573816/what-is-an-undefined-reference-unresolved-external-symbol-error-and-how-do-i-fix?page=1&tab=votes#tab-top"
path_to_flairs <- "//div[@class='-flair']"
path_to_answerers <- "//div[@class='grid fw-wrap ai-start jc-end gs8 gsy']/div[last()]/div/div[@class='user-details'][last()]/*[last()]"

page <- read_html(url)
# Remove user flairs (e.g. reputation and gold badges) so that the user name
# is always the last child of the user-details block
xml_remove(html_nodes(page, xpath = path_to_flairs))
# The last child is now the name, whether it is stored in <a> or in <span>
page %>% html_nodes(xpath = path_to_answerers) %>% html_text()

Output

 [1] "Luchian Grigore" "Luchian Grigore" "Luchian Grigore" "Luchian Grigore" "Svalorzen"       "Kastaneda"       "Luchian Grigore" "sgryzko"         "Luchian Grigore"
[10] "Luchian Grigore" "Nima Soroush"    "πάντα ῥεῖ"       "Dula"            "Niall"           "Malvineous"      "user4272649"     "developerbmw"    "Niall"          
[19] "kiriloff"        "Plankalkül"      "Mike Kinghan"    "JDiMatteo"       "Niall"           "fafaro"          "Niall"           "Niall"           "Andreas H."     
[28] "Stypox"          "ead"             "Mike Kinghan"
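
As an alternative to removing nodes, one could select each answer block first and then look up the name inside each block with `html_node()` (singular), which returns exactly one result per block and a missing node where nothing matches. The following is only a sketch on hypothetical, simplified markup: the `user-details` nesting, the `flair` class and the sample names are assumptions for illustration, and the XPath would need adapting to the real page.

```r
library(rvest)

# Hypothetical, simplified markup (an assumption, much shallower than the real page).
html <- '
<div id="answer-1"><div class="user-details"><a>Alice</a><span class="flair">1k</span></div></div>
<div id="answer-2"><div class="user-details"><span>user123</span></div></div>
<div id="answer-3"><div class="user-details"><a>Bob</a><span>Bob</span></div></div>
'
demo_page <- read_html(html)

# One node per answer, in document order.
answers <- html_nodes(demo_page, xpath = "//div[starts-with(@id, 'answer-')]")

# html_node() keeps one entry per answer; html_text() gives NA where no node matched.
from_a    <- html_text(html_node(answers, xpath = ".//div[@class='user-details']/a"))
from_span <- html_text(html_node(answers, xpath = ".//div[@class='user-details']/span[1]"))

# Prefer the <a> text and fall back to the <span> text when <a> is absent.
answerers <- ifelse(is.na(from_a), from_span, from_a)
```

Because both lookups are aligned to the same list of answer blocks, the combined vector keeps one name per answer in page order, which avoids having to merge two vectors of different lengths afterwards.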

Solution 1
1 2020-10-27 13:08:11
