R html_node challenge: applying multiple html_nodes calls to extract the same information, then combining the results
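I have run into a problem: the site's layout is inconsistent. I want to extract the names from a page. However, some pages store the name in an <a> element, some store it in both <a> and <span>, and some store it only in <span>.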
library(rvest)
url <- "https://stackoverflow.com/questions/12573816/what-is-an-undefined-reference-unresolved-external-symbol-error-and-how-do-i-fix"
page <- read_html(url, encoding = "utf-8")
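So my idea was to extract the names from <a> into one vector and the names from <span> into another, then compare the two and combine them. However, it is hard to merge the two vectors into a single vector that contains every record exactly once and in the correct order.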
user_answeredquestion_a = page %>% html_nodes(xpath="//div[starts-with(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()]") %>%
html_text()
user_answeredquestion_a
user_answeredquestion_span = page %>% html_nodes(xpath="//div[contains(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%
html_text()
user_answeredquestion_span
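The page should contain 30 records, so the final vector of user names should have length 30. However, user_answeredquestion_span returns only 29 records, because it misses the record Kastaneda.
Likewise, user_answeredquestion_a returns 29 records; it misses the record user4272649.
In this situation it is really hard to compare and combine the two vectors into a new vector that has the correct order and contains all 30 records.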
###which elements are missing from y with respect to x
# x[!x %in% y]
### missing from user_answeredquestion_span
user_answeredquestion_a[!user_answeredquestion_a %in% user_answeredquestion_span]
##"Kastaneda"
### missing from user_answeredquestion_a
user_answeredquestion_span[!user_answeredquestion_span %in% user_answeredquestion_a]
### "user4272649"
I also tried using both XPath expressions at once, which returns 58 records. That makes no sense to me.
### To get name from <a> or <span>
user_answeredquestion_all = page %>% html_nodes(xpath="//div[starts-with(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()] | //div[contains(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%
html_text()
user_answeredquestion_all
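A note on the 58: the XPath `|` operator returns the union of the two node sets, and an <a> node and a <span> node are never the same node even when they hold the same text, so the count is simply 29 + 29. The sketch below illustrates this on a small hypothetical page; the markup and the names Alice, Bob and user123 are made up for illustration and are far simpler than the real Stack Overflow HTML.

```r
library(rvest)

# Toy markup (hypothetical): answer 1 has both <a> and <span>,
# answer 2 has only <a>, answer 3 has only <span>.
html <- '
<div id="answer-1"><a>Alice</a><span>Alice</span></div>
<div id="answer-2"><a>Bob</a></div>
<div id="answer-3"><span>user123</span></div>'
page <- read_html(html)

xp_a    <- "//div[starts-with(@id, 'answer-')]/a"
xp_span <- "//div[starts-with(@id, 'answer-')]/span"

a_nodes    <- html_nodes(page, xpath = xp_a)
span_nodes <- html_nodes(page, xpath = xp_span)
# The union keeps every matching <a> and every matching <span>;
# nothing is merged on the basis of identical text.
both_nodes <- html_nodes(page, xpath = paste(xp_a, xp_span, sep = " | "))

length(a_nodes)     # 2
length(span_nodes)  # 2
length(both_nodes)  # 4, although there are only 3 answers
```

So a union only helps when each record matches exactly one of the two expressions.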
What is the correct way to handle an inconsistent page structure?
A screenshot of the HTML for the Kastaneda record accompanied the original question.
Try this:
library(rvest)
library(xml2)  # provides xml_remove(); recent rvest versions do not attach xml2 automatically
url <- "https://stackoverflow.com/questions/12573816/what-is-an-undefined-reference-unresolved-external-symbol-error-and-how-do-i-fix?page=1&tab=votes#tab-top"
path_to_flairs <- "//div[@class='-flair']"
path_to_answerers <- "//div[@class='grid fw-wrap ai-start jc-end gs8 gsy']/div[last()]/div/div[@class='user-details'][last()]/*[last()]"
page <- read_html(url)
# remove user flairs (e.g. reputation and gold badges) so that the user name is always the last child
xml_remove(html_nodes(page, xpath = path_to_flairs))
page %>% html_nodes(xpath = path_to_answerers) %>% html_text()
Output
[1] "Luchian Grigore" "Luchian Grigore" "Luchian Grigore" "Luchian Grigore" "Svalorzen" "Kastaneda" "Luchian Grigore" "sgryzko" "Luchian Grigore"
[10] "Luchian Grigore" "Nima Soroush" "πάντα ῥεῖ" "Dula" "Niall" "Malvineous" "user4272649" "developerbmw" "Niall"
[19] "kiriloff" "Plankalkül" "Mike Kinghan" "JDiMatteo" "Niall" "fafaro" "Niall" "Niall" "Andreas H."
[28] "Stypox" "ead" "Mike Kinghan"
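An alternative that keeps the two-vector idea from the question is to select the answer containers first and then call html_node() (singular) on them. html_node() returns exactly one result per container, with a missing node where nothing matches, so the <a> vector and the <span> vector stay aligned and can be combined position by position. This is only a minimal sketch on hypothetical markup: the `user-details` structure below is simplified, the names are invented, and the relative XPath would need adapting to the real page.

```r
library(rvest)

# Toy markup (hypothetical): one container per answer.
html <- '
<div id="answer-1"><div class="user-details"><a>Alice</a><span>Alice</span></div></div>
<div id="answer-2"><div class="user-details"><a>Bob</a></div></div>
<div id="answer-3"><div class="user-details"><span>user123</span></div></div>'
page <- read_html(html)

# One node per answer; everything below is looked up relative to these.
answers <- html_nodes(page, xpath = "//div[starts-with(@id, 'answer-')]")

# html_node() yields one entry per answer; html_text() gives NA where
# the answer has no matching element.
from_a <- answers %>%
  html_node(xpath = ".//div[@class='user-details']/a[last()]") %>%
  html_text()
from_span <- answers %>%
  html_node(xpath = ".//div[@class='user-details']/span") %>%
  html_text()

# Prefer the <a> text and fall back to the <span> text when <a> is absent.
user_names <- ifelse(is.na(from_a), from_span, from_a)
user_names
```

Because both vectors have one slot per answer, the combined vector has the same length as `answers` and preserves page order, with no need to compare the vectors by value.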