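Web scraping with R - html content
I am trying to do some web scraping with R, but I am having trouble extracting the html content from a page.
This is an exercise I am doing on a sample Amazon page that includes a search query.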
library(XML)
#> Warning message:
#> package 'XML' was built under R version 3.5.3
my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"
html_page99 <- htmlTreeParse(my_url99, useInternalNode=TRUE)
#> Warning message:
#> XML content does not seem to be XML: 'https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2'
head(html_page99)
#> Error in `[.XMLInternalDocument`(x, seq_len(n)) :
#> No method for subsetting an XMLInternalDocument with integer
html_page99
#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body><p>https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2</p></body></html>
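But I need to scrape the page above with its full content. I mean the content with the $ sign on the left (maybe that is not the best description) and all the tags.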
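A note on the output above: the parsed document contains nothing but the URL, which suggests that htmlTreeParse treated the string as HTML text instead of downloading the page, most likely because the XML package does not fetch https URLs itself. The head() error is a separate issue: an XMLInternalDocument cannot be subset by integer. The following is a minimal sketch of how such a document can be inspected, assuming the XML package is installed. The inline HTML string and the example.invalid address are made-up stand-ins, not the Amazon page.

```r
library(XML)

# Hypothetical stand-in for a page that was parsed as text.
sample_html <- "<html><body><p>https://example.invalid/page</p></body></html>"

# asText = TRUE states explicitly that the string is content, not a file or URL.
doc <- htmlParse(sample_html, asText = TRUE)

# Use the root node and XPath queries instead of head() or integer subsetting.
root <- xmlRoot(doc)
root_name <- xmlName(root)
paragraphs <- xpathSApply(doc, "//p", xmlValue)

print(root_name)
print(paragraphs)
```

If the only paragraph that comes back is the URL itself, the page was never downloaded, and no XPath query will find prices or ratings in it.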
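It is hard to get the data you want without a fair amount of experience in scraping and string manipulation. As @ThomasL pointed out, the XML library is not the best way forward. Here is how you can get the result you want with the rvest library: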
library(rvest)
#> Loading required package: xml2
library(tibble)
my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"
read_html(my_url99) %>%
  # Take the text of every row-level div and flatten newlines to spaces
  html_nodes(xpath = "//div[@class = 'sg-row']") %>%
  html_text() %>%
  {gsub("\n", " ", .)} %>%
  # Keep rated, non-sponsored rows that show a price
  {grep("5 stars", ., value = TRUE)} %>%
  {grep("Sponsored", ., invert = TRUE, value = TRUE)} %>%
  {gsub("^ +", "", .)} %>%
  {grep("[$]", ., value = TRUE)} %>%
  # Collapse a duplicated price such as "$201.99$201.99" to "$201.99"
  {gsub("[$][0123456789.]+[$]", "$", .)} %>%
  # Split each row into fields on runs of two or more spaces
  strsplit(" {2,50}") %>%
  lapply(function(x) x[x != ""]) %>%
  lapply(function(x) { grep("Buying Choices|Ships to|in stock|new offers",
                            x, invert = TRUE, value = TRUE) }) %>%
  # Fields 1, 2 and 4 are the model, the rating and the price
  lapply(function(x) if(length(x) < 4) NULL else x[c(1, 2, 4)]) %>%
  {do.call(rbind, .)} %>%
  `colnames<-`(c("Model", "Rating","Price")) %>%
  as_tibble() ->
  result
This gives you a 3-column data frame (or tibble) with the model, star rating and price:
result
#> # A tibble: 15 x 3
#> Model Rating Price
#> <chr> <chr> <chr>
#> 1 "Dell Latitude E6430 Laptop WEBCAM - HDMI - Intel Core~ 4.0 out of 5 ~ $201.~
#> 2 "New ! Dell Inspiron i3583 15.6\" HD Touch-Screen Lapt~ 4.3 out of 5 ~ $349.~
#> 3 "2019_Dell Inspiron 15.6\" HD High Performance Laptop,~ 3.9 out of 5 ~ $300.~
#> 4 "Dell Inspiron 15.6” Touch Screen Intel Core i3 128GB ~ 4.1 out of 5 ~ $365.~
#> 5 "Dell Inspiron 15.6 Inch HD Touchscreen Flagship High ~ 4.1 out of 5 ~ $443.~
#> 6 "Dell Latitude E5450 14in Laptop, Intel Core i5-5300U ~ 3.7 out of 5 ~ $225.~
#> 7 "New ! Dell Inspiron i3583 15.6\" HD Touch-Screen Lapt~ 4.5 out of 5 ~ $445.~
#> 8 "Dell 14inch High Performance Latitude 3340 Notebook, ~ 4.1 out of 5 ~ $199.~
#> 9 "2018 Dell Business Flagship Laptop Notebook 15.6\" HD~ 3.4 out of 5 ~ $567.~
#> 10 "Dell Latitude E6420 Laptop - HDMI - i5 2.5ghz - 4GB D~ 3.3 out of 5 ~ $178.~
#> 11 "Newest_Dell Vostro Real Business(Better Design Than I~ 3.6 out of 5 ~ $689.~
#> 12 "2019 Dell Inspiron 15 6\" HD Touchscreen Flagship Pre~ 4.0 out of 5 ~ $348.~
#> 13 "2019 Dell Inspiron 14\" Laptop Computer| 10th Gen Int~ 4.1 out of 5 ~ $328.~
#> 14 "2019 Dell Inspiron 15 6\" HD Touchscreen Flagship Pre~ 4.4 out of 5 ~ $442.~
#> 15 "Fast Dell Latitude E5470 HD Business Laptop Notebook ~ 4.5 out of 5 ~ $288.~
Created on 2020-02-17 by the reprex package (v0.3.0)
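The string-cleaning part of the pipeline above can be tried without any network access. The following is a minimal sketch in base R that applies the same regular expressions to a few made-up strings. The sample rows, the variable name sample_rows, and the helper name parse_result_rows are all hypothetical; real html_text() output from Amazon will differ and changes whenever the page layout changes.

```r
# Hypothetical text of four result rows, shaped roughly like html_text() output.
sample_rows <- c(
  "\n  Dell Latitude E6430 Laptop    \n   4.0 out of 5 stars    \n  112    \n  $201.99$201.99   \n  Ships to Canada  ",
  "\n  Sponsored    \n  Dell Inspiron 15    \n  4.3 out of 5 stars   \n  20   \n  $349.00$349.00  ",
  "\n  Dell XPS 13    \n  no rating or price listed  ",
  "\n  Dell Inspiron 3583    \n  4.3 out of 5 stars   \n  57   \n  $349.00$349.00  "
)

# Hypothetical helper: same steps as the pipeline above, without the pipe.
parse_result_rows <- function(rows) {
  x <- gsub("\n", " ", rows)
  x <- grep("5 stars", x, value = TRUE)                   # rated rows only
  x <- grep("Sponsored", x, invert = TRUE, value = TRUE)  # drop adverts
  x <- gsub("^ +", "", x)                                 # trim leading spaces
  x <- grep("[$]", x, value = TRUE)                       # rows with a price
  x <- gsub("[$][0123456789.]+[$]", "$", x)               # de-duplicate price
  parts <- strsplit(x, " {2,50}")                         # fields by wide gaps
  parts <- lapply(parts, function(p) p[p != ""])
  parts <- lapply(parts, function(p) {
    grep("Buying Choices|Ships to|in stock|new offers",
         p, invert = TRUE, value = TRUE)
  })
  # Keep model, rating and price; skip rows with fewer than four fields.
  parts <- lapply(parts, function(p) if (length(p) < 4) NULL else p[c(1, 2, 4)])
  out <- do.call(rbind, parts)
  colnames(out) <- c("Model", "Rating", "Price")
  out
}

parsed <- parse_result_rows(sample_rows)
print(parsed)
```

Only the first and last sample rows survive: the second is dropped as sponsored and the third has no rating. Because the fields are picked by position, any change in the order of the text on the page breaks the extraction, so it is worth checking the result against the live page.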