Web scraping with R - html content

I am trying to do web scraping with R, but I am having trouble pulling HTML content from the web.

Here is an exercise I am doing on an example Amazon search page with a query in the URL.

library(XML)

#> Warning message:
#> XML package is in R 3.5.3 version 

my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"
html_page99 <- htmlTreeParse(my_url99, useInternalNode=TRUE)

#> Warning message:
#> XML content does not seem to be XML: 'https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2' 

head(html_page99)

#> Error in `[.XMLInternalDocument`(x, seq_len(n)) : 
#>  No method for subsetting an XMLInternalDocument with integer

html_page99

#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body><p>https://www.amazon.com/s?k=Dell+laptop+windows+10&amp;ref=nb_sb_noss_2</p></body></html>

But I need to scrape the above page with its full content. I mean the content with a $ sign on the left (maybe that's not the best description) and all the tags.

Without a lot of experience in scraping and manipulating strings, it is difficult to get at the data you want. As @ThomasL points out, using the XML library is not the best way forward. Here is how you could achieve the results you want using the rvest library:

library(rvest)
#> Loading required package: xml2
library(tibble)

my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"

read_html(my_url99)                                                       %>%
html_nodes(xpath = "//div[@class = 'sg-row']")                            %>% 
html_text()                                                               %>% 
{gsub("\n", " ", .)}                                                      %>% 
{grep("5 stars", ., value = TRUE)}                                        %>% 
{grep("Sponsored", ., invert = TRUE, value = TRUE)}                       %>% 
{gsub("^ +", "", .)}                                                      %>% 
{grep("[$]", ., value = TRUE)}                                            %>% 
{gsub("[$][0123456789.]+[$]", "$", .)}                                    %>% 
strsplit(" {2,50}")                                                       %>% 
lapply(function(x) x[x != ""])                                            %>% 
lapply(function(x) { grep("Buying Choices|Ships to|in stock|new offers", 
                          x, invert = TRUE, value = TRUE)              }) %>%
lapply(function(x) if(length(x) < 4) NULL else x[c(1, 2, 4)])             %>%
{do.call(rbind, .)}                                                       %>% 
`colnames<-`(c("Model", "Rating","Price"))                                %>%
as_tibble()                                                                ->
result

This gives you a three-column data frame (a tibble) with model, star rating, and price:

result
#> # A tibble: 15 x 3
#>    Model                                                   Rating         Price 
#>    <chr>                                                   <chr>          <chr> 
#>  1 "Dell Latitude E6430 Laptop WEBCAM - HDMI - Intel Core~ 4.0 out of 5 ~ $201.~
#>  2 "New ! Dell Inspiron i3583 15.6\" HD Touch-Screen Lapt~ 4.3 out of 5 ~ $349.~
#>  3 "2019_Dell Inspiron 15.6\" HD High Performance Laptop,~ 3.9 out of 5 ~ $300.~
#>  4 "Dell Inspiron 15.6” Touch Screen Intel Core i3 128GB ~ 4.1 out of 5 ~ $365.~
#>  5 "Dell Inspiron 15.6 Inch HD Touchscreen Flagship High ~ 4.1 out of 5 ~ $443.~
#>  6 "Dell Latitude E5450 14in Laptop, Intel Core i5-5300U ~ 3.7 out of 5 ~ $225.~
#>  7 "New ! Dell Inspiron i3583 15.6\" HD Touch-Screen Lapt~ 4.5 out of 5 ~ $445.~
#>  8 "Dell 14inch High Performance Latitude 3340 Notebook, ~ 4.1 out of 5 ~ $199.~
#>  9 "2018 Dell Business Flagship Laptop Notebook 15.6\" HD~ 3.4 out of 5 ~ $567.~
#> 10 "Dell Latitude E6420 Laptop - HDMI - i5 2.5ghz - 4GB D~ 3.3 out of 5 ~ $178.~
#> 11 "Newest_Dell Vostro Real Business(Better Design Than I~ 3.6 out of 5 ~ $689.~
#> 12 "2019 Dell Inspiron 15 6\" HD Touchscreen Flagship Pre~ 4.0 out of 5 ~ $348.~
#> 13 "2019 Dell Inspiron 14\" Laptop Computer| 10th Gen Int~ 4.1 out of 5 ~ $328.~
#> 14 "2019 Dell Inspiron 15 6\" HD Touchscreen Flagship Pre~ 4.4 out of 5 ~ $442.~
#> 15 "Fast Dell Latitude E5470 HD Business Laptop Notebook ~ 4.5 out of 5 ~ $288.~
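
The string-cleaning steps in the pipeline can be hard to follow in one pass, so here is a minimal base R sketch that applies the same `gsub`, `strsplit`, and `grep` calls to a single made-up string. The sample text, the variable names `raw`, `clean`, `fields`, and `rec`, and the idea that the skipped third field is a review count are all assumptions for illustration; the real page text may be laid out differently.

```r
# Hypothetical text for one search result, shaped like html_text() output
# after newlines have been replaced by spaces (this is not real Amazon data).
raw <- paste0("   Dell Example Laptop 15.6 inch      4.2 out of 5 stars",
              "     128      $299.99$299.99    Ships to Germany")

# Strip leading spaces, then collapse a duplicated price such as
# "$299.99$299.99" down to "$299.99".
clean <- gsub("^ +", "", raw)
clean <- gsub("[$][0123456789.]+[$]", "$", clean)

# Split on runs of 2 to 50 spaces and drop any empty pieces.
fields <- strsplit(clean, " {2,50}")[[1]]
fields <- fields[fields != ""]

# Drop shipping and stock notes, then keep fields 1, 2 and 4.
# Field 3 is assumed here to be a review count, which the table does not need.
fields <- grep("Buying Choices|Ships to|in stock|new offers",
               fields, invert = TRUE, value = TRUE)
rec <- fields[c(1, 2, 4)]
names(rec) <- c("Model", "Rating", "Price")
print(rec)
```

Because the split relies on runs of spaces in the page text, any change to the page layout will shift the field positions, which is why the pipeline discards entries with fewer than four fields.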

Created on 2020-02-17 by the reprex package (v0.3.0)
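
All three columns in the result are character vectors, so a likely next step is converting rating and price to numbers. The sketch below does this in base R on a small made-up data frame named `result_demo` (a hypothetical stand-in for `result`). It assumes ratings end in " out of 5 stars" and that prices may contain a thousands separator.

```r
# Hypothetical rows in the same shape as `result` (not real scraped data).
result_demo <- data.frame(
  Model  = c("Example laptop A", "Example laptop B"),
  Rating = c("4.0 out of 5 stars", "3.7 out of 5 stars"),
  Price  = c("$201.99", "$1,225.00"),
  stringsAsFactors = FALSE
)

# Keep only the leading number of the rating text.
result_demo$Rating <- as.numeric(sub(" out of 5 stars$", "", result_demo$Rating))

# Remove the dollar sign and any thousands separators before converting.
result_demo$Price <- as.numeric(gsub("[$,]", "", result_demo$Price))

str(result_demo)
```

With numeric columns you can sort or filter directly, for example `result_demo[order(result_demo$Price), ]`.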
