Web scraping with R and rvest

I am experimenting with rvest to learn web scraping with R. I am trying to replicate the Lego example for a couple of other sections of the page, using SelectorGadget to identify the CSS selectors.

I pulled the example from the RStudio tutorial. With the code below, 1 and 2 work, but 3 does not.

library(rvest)
# read_html() replaces html(), which is deprecated in newer rvest versions
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

# 1 - Get rating
lego_movie %>% 
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

# 2 - Grab actor names
lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

# 3 - Get Meta Score 
lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text() %>%
  as.numeric()

I'm not really up to speed on all of the pipes and associated code, so there are probably some newfangled tools to do this, but given that the answer above gets you to "83/100", you can do something like this:

as.numeric(unlist(strsplit("83/100", "/")))[1]
[1] 83

Which I guess would look something like this with the pipes:

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE) %>%
  strsplit(., "/") %>%
  unlist(.) %>%
  as.numeric(.) %>% 
  head(., 1)

[1] 83
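
The same split-and-take-first logic can be tried offline on a plain string, without requesting the page. This is only a minimal base R sketch: the helper name `first_score_part` is hypothetical, and the input is the raw string quoted in this thread.

```r
# Hypothetical helper: keep the part before the "/" and convert it to a number.
first_score_part <- function(x) {
  # trimws() drops the leading space and trailing newline from the scraped text
  parts <- strsplit(trimws(x), "/", fixed = TRUE)[[1]]
  as.numeric(parts[1])
}

first_score_part(" 83/100\n")
# [1] 83
```

Using `fixed = TRUE` treats "/" as a literal character rather than a regular expression, which is all that is needed here.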

Or as Frank suggested, you could evaluate the expression "83/100" with something like:

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE) %>%
  parse(text = .) %>%
  eval(.)
[1] 0.83
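
One caveat with parse() and eval() is that they run whatever R expression the scraped text happens to contain. If that is a concern, the same ratio can probably be computed by splitting and dividing instead. A minimal sketch, assuming the text always has the form "number/number"; the helper name `score_ratio` is hypothetical:

```r
# Hypothetical helper: turn "83/100" into 0.83 without parse()/eval().
score_ratio <- function(x) {
  parts <- as.numeric(strsplit(trimws(x), "/", fixed = TRUE)[[1]])
  # Fail loudly if the text is not two numbers separated by "/"
  stopifnot(length(parts) == 2, !anyNA(parts), parts[2] != 0)
  parts[1] / parts[2]
}

score_ratio("83/100")
# [1] 0.83
```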

You can see that, before the conversion to numeric, it returns " 83/100\n":

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text() 
# [1] " 83/100\n"

You can use trim=TRUE to drop the surrounding whitespace, including the \n. You still can't convert the result to numeric because it contains a /:

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE) 
# [1] "83/100"

If you convert this to numeric, you will get NA with a warning, which is to be expected:

# [1] NA
# Warning message:
# In function_list[[k]](value) : NAs introduced by coercion
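
The coercion behaviour can be checked on the bare string, independent of the scraping step. A small sketch in base R; `suppressWarnings()` is used only to keep the output quiet:

```r
x <- "83/100"
# as.numeric() only accepts plain numbers, so the "/" forces an NA
result <- suppressWarnings(as.numeric(x))
is.na(result)
# [1] TRUE
```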

If you want the numeric 83 as the final answer, you can use regular expression tools like gsub to remove the / and the 100 (assuming that the full score is 100 for all movies).

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE) %>%
  gsub("100|\\/", "", .) %>%
  as.numeric()
# [1] 83
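
Note that the pattern "100|\\/" would reduce a perfect score of "100/100" to an empty string, which as.numeric() turns into NA. A variant that deletes everything from the slash onward avoids that. This is a sketch under the same assumption that the text looks like "score/total"; the helper name `score_before_slash` is hypothetical:

```r
# Hypothetical helper: drop the "/" and everything after it, then convert.
score_before_slash <- function(x) {
  as.numeric(sub("/.*", "", trimws(x)))
}

score_before_slash("83/100")
# [1] 83
score_before_slash("100/100")
# [1] 100
```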
