
Web Scraping with R: problem with "data.frame" function and number of rows

In brief, I want to scrape information about movies from this site. I used SelectorGadget to find the CSS selectors and wrote this code:

library(dplyr)
library(tidyverse)
library(rvest)
library(readr)
library(purrr)

link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=adventure&sort=user_rating,desc"
page = read_html(link)

film_name = page %>% html_nodes(".lister-item-header a") %>% html_text()
year = page %>% html_nodes(".text-muted.unbold") %>% html_text()
rating = page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
gross_income = page %>% html_nodes(".ghost~ .text-muted+ span") %>% html_text()
duration = page %>% html_nodes(".runtime") %>% html_text()

IMDB_Adventure_Movies_Rank = data.frame(film_name, year, rating, duration, gross_income, stringsAsFactors = FALSE)

The R console gives the following error:

Error in data.frame(film_name, year, rating, duration, gross_income, stringsAsFactors = FALSE) : 
  arguments imply differing number of rows: 50, 44

The error occurs because, on the website, 6 of the 50 films do not have their gross income reported, so gross_income has only 44 elements.

I tried the following fix, but it only pads the end of the vector with NA, so the values are not aligned and R assigns the wrong income to each film:

length(gross_income) = length(film_name)

My question is: how can I create a table in which R returns NA or NULL when a film has no income reported, instead of throwing an error?
I saw that someone else had the same problem, and the solution was to use the possibly() function from the purrr package. However, I am new to R, and I don't understand that answer or how to use possibly().

I would suggest that you consider using imdbapi. imdbapi is a package that facilitates access to the IMDb API. You will need to acquire an API key, but the cost of that is fairly insignificant.

library("imdbapi")
res_film <-
    find_by_title("Top Gun: Maverick", api_key = "<Your API KEY>")

When working with established data sources such as Eurostat, the World Bank, or IMDb for that matter, it is advisable to rely on maintained packages and available APIs. If you scrape the site with rvest, you will have to do a lot of unnecessary work and solve problems that the API and package authors have already solved.


Alternatively, the Open Movie Database gives you free queries with a fairly high limit and offers a dedicated R package. You should likely be able to get the information you need that way at no cost.

We can get the income of the movies by reading the whole votes/gross line of each film, so that every film yields exactly one element:

link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=adventure&sort=user_rating,desc"
df = read_html(link) %>% html_nodes('#main div div.lister.list.detail.sub-list div div.lister-item-content p.sort-num_votes-visible') %>% html_text()
 [1] "\n                Votes:\n                1,766,474\n    |                Gross:\n                $377.85M\n            \n        "
 [2] "\n                Votes:\n                1,788,217\n    |                Gross:\n                $315.54M\n            \n        "
 [3] "\n                Votes:\n                2,253,349\n    |                Gross:\n                $292.58M\n            \n        "
 [4] "\n                Votes:\n                1,595,898\n    |                Gross:\n                $342.55M\n            \n        "

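This gives the votes and income text for each movie, one string per film. We can then extract the income with a regular expression; films without a reported income come back as NA.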

library(stringi)
stri_extract_first_regex(df, "(?<=\\$).*")
 [1] "377.85M" "315.54M" "292.58M" "342.55M" "6.10M"   "188.02M" "290.48M" "10.06M"  "210.61M" "322.74M" "678.82M" NA        "187.71M" "422.78M" "190.24M"
[16] "858.37M" "209.73M" "223.81M" "2.38M"   "85.16M"  "248.16M" "47.70M"  "293.00M" "415.00M" "120.54M" "191.80M" "197.17M" "309.13M" NA        "56.95M" 
[31] "44.82M"  "13.28M"  NA        NA        "1.43M"   "356.46M" "381.01M" "4.71M"   "380.84M" "402.45M" "1.23M"   "12.10M"  "44.91M"  NA        "5.01M"  
[46] "1.03M"   "5.45M"   "8.18M"   NA        "59.10M" 
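
To put the aligned values into a table, one option is to select each film's container first and then use html_node() (singular) inside it, since that returns one result per container and NA where the element is missing. The following is only a minimal sketch: the HTML is a made-up, simplified stand-in for two list items, and the class names are assumed from the selectors used in this thread, so they may not match the live page.

```r
library(rvest)
library(stringi)

# Hypothetical, simplified stand-in for two IMDb list items;
# the second film has no gross income reported.
sample_html <- '
<div class="lister-item-content">
  <h3 class="lister-item-header"><a>Film A</a></h3>
  <p class="sort-num_votes-visible">Votes: 1,766,474 | Gross: $377.85M</p>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a>Film B</a></h3>
  <p class="sort-num_votes-visible">Votes: 25,000</p>
</div>'

# One node per film
items <- read_html(sample_html) %>% html_nodes(".lister-item-content")

# html_node() yields exactly one value per film, NA when the element is absent
film_name <- items %>% html_node(".lister-item-header a") %>% html_text(trim = TRUE)
votes_gross <- items %>% html_node(".sort-num_votes-visible") %>% html_text()

# Keep only the amount that follows the dollar sign; NA when there is none
gross_income <- stri_extract_first_regex(votes_gross, "(?<=\\$)[0-9.]+M")

movies <- data.frame(film_name, gross_income, stringsAsFactors = FALSE)
print(movies)
```

Because every column is built from the same set of containers, all vectors have the same length and each income stays on the row of its own film.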
