
Web scraping with R and rvest when the web page has JavaScript-rendered content

I am attempting to scrape the webpage https://www.filmweb.no/kinotoppen/ for the title and other information listed under each movie. For other webpages, a couple of lines with html_nodes() and html_text() have been enough, using SelectorGadget to pick the CSS selectors for the things I wanted, like this:

library(rvest)

# Fetch the static HTML and try to extract the movie titles
html <- read_html("https://www.filmweb.no/kinotoppen/")
title <- html %>% 
  html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>% 
  html_text()

However, when I run those lines on this webpage, I only get an empty character vector. On inspecting the webpage further, I see that it loads its content with JavaScript. I tried using html_nodes("script") together with the V8 package to run the scripts, but to no avail. I'm also unsure which scripts to run, so I tried each of them, like this:

library(V8)
scripts <- html %>% html_nodes("script") %>% html_text()
ct <- v8()
ct$eval(scripts[3])

Is there an easier way, in general, to get the webpage into a form where I can just use rvest? I do not know anything about JavaScript.

Here's what it would look like using RSelenium to get the page to load.

library(rvest)
library(RSelenium)
remDr <- rsDriver(browser='chrome', port=4444L)
brow <- remDr[["client"]]
brow$open()
brow$navigate("https://www.filmweb.no/kinotoppen/")
h <- brow$getPageSource()
h <- read_html(h[[1]])
h %>% html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>% 
  html_text()
# [1] "Spider-Man: No Way Home"              "Clifford: Den store røde hunden"      "Lise & Snøpels - Venner for alltid"  
# [4] "Familien Voff - alle trenger en venn" "Nightmare Alley"                      "Snødronningen"                       
# [7] "Scream"                               "Bergman Island"                       "Trøffeljegerne fra Piemonte"         
# [10] "Encanto"                             
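
One thing the snippet above leaves to chance is timing: the titles are rendered by JavaScript, so the page source may be read before they exist. A minimal sketch of a polling helper follows, in case that happens. The name wait_for and its timeout and interval arguments are made up for illustration and are not part of RSelenium; the commented usage assumes the brow and remDr objects created above.

```r
# Poll `probe()` until it returns a non-empty result or `timeout` seconds pass.
# `wait_for` is a hypothetical helper, not an RSelenium function.
wait_for <- function(probe, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- probe()
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) stop("Timed out waiting for content")
    Sys.sleep(interval)
  }
}

# Usage with the `brow` and `remDr` objects from above (needs a live browser):
# wait_for(function() brow$findElements(using = "css selector",
#                                       value = ".Kinotoppen_MovieTitle__2MFbT"))
# h <- read_html(brow$getPageSource()[[1]])
# brow$close()          # close the browser session when done
# remDr$server$stop()   # stop the Selenium server started by rsDriver()
```

Polling for the element, rather than sleeping for a fixed time, keeps the script fast when the page loads quickly and still fails loudly if the content never appears.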


The data is retrieved dynamically from a GraphQL query. You can replicate that query to get a JSON response containing all the desired data.

In this case I chose to use httr2 and the newish native pipe operator |> (available since R 4.1.0).

For how to pipe the headers vector, I looked at the solution given by @MrFlick.

library(httr2)

headers = c(
  'Accept' = 'application/json',
  'Referer' = 'https://www.filmweb.no/',
  'Content-Type' = 'application/json',
  'User-Agent' = 'Mozilla/5.0'
)

params = list(
  'query' = 'query($date:String,$chartType:String,$max:Int){movieQuery{getMovieChart(date:$date,chartType:$chartType,max:$max){chartType periodStart periodEnd movieChartItem{pos posPrev admissions admissionsPrev admissionsToDate weeksOnList movie{title mainVersionId premiere poster{name versions{width height url}}}}}}}',
  'variables' = '{"date":"2022-02-04","chartType":"weekend","max":1000}'
)

data <- request("https://skynet.filmweb.no/MovieInfoQs/graphql/") |> 
  (\(x) req_headers(x, !!!headers))() |> 
  req_url_query(!!!params) |> 
  req_perform() |> 
  resp_body_json()
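
To get from the parsed JSON to something tabular, here is a rough sketch. It assumes the response mirrors the shape of the query, nested under a top-level data element, as GraphQL responses usually are; the real response may differ, for example if getMovieChart comes back as an array. The mock_resp list and its values are made up so the example runs without a network call, and chart_to_df is a hypothetical helper name.

```r
# Made-up stand-in for the parsed response; values are illustrative only.
mock_resp <- list(data = list(movieQuery = list(getMovieChart = list(
  chartType = "weekend",
  movieChartItem = list(
    list(pos = 1L, admissions = 1200L, movie = list(title = "Example Movie A")),
    list(pos = 2L, admissions = 800L,  movie = list(title = "Example Movie B"))
  )
))))

# Hypothetical helper: flatten the chart items into a data frame.
# Assumes every item carries pos, admissions, and movie$title.
chart_to_df <- function(resp) {
  items <- resp$data$movieQuery$getMovieChart$movieChartItem
  data.frame(
    pos        = vapply(items, function(x) as.integer(x$pos), integer(1)),
    title      = vapply(items, function(x) x$movie$title, character(1)),
    admissions = vapply(items, function(x) as.integer(x$admissions), integer(1)),
    stringsAsFactors = FALSE
  )
}

chart <- chart_to_df(mock_resp)
print(chart)
```

With the real response, chart_to_df(data) would be the equivalent call, provided the structure matches the assumption above. vapply is used instead of sapply so that a missing or oddly typed field raises an error rather than silently changing the column type.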
