
How to scrape a live JavaScript webpage in R?

I would like to scrape the play by play from http://stats.statbroadcast.com/statmonitr/?id=107165 . The link takes you to the "Split Box" tab. I am interested in scraping the play by play tab as well as the home stats and visitor stats tabs. One problem is that the URL never changes, no matter which tab you switch to. With SelectorGadget, the CSS selector for the main content of every tab is also the same: "#stats". I am a novice at web scraping. Most of the time I can scrape a static HTML page with the rvest package, but I am lost as to how to proceed with a page driven by JavaScript. I have heard of JSON, but I am not sure how to deal with all the tabs sharing the same URL.

My main goal is to scrape the play by play, home stats, and visitor stats tabs while the game is live.

Any help would be much appreciated. Please let me know if I should provide more info.

You can use RSelenium to do that as follows:

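library(RSelenium)
# Start a local Selenium server. Note: startServer() belongs to older
# RSelenium versions; newer releases replace it with rsDriver().
RSelenium::startServer()
# Connect to the server and open a browser session
remDr <- remoteDriver()
remDr$open()
# Load the game page
remDr$navigate("http://stats.statbroadcast.com/statmonitr/?id=107165")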

A Firefox window should now open, which you can browse as normal. doc <- remDr$getPageSource() gives you the source code of the page as it is currently rendered in that window. You can use rvest to scrape this source as follows:

library(rvest)
# getPageSource() returns a list; its first element is the HTML string
doc <- remDr$getPageSource()[[1]]
current_doc <- read_html(doc)
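To pull the tab contents out of the parsed page, something like the following may work. It is only a minimal sketch: the HTML string is a made-up stand-in for the real page source, and it assumes the data sits in a table inside the "#stats" container mentioned in the question, which has not been checked against the actual site.

```r
library(rvest)

# Made-up stand-in for remDr$getPageSource()[[1]]; the real markup will differ.
doc <- '<html><body><div id="stats">
  <table>
    <tr><th>Time</th><th>Play</th></tr>
    <tr><td>19:42</td><td>Jumper made</td></tr>
    <tr><td>19:10</td><td>Defensive rebound</td></tr>
  </table>
</div></body></html>'

current_doc <- read_html(doc)

# "#stats" is the container the question found with SelectorGadget.
stats_node <- html_element(current_doc, "#stats")

# Assumption: the tab content is an HTML table; convert the first one
# inside the container to a data frame.
plays <- html_table(html_element(stats_node, "table"))
print(plays)
```

If the real tab is not built from a table, html_elements() with a more specific selector followed by html_text2() would be the analogous route.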

If you want to automate the "browsing", you can, e.g., navigate to the "Play by Play" page as follows:

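# Locate the "Play by Play" tab button by its CSS selector
webElem <- remDr$findElement(using = "css selector", value = "#bb_b6")
# Left-click the element directly (button id 1 in remDr$click() is the middle button)
webElem$clickElement()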

At the end, close the remote driver and shut down the Selenium server:

#shutdown
remDr$close()
browseURL("http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer")

For more details see: https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html

Edit: current_doc captures the website as it is at the moment you execute doc <- remDr$getPageSource()[[1]] . It is NOT a real-time link to the page. It is a one-time snapshot.
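Because each call only yields a snapshot, following a live game means asking for the page source again at intervals. A minimal sketch of such a polling loop is below. The helper name poll_snapshots, the interval, and the fake source used for the demonstration are all illustrative assumptions; with a real session you would pass function() remDr$getPageSource()[[1]] instead.

```r
# Capture `n` snapshots, waiting `interval` seconds between them.
# `get_source` is any zero-argument function returning the current HTML
# as a string, e.g. function() remDr$getPageSource()[[1]]
poll_snapshots <- function(get_source, n = 3, interval = 10) {
  snapshots <- vector("list", n)
  for (i in seq_len(n)) {
    snapshots[[i]] <- list(time = Sys.time(), html = get_source())
    if (i < n) Sys.sleep(interval)  # no need to wait after the last one
  }
  snapshots
}

# Demonstration with a fake source so the sketch runs without a browser.
counter <- 0
fake_source <- function() {
  counter <<- counter + 1
  sprintf("<div id='stats'>update %d</div>", counter)
}

snaps <- poll_snapshots(fake_source, n = 3, interval = 0)
print(snaps[[3]]$html)
```

Each stored snapshot can then be parsed with read_html() separately, and consecutive snapshots compared to find new plays.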

If you want to scrape "Period I", proceed as follows: first navigate to "Play by Play" (as shown above), then wait with Sys.sleep(3) until the page has loaded, then navigate to "Period I" the same way you navigated to "Play by Play", just with a different CSS selector.

Check your remote driver (i.e. the browser window you control) to confirm that you have arrived at the "Period I" page.

Once you have arrived, execute doc <- remDr$getPageSource()[[1]] and analyse the content.
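The click, wait, click, capture sequence described above could be wrapped in a small helper, sketched below. The function name scrape_tab_path is made up, and "#PERIOD_1_SELECTOR" is a placeholder because the real selector for "Period I" is not given here. The helper clicks with clickElement() on the located element, which is one way to click in RSelenium. A fake driver is included only so the sketch can be run without a browser.

```r
# Click each selector in turn, pause after each click, then return the
# rendered HTML. `remDr` is expected to be an open RSelenium remoteDriver.
scrape_tab_path <- function(remDr, selectors, wait = 3) {
  for (sel in selectors) {
    elem <- remDr$findElement(using = "css selector", value = sel)
    elem$clickElement()
    Sys.sleep(wait)  # crude wait for the page to finish updating
  }
  remDr$getPageSource()[[1]]
}

# Real usage (second selector is a placeholder, not taken from the site):
# html <- scrape_tab_path(remDr, c("#bb_b6", "#PERIOD_1_SELECTOR"))

# Fake driver that only records which selectors were clicked.
clicked <- character(0)
fake_driver <- list(
  findElement = function(using, value) {
    list(clickElement = function() clicked <<- c(clicked, value))
  },
  getPageSource = function() {
    list(paste("page after", paste(clicked, collapse = " > ")))
  }
)

html <- scrape_tab_path(fake_driver, c("#bb_b6", "#PERIOD_1_SELECTOR"), wait = 0)
print(html)
```

A fixed Sys.sleep() is the simplest option; on a slow connection a longer wait, or a loop that checks for an expected element, may be more reliable.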
