
Scraping HTML (or JavaScript) Table

I'm trying to scrape a table from a website, but without success. I have done this many times before and it always worked, but this time the table seems to be generated by some sort of JavaScript, and the parsing doesn't work at all. Can someone help me?

The page is here.

I already tried the usual:

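library(XML)
# pagetree is the page parsed into an internal document, e.g. with htmlParse()
tables <- getNodeSet(pagetree, "//table[@id='live-player-home-offensive-grid']")
readHTMLTable(tables[[1]], as.data.frame = TRUE, header = FALSE)
# or
xpathSApply(pagetree, "//table[@id='live-player-home-offensive-grid']", xmlValue)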

The problem is that the data is not in the table but in the JavaScript code: it is only put into the table when the page is rendered in your browser.
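One quick way to confirm this diagnosis is to inspect the raw page source, before any JavaScript has run. The sketch below does so on a made-up fragment: `sample_html` is a hypothetical stand-in for the downloaded page, not its real content, and it only assumes that the table arrives empty while the data sits in a script variable.

```r
# Hypothetical fragment standing in for the downloaded page source:
# the table is present but empty, and the data sits in a script variable.
sample_html <- paste0(
  "<table id='live-player-home-offensive-grid'><tbody></tbody></table>",
  "<script>var initialData = [[1, 'Fulham'], [2, 'Arsenal']];</script>"
)

# Does the raw source contain any table cells at all?
has_cells <- grepl("<td", sample_html, fixed = TRUE)

# Does the raw source define the JavaScript variable?
has_script_data <- grepl("initialData =", sample_html, fixed = TRUE)

has_cells        # FALSE: nothing for an HTML parser to read
has_script_data  # TRUE: the data lives in the script instead
```

If the same two checks give the same answers on the real source, XPath queries against the table will keep coming back empty, however they are written.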

I do not see a clean way of extracting it, short of using JavaScript tools or web browser controllers (Zombie.js, CasperJS, PhantomJS, Selenium).

The following reads the HTML page as a string and looks for the definition of the initialData variable, which apparently contains the data. It returns the data in the same hard-to-use format: a list of lists of lists of lists of lists of lists of lists.

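library(RCurl)
library(RJSONIO)
url <- "http://www.whoscored.com/Matches/411429/LiveStatistics/England-Premier-League-2010-2011-Fulham-Arsenal"
# Download the page source as a single string
html <- getURL(url)
# Keep only the JavaScript literal assigned to initialData (up to the first ";")
initial_data <- gsub("^.*?initialData = (.*?);.*", "\\1", html)
# JSON requires double-quoted strings, so swap the single quotes
initial_data <- gsub("'", '"', initial_data)
# Parse the literal into nested R lists
data <- fromJSON(initial_data)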
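Once the data is parsed, the nested lists still have to be reshaped into something tabular. The following is only a sketch on an invented miniature: `nested`, its shape, and the column names `id`, `name` and `shots` are all assumptions, since the actual layout of initialData is not shown here and would need to be explored first (for example with `str(data, max.level = 2)`).

```r
# Hypothetical miniature of a parsed result: an unnamed list of rows,
# each row itself an unnamed list of values.
nested <- list(
  list(1, "Player A", 3),
  list(2, "Player B", 0)
)

# Assumed column names, purely for illustration.
col_names <- c("id", "name", "shots")

# Name the values in each row and turn the row into a one-row data frame.
rows <- lapply(nested, function(row) {
  as.data.frame(setNames(row, col_names), stringsAsFactors = FALSE)
})

# Stack the one-row data frames into a single table.
players <- do.call(rbind, rows)
players
```

With the real data, the same `lapply` / `do.call(rbind, ...)` pattern would be applied at whichever nesting level holds one record per player.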
