
Unable to scrape multiple pages using phantomjs in R

I'm trying to scrape county assessor data on historic property values for multiple parcels. The data are generated by javascript at https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=07101001, where 'parid' in the url is the 8-digit parcel number, and I'm driving the page with phantomjs controlled by RSelenium. I have a dataframe containing the parcel numbers I'm interested in (a few hundred in total), but have been attempting to make the code work on a small subset of those:

parcel_nums
[1] "00905101" "00905102" "00905103" "00905104" "00905105" 
[6] "00905106" "00905107" "00905108" "00905201" "00905202"

I need to scrape the data in the table generated on the page for each parcel and preserve it. I have chosen to write the page to a file "output.htm" and then parse the file afterwards. My code is as follows:

require(plyr)
require(rvest)
require(RSelenium)
require(tidyr)
require(dplyr)

parcel_nums <- prop_attr$APN[1:10]  #Vector of parcel numbers
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()

result <- remDr$phantomExecute("var page = this;
                            var fs = require(\"fs\");
                            page.onLoadFinished = function(status) {
                            var file = fs.open(\"output.htm\", \"w\");
                            file.write(page.content);
                            file.close();
                            };")

for (i in 1:length(parcel_nums)){
    url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=", 
        parcel_nums[i], sep = "")
    Sys.sleep(5)

    remDr$navigate(url)

    dat <- read_html("output.htm", encoding = "UTF-8") %>% 
        html_nodes("table") %>% 
        html_table(header = TRUE)
    df <- data.frame(dat)

    # assign the parcel number to this parcel's rows
    df$apn <- parcel_nums[i]
    # on the first iteration initialize the final data frame; on subsequent iterations append to it
    if (i == 1) {
        parcel_data <- df
    } else {
        parcel_data <- rbind(parcel_data, df)
    }
}
remDr$close()
pJS$stop()

This works perfectly for one or two iterations of the loop, but then it suddenly stops preserving the data generated by the javascript and produces an error:

 Error in `$<-.data.frame`(`*tmp*`, "apn", value = "00905105") : 
 replacement has 1 row, data has 0 

which is due to the parser not locating the table in the output file, because the table is no longer being preserved. I'm unsure whether the problem is with the implementation I've chosen or with some idiosyncrasy of this particular site. I am not familiar with JavaScript, so the code snippet above is taken from an example I found. Thank you for any assistance.
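For reference, the failure mode can be made explicit with a guard inside the loop, placed right after the html_table() call (the guard is an illustration only, not part of my code above):

    # when the table is not preserved, html_nodes() finds nothing and
    # html_table() returns an empty list; that empty list is what later
    # makes df$apn <- ... fail with "replacement has 1 row, data has 0"
    if (length(dat) == 0) {
        warning(paste("no table captured for parcel", parcel_nums[i]))
        next
    }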

The answer below worked perfectly. I also moved the Sys.sleep(5) to after the $navigate call to give the page time to load the javascript. The loop now runs to completion.

require(plyr)
require(rvest)
require(RSelenium)
require(tidyr)
require(dplyr)
require(XML)    # for htmlParse() and readHTMLTable()

parcel_nums <- prop_attr$APN[1:10]  #Vector of parcel numbers
#pJS <- phantom()
remDr <- remoteDriver()
remDr$open()

# #result <- remDr$executeScript("var page = this;
#                                var fs = require(\"fs\");
#                                page.onLoadFinished = function(status) {
#                                var file = fs.open(\"output.htm\", \"w\");
#                                file.write(page.content);
#                                file.close();
#                                };")
#length(parcel_nums)
for (i in 1:length(parcel_nums)){
  url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=", 
               parcel_nums[i], sep = "")
  remDr$navigate(url)
  Sys.sleep(5)  # moved after navigate so the javascript has time to render the table
  doc <- htmlParse(remDr$getPageSource()[[1]])
  doc_t <- readHTMLTable(doc, header = TRUE)$`NULL`
  df <- data.frame(doc_t)

  # assign the parcel number to this parcel's rows
  df$apn <- parcel_nums[i]
  # on the first iteration initialize the final data frame; on subsequent iterations append to it
  if (i == 1) {
    parcel_data <- df
  } else {
    parcel_data <- rbind(parcel_data, df)
  }
}
remDr$close()

This gave me a solution, and it should work with phantomjs too. Please test it and reply.
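For example, an untested sketch of the same getPageSource() approach driven by phantomjs (only the driver setup changes; the parid value is just an example):

require(RSelenium)
require(XML)

pJS <- phantom()                                  # start phantomjs through RSelenium
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()

url <- "https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=00905101"
remDr$navigate(url)
Sys.sleep(5)                                      # give the javascript time to render
doc <- htmlParse(remDr$getPageSource()[[1]])      # parse the rendered DOM directly
doc_t <- readHTMLTable(doc, header = TRUE)$`NULL` # the page's table has no name attribute

remDr$close()
pJS$stop()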

I lost an entire day trying to solve a similar issue, so I'm sharing what I learned to help others save time and nerves.

I think the key is understanding that opening, navigating, and other browsing actions through the remote driver need time to complete, so we have to wait before trying to read or do anything on the pages we expect to scrape.

My problems were solved when I introduced Sys.sleep(5) after the remDr$navigate(url) call.
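In code, the working order is simply (using the same objects as in the accepted answer above):

remDr$navigate(url)
Sys.sleep(5)    # wait AFTER navigating, so the javascript has time to render
doc <- htmlParse(remDr$getPageSource()[[1]])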

A neater solution seems to be inserting remDr$setTimeout(type = "page load", milliseconds = 10000), as suggested at how to check if page finished loading in RSelenium, but I haven't tested it yet.
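If it works as described there, the fixed sleep could be replaced like this (untested sketch):

# let the driver itself wait up to 10 seconds for each page load
remDr$setTimeout(type = "page load", milliseconds = 10000)
remDr$navigate(url)    # should now block until the load finishes, no Sys.sleep() needed
doc <- htmlParse(remDr$getPageSource()[[1]])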
