Loop multiple webpages in R

Sorry, this might be too involved a question to ask here. I'm trying to reproduce the Hack Session for the NYTimes Dialect Map Visualisation, located here. I'm OK in the beginning, but then I run into a problem when I try to scrape multiple pages.

To save people from having to reproduce info from the slides, this is what I have so far:

Create URL addresses:

mainURL <- 'http://www4.uwm.edu/FLL/linguistics/dialect/staticmaps/'
stateURL <- 'states.html'
url  <-  paste0(mainURL, stateURL)

Download and Parse

library(RCurl)  # provides getURL
library(XML)    # provides htmlTreeParse and xpathSApply
tmp <- getURL(url)
tmp <- htmlTreeParse(tmp, useInternalNodes = TRUE)

Extract page addresses and save to subURL

subURL  <-  unlist(xpathSApply(tmp, '//a[@href]', xmlAttrs))

Remove links that aren't state pages

subURL  <- subURL[-(1:4)]
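
Dropping the first four entries by position only works while the page layout stays the same. A minimal sketch of a pattern-based alternative, assuming the state links all look like state_XX.html (the links vector below is made up for illustration and is not copied from the real page):

```r
# Hypothetical hrefs standing in for the scraped links; the real page may differ.
links <- c("index.html", "maps.html", "state_AK.html", "state_AL.html")

# Keep only entries that start with "state_" and end with ".html".
is_state <- grepl("^state_.*\\.html$", links)
state_links <- links[is_state]

print(state_links)
```

If the assumption about the naming scheme holds, this keeps working even when the number of non-state links at the top of the page changes.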

The problem begins for me on slide 24 in the original. The slides say that the next step is to loop over the list of states and read the body of each question. Of course, we also need to save the name of each state in the process. The loop is initialized with the following code:

survey <- vector(length(subURL), mode = "list")
i = 1
stateNames <-  rep('', length(subURL))

Underneath this code, the slide says that survey is a list where information about every state is saved. I'm a little puzzled about how that is the case, since survey is indeed a list with a length of 51, but every element is NULL. I'm also puzzled by what the i is doing here (and this becomes important later). Still, I can follow what the code is doing, and I assumed that the list would get populated later.
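
For what it's worth, a list full of NULLs is what a preallocated list looks like in R: the slots exist but stay empty until a loop assigns into them. A small sketch of that pattern, using made-up state codes and a placeholder string in place of the real scraped content (pages and results are illustrative names, not from the slides):

```r
# Hypothetical stand-in for subURL; only three entries to keep it short.
pages <- c("state_AK.html", "state_AL.html", "state_AR.html")

# Preallocate: a list of length 3 whose elements are all NULL.
results <- vector("list", length(pages))
empty_before <- all(vapply(results, is.null, logical(1)))

# The index i walks over every position and fills the matching slot.
for (i in seq_along(pages)) {
  results[[i]] <- paste("content of", pages[i])  # placeholder for real scraping
}

print(results[[1]])
```

Under this reading, i = 1 is just the starting value of a counter that a loop is expected to advance.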

It's really the next slide where I get confused. It shows how the URL contains the name of each state, using Alaska as an example:

Create URL for the first state and assign to suburl

 suburl  <- subURL[1]

Remove state_ from suburl

 stateName <- gsub('state_','',suburl)

Remove .html from stateName

 stateName <- gsub('.html', '', stateName, fixed = TRUE)

So far, so good. I can do this for each state individually. However, I can't figure out how to turn this into a loop that would apply to all the states. The slide only has the following code:

 stateNames[i] <- stateName

This is where I am stuck. The previous slide assigned 1 to i, so the only thing this does is get the name for Alaska (AK), while every other element is "" (as one would expect, given how stateNames was defined previously).
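
One reading of the slides is that the lines on these pages are meant to sit inside a loop over i, with i = 1 being only the starting value. A sketch of that reading, using a made-up three-element subURL so that it runs without the website (the for loop itself is an assumption, since the slides as quoted do not show it):

```r
# Hypothetical stand-in for the scraped subURL vector.
subURL <- c("state_AK.html", "state_AL.html", "state_AR.html")
stateNames <- rep("", length(subURL))

for (i in seq_along(subURL)) {
  suburl <- subURL[i]                                      # current state's page
  stateName <- gsub("state_", "", suburl, fixed = TRUE)    # drop the prefix
  stateName <- gsub(".html", "", stateName, fixed = TRUE)  # drop the extension
  stateNames[i] <- stateName                               # store at position i
}

print(stateNames)
```

With the loop in place, stateNames[i] <- stateName fills one element per pass instead of only the first.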

I did try the following:

 stateNames <- gsub('state_', '', subURL, fixed = TRUE)
 stateNames <- gsub('.html', '', stateNames, fixed = TRUE)

This doesn't quite work, because the length of this vector is 51, but the length of the one shown above is only 1. (Later, I want each state to have its own name, not for every state to carry the same vector of 51 names.) Moreover, I didn't know what to do with the stateNames[i] <- stateName command.
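
Since gsub is vectorised, a vector of 51 names is most likely the same result a loop would build one element at a time, and indexing with stateNames[i] then picks out a single state's name. A short sketch with made-up input (the three hrefs and the use of names(survey) are assumptions for illustration):

```r
# Hypothetical stand-in for the scraped subURL vector.
subURL <- c("state_AK.html", "state_AL.html", "state_AR.html")

# One call strips both the "state_" prefix and the ".html" suffix from every element.
stateNames <- gsub("^state_|\\.html$", "", subURL)

# A single state's name is just one element of the vector.
i <- 2
print(stateNames[i])

# The names can also label the slots of the preallocated list.
survey <- vector("list", length(subURL))
names(survey) <- stateNames
```

Each element of survey then has exactly one name, which seems to be what the later slides rely on.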

Anyways, I kept working through to the end (both with the original and with my modification), hoping that things would eventually right themselves. At times I got the same output as in the presentation, but eventually things just broke. I think there is an additional problem later on in the slides (an object is subsetted that didn't exist before), but I'm guessing the trouble also stems from a problem that occurs much earlier.

Anyways, I know this is a pretty involved question, so I apologize if it is inappropriate for this site. I'm just stuck.

I believe I got this to work. See the gist or see here for the solution.
