Sorry, this might be too involved a question to ask here. I'm trying to reproduce the Hack Session for the NYTimes Dialect Map Visualisation, located here. I'm OK in the beginning, but then I run into a problem when I try to scrape multiple pages.
To save people from having to reproduce info from the slides, this is what I have so far:
# Create URL addresses
mainURL <- 'http://www4.uwm.edu/FLL/linguistics/dialect/staticmaps/'
stateURL <- 'states.html'
url <- paste0(mainURL, stateURL)
# Download and parse (getURL is from RCurl, htmlTreeParse from XML)
library(RCurl)
library(XML)
tmp <- getURL(url)
tmp <- htmlTreeParse(tmp, useInternalNodes = TRUE)
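(Just as a sanity check on my end, not something from the slides, I confirm the parse produced an HTML/XML document object before moving on:)

# Quick check that the download and parse worked (my own check)
class(tmp)   # I expect an "HTMLInternalDocument" / "XMLInternalDocument" object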
# Extract page addresses and save to subURL
subURL <- unlist(xpathSApply(tmp, '//a[@href]', xmlAttrs))
# Remove pages that aren't states' names
subURL <- subURL[-(1:4)]
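(Again just my own check, not from the slides: at this point I look at the cleaned-up vector to make sure only the state pages are left.)

# Inspect the remaining links (my own check)
length(subURL)   # I get 51 here (50 states plus DC)
head(subURL)     # entries look like 'state_XX.html'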
The problem begins for me on slide 24 in the original. The slides say that the next step is to loop over the list of states and read the body of each question. Of course, we also need to save the name of each state in the process. The loop is initialized with the following code:
survey <- vector(length(subURL), mode = "list")
i = 1
stateNames <- rep('', length(subURL))
Underneath this code, the slide says that survey is a list where information about every state is saved. I'm a little puzzled here about how that is the case, since survey is indeed a list with a length of 51, but every element is NULL. I'm also puzzled by what the i is doing here (and this becomes important later). Still, I can follow what the code is doing, and I assumed that the list would get populated later.
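(My working assumption, which may well be wrong, is that this is just the usual pre-allocate-then-fill pattern, where the empty list gets populated element by element inside a later loop. A toy example of what I mean, entirely my own and not from the slides:)

# Toy example of pre-allocating a list and filling it inside a loop
result <- vector(3, mode = "list")   # starts as three NULL elements
for (i in 1:3) {
  result[[i]] <- i^2                 # each element is filled on its iteration
}
result                               # now a list containing 1, 4, 9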
It's really the next slide where I get confused. There, the slides show how the URL contains the name of each state, using Alaska as an example:
# Create URL for the first state and assign to suburl
suburl <- subURL[1]
# Remove state_ from suburl
stateName <- gsub('state_','',suburl)
# Remove .html from stateName
stateName <- gsub('.html','',stateName)
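(A side note from my own tinkering, not from the slides: since '.' is a regex wildcard, I also tried the same two substitutions with fixed = TRUE, which I believe is a slightly safer way to match the literal strings:)

# Same cleanup with literal (non-regex) matching -- my own variation
stateName <- gsub('state_', '', suburl, fixed = TRUE)
stateName <- gsub('.html', '', stateName, fixed = TRUE)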
So far, so good. I can do this for each state individually. However, I can't figure out how to turn this into a loop that would apply to all the states. The slide only has the following code:
stateNames[i] <- stateName
This is where I am stuck. The previous slide assigned 1 to i, so the only thing this does is get the name for Alaska (AK), while every other element stays "" (as one would expect, given how stateNames was defined previously).
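(My best guess, and it is only a guess, is that slides 24-25 are meant to be read as the body of a single for loop over subURL, reusing mainURL, survey, and stateNames from above, roughly like the sketch below. The survey[[i]] <- page line is my own placeholder for wherever the slides actually store each state's data:)

# My guess at how the pieces fit together -- a sketch, not the slides' code
for (i in seq_along(subURL)) {
  # Build and download/parse the page for state i
  suburl  <- subURL[i]
  fullURL <- paste0(mainURL, suburl)
  page    <- htmlTreeParse(getURL(fullURL), useInternalNodes = TRUE)

  # Placeholder: store whatever the slides extract from each state's page
  survey[[i]] <- page

  # Strip 'state_' and '.html' to recover the state abbreviation
  stateName     <- gsub('state_', '', suburl)
  stateName     <- gsub('.html', '', stateName)
  stateNames[i] <- stateName
}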
I did try the following:
stateNames <- gsub('state_', '', subURL)
stateNames <- gsub('.html', '', stateNames)
This doesn't quite work, because the length of this vector is 51, but the length of the one shown above is only 1. (Later, I want each state to have its own name, not for all the states to share the same vector of 51 names.) Moreover, I didn't know what to do with the stateNames[i] <- stateName command.
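(My half-formed idea, and again this is only a guess, was that if stateNames is built all at once with the vectorized gsub calls above, then the per-iteration stateNames[i] <- stateName line might not be needed at all, and the names could simply be attached to the list at the end:)

# My guess at an alternative to the per-iteration assignment
names(survey) <- stateNames   # label each list element with its state's abbreviation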
Anyways, I kept working through to the end (both with the original and with my modification), hoping that things would eventually right themselves (and at times I got the same results as the presentation), but eventually things just broke. I think there is an additional problem later on in the slides (an object is subsetted that didn't exist before), but I'm guessing that it stems from a problem that occurs much earlier.
Anyways, I know this is a pretty involved question, so I apologize if it is inappropriate for this site. I'm just stuck.