ReadLines using multiple sources in R

I'm trying to use readLines() to scrape .txt files hosted by the Census and compile them into one .txt/.csv file. I can read individual pages with it, but I'd like a function that calls readLines() on every url listed in a csv.

My knowledge of loops and writing functions isn't great, but here are the pieces of code I'm trying to combine:

Here is how I build my matrix of urls, which I can add to and/or turn into a csv for the function to read:

MasterList <- matrix( data = c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt"), ncol = 1)
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
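For clarity, the sprintf() call is vectorized over MasterList, so urls ends up holding the three full addresses:

urls
# [1] "http://www2.census.gov/econ/bps/Place/Northeast%20Region/ne0001y.txt"
# [2] "http://www2.census.gov/econ/bps/Place/Northeast%20Region/ne0002y.txt"
# [3] "http://www2.census.gov/econ/bps/Place/Northeast%20Region/ne0003y.txt"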

Here's the function (riddled with problems) I started writing:

Scrape <- function(x){
  for (i in x){
      URLS <- i
      headers <- readLines(URLS, n=2)
      bod <- readLines(URLS)
      bodclipped <- bod[-c(1,2,3)]
      Totes <- c(headers, bodclipped)
      write(Totes, file = "[Directory]/ScrapeTest.txt")  # overwrites the same file on every pass
      return(head(Totes))  # return() exits on the first pass, so urls 2 and 3 are never read
  }
}

The idea is that running Scrape(urls) would concatenate the contents of the 3 urls in my "urls" matrix/csv, with the Census' built-in headers removed from every file except the first (hence headers vs. bodclipped).

I've tried applying lapply() to "urls" with readLines, but that only seems to give me text from the last url rather than all three, and each text file still has its headers, which I could just remove and then reattach at the end.
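I suspect I need to capture what lapply() returns and handle the headers explicitly, something like this untested sketch:

# capture the list that lapply() returns: one character vector per url
all_lines <- lapply(urls, readLines)
# keep the first file whole (with its headers) and strip the first three
# lines from every remaining file before stacking everything together
Totes <- c(all_lines[[1]], unlist(lapply(all_lines[-1], function(x) x[-c(1, 2, 3)])))
# writeLines() writes one element per line
writeLines(Totes, "[Directory]/ScrapeTest.txt")

but I'm not sure this is the right approach.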

Any help would be appreciated!

As all of these documents are csv files with 38 columns, you can combine them very easily using:

MasterList <- c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt")
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)

raw_dat <- lapply(urls, read.csv, skip = 3, header = FALSE)
dat <- do.call(rbind, raw_dat)

What happens here, and where is the looping? lapply() creates a list with 3 (= length(urls)) entries and populates each of them with the result of read.csv(urls[i], skip = 3, header = FALSE). So raw_dat is a list of 3 data.frames containing your data, and do.call(rbind, raw_dat) binds them together.
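To make that explicit, this illustrative snippet shows what do.call() expands to for a list of length 3:

# do.call() passes the list elements to rbind() as individual arguments,
# so with three urls this call ...
dat <- do.call(rbind, raw_dat)
# ... is the same as writing out
dat <- rbind(raw_dat[[1]], raw_dat[[2]], raw_dat[[3]])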

The header row seems somehow broken; that's why I use skip = 3, header = FALSE, which is equivalent to your bod[-c(1,2,3)].

If all the scraped data fits into memory, you can combine it this way and finally write it to a file using:

write.csv(dat, "[Directory]/ScrapeTest.txt")
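One optional tweak (my addition, not something you asked for): write.csv() includes row names as an extra first column by default, so you may want to drop them:

# row.names = FALSE avoids an extra index column in the output file
write.csv(dat, "[Directory]/ScrapeTest.txt", row.names = FALSE)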
