
ReadLines using multiple sources in R

I'm trying to use readLines() to scrape .txt files hosted by the Census and compile them into one .txt/.csv file. I am able to use it to read individual pages, but I'd like to have a function that will go out and run readLines() based on a csv of urls.

My knowledge of looping and function properties isn't great, but here are the pieces of my code that I'm trying to incorporate:

Here is how I build my matrix of urls, which I can add to and/or turn into a csv and have a function read that way.

MasterList <- matrix( data = c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt"), ncol = 1)
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)

Here's the function (riddled with problems) I started writing:

Scrape <- function(x){
  for (i in x){
      URLS <- i
      headers <- readLines(URLS, n=2)
      bod <- readLines(URLS)
      bodclipped <- bod[-c(1,2,3)]
      Totes <- c(headers, bodclipped)
      # write() reopens and overwrites the same file on every pass (no append = TRUE)
      write(Totes, file = "[Directory]/ScrapeTest.txt")
      # return() inside the loop exits the function on the first iteration
      return(head(Totes))
  }
}

The idea is that I would run Scrape(urls), which would compile the data from the 3 urls in my "urls" matrix/csv, with the Census's built-in headers removed from every file except the first one (hence headers vs. bodclipped).

I've tried running lapply() over "urls" with readLines(), but that only generates text for the last url rather than all three, and each text file still has its headers, which I could just remove and then reattach at the end.
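For reference, the attempt was something along these lines (a rough sketch of what I tried, using the urls vector from above):

raw <- lapply(urls, readLines)   # read each URL into one entry of a list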

Any help would be appreciated!

As all of these documents are csv files with 38 columns, you can combine them very easily using:

MasterList <- c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt")
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)

raw_dat <- lapply(urls, read.csv, skip = 3, header = FALSE)
dat <- do.call(rbind, raw_dat)

What happens here and how is this looping? The lapply function basically creates a list with 3 (= length(urls)) entries and populates each of them with read.csv(urls[i], skip = 3, header = FALSE). So raw_dat is a list of 3 data.frames containing your data, and do.call(rbind, raw_dat) binds them together.
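If it helps to see the looping spelled out, the lapply call is roughly equivalent to the explicit loop below (a sketch only, assuming the same urls vector defined above):

raw_dat <- vector("list", length(urls))   # one slot per URL
for (i in seq_along(urls)) {
  # read file i, dropping the 3 broken header lines, with no column names
  raw_dat[[i]] <- read.csv(urls[i], skip = 3, header = FALSE)
}
dat <- do.call(rbind, raw_dat)            # stack the 3 data.frames row-wise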

The header row seems somehow broken, that's why I use skip = 3, header = FALSE, which is equivalent to your bod[-c(1,2,3)].

If all the scraped data fits into memory, you can combine it this way and in the end write it into a file using:

write.csv(dat, "[Directory]/ScrapeTest.txt")
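Note that write.csv() adds a column of row names by default; if you don't want that in the output file, you can pass row.names = FALSE (same placeholder path as above):

write.csv(dat, "[Directory]/ScrapeTest.txt", row.names = FALSE)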
