
r - Reading from zip and matching with values from dataframe column

I'm trying to make one dataframe by reading two datasets, but the methodology I'm using is extremely slow: it can take as long as 10 hours to read and process 600 MB of data. I believe there must be a much faster way to do this, but I cannot see what is slowing down the process. The following is a reproducible example that presents the steps.

Required packages:

library(tidyverse)

The first set is a .csv file. A sample can be recreated with the following:

info <- data.frame(identification = c("a", "b", "c", "d", "e"), attr = c(0:4))
info %>% write_csv("folder/info.csv") 

The second is a zip file. A sample can be recreated with the following:

a <- data.frame(var = c(41:50), val = c(31:40))
a %>% write_csv("folder/file/a_df.csv")  

b <- data.frame(var = c(41:50), val = c(31:40))
b %>% write_csv("folder/file/b_df.csv")

c <- data.frame(var = c(41:50), val = c(31:40))
c %>% write_csv("folder/file/c_df.csv")

d <- data.frame(var = c(41:50), val = c(31:40))
d %>% write_csv("folder/file/d_df.csv")

e <- data.frame(var = c(41:50), val = c(31:40))
e %>% write_csv("folder/file/e_df.csv")

files2zip <- dir('folder/file', full.names = TRUE)
# write the archive to folder/testZip.zip, the path read back below
zip(zipfile = 'folder/testZip', files = files2zip)

The methodology I use is the following:

data1 <- read_csv("folder/info.csv")

read_from_zip <- function(identification) {
  # entry name exactly as stored in the archive (the samples are named <id>_df.csv)
  fn <- paste0("folder/file/", identification, "_df.csv")
  zip_file <- "./folder/testZip.zip"
  # extract the single entry; unzip() returns the path of the extracted file
  id_2_zip <- unzip(zip_file, files = fn)
  read_csv(id_2_zip)
}

df <- data1 %>% group_by(identification) %>% nest() %>%
  mutate(trj = map(identification, read_from_zip)) 

df <- df %>% select(identification, trj) %>% unnest()

I'd guess something like this would work:

tmpdir <- tempfile()
dir.create(tmpdir)

A convenience vector, if you desire:

filesvec <- paste0(letters[1:5], '.csv')

Note that these names need to be verbatim as listed in the zip file, including any leading directories. (You can use junkpaths=TRUE for unzip(), or system('unzip -j ...'), to drop the leading paths.) In the past, I've created this vector of filenames from a quick call to unzip(zipfile, list=TRUE) and grepping the output. This way, if you are careful, you will (a) always know before extraction that a file is missing, and (b) not cause an exception within unzip() or a non-zero return code from system('unzip ...'). You might do:

# unzip(list=TRUE) returns a data.frame; the entry names are in its Name column
filesvec <- unzip(zipfile, list=TRUE)$Name
filesvec <- filesvec[ grepl("\\.csv$", filesvec) ]
# some logic to ensure you have some or all of what you need
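The "some logic" placeholder above could look roughly like the sketch below. It is only an illustration: `listing` is a hand-built stand-in for what `unzip(zipfile, list=TRUE)` returns (so the example runs without an archive), and `check_entries` and `wanted` are hypothetical names, not part of the original answer.

```r
# Stand-in for unzip(zipfile, list = TRUE), which returns a data frame
# whose Name column holds the entry names stored in the archive.
listing <- data.frame(
  Name = c("folder/file/a_df.csv", "folder/file/b_df.csv", "folder/file/notes.txt"),
  Length = c(120, 120, 15),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the wanted entries into present and missing,
# considering only the .csv entries of the archive.
check_entries <- function(listing, wanted) {
  available <- listing$Name[grepl("\\.csv$", listing$Name)]
  list(
    present = intersect(wanted, available),
    missing = setdiff(wanted, available)
  )
}

# Entry names we expect, built the same way as the sample files were named.
wanted <- paste0("folder/file/", c("a", "b", "c"), "_df.csv")
checked <- check_entries(listing, wanted)

# Report anything missing before extraction instead of failing inside unzip().
if (length(checked$missing) > 0) {
  message("Not in the archive: ", paste(checked$missing, collapse = ", "))
}

# Extract only what is actually there.
filesvec <- checked$present
```

Whether a missing entry should stop the run or merely be reported depends on the data; the sketch only reports it.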

And then one of the following:

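# either: R's built-in extraction
unzip(zipfile, files=filesvec, exdir=tmpdir)
# or: the command-line utility (collapse=" " turns the pieces into one command string)
system(paste(c("unzip -d", shQuote(c(tmpdir, zipfile, filesvec))), collapse=" "))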

From here, you can access the files with:

alldata <- sapply(file.path(tmpdir, filesvec), read.csv, simplify=FALSE)

where the names of the list are the full file paths passed to sapply() (so they include the temp directory and any leading path), and the contents should all be data.frames.
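To get back to the single data frame the question asks for, the list can be stacked and matched against the `identification` column. The following is a sketch under assumptions: `alldata` and `info` are small hand-built stand-ins, the paths are invented, and the `_df.csv` suffix is taken from the sample files in the question.

```r
# Stand-in for the named list produced by sapply(..., read.csv, simplify = FALSE);
# the names are full paths to the extracted files (paths here are made up).
alldata <- list(
  "/tmp/unz/folder/file/a_df.csv" = data.frame(var = 41:42, val = 31:32),
  "/tmp/unz/folder/file/b_df.csv" = data.frame(var = 43:44, val = 33:34)
)

# Stand-in for the contents of info.csv.
info <- data.frame(
  identification = c("a", "b", "c"),
  attr = 0:2,
  stringsAsFactors = FALSE
)

# Recover the identifier from each file name: drop the directories,
# then drop the "_df.csv" suffix.
ids <- sub("_df\\.csv$", "", basename(names(alldata)))

# Tag every data frame with its identifier, then stack them.
tagged <- Map(
  function(d, id) data.frame(identification = id, d, stringsAsFactors = FALSE),
  alldata, ids
)
trj <- do.call(rbind, tagged)
rownames(trj) <- NULL

# Inner join: keeps only identifiers present in both info and the archive.
combined <- merge(info, trj, by = "identification")
```

With tidyverse loaded, `dplyr::bind_rows(tagged)` followed by `dplyr::inner_join(info, trj, by = "identification")` should give the same rows; base R is used here only to keep the sketch dependency-free.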

When done, whether you clean up the temp files or not depends on how fastidious you are about temp files. Your OS might clean them up for you after some time. If you are tight on space or just paranoid, you could do a cleanup with:

ign <- sapply(file.path(tmpdir, filesvec), unlink) 
unlink(tmpdir, recursive=TRUE) # remove the temp dir we created

(You could just use the second command, but in case you are using a different temp-directory method, I thought I'd be careful.)
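If you would rather not remember the cleanup at all, one option is to tie it to a function's exit with `on.exit()`. This is a minimal sketch, not part of the answer above: `with_temp_dir` is a hypothetical helper, and the file written inside it merely stands in for the extracted archive contents.

```r
# Hypothetical helper: create a scratch directory, hand it to `work`,
# and remove it when the function exits, even if `work` fails.
with_temp_dir <- function(work) {
  tmpdir <- tempfile("unzipped_")
  dir.create(tmpdir)
  on.exit(unlink(tmpdir, recursive = TRUE), add = TRUE)
  work(tmpdir)
}

result <- with_temp_dir(function(dir) {
  # Stand-in for unzip(zipfile, files = filesvec, exdir = dir):
  # write one small CSV into the scratch directory.
  path <- file.path(dir, "a.csv")
  write.csv(data.frame(var = 41:43, val = 31:33), path, row.names = FALSE)
  # Read everything into memory before the directory disappears.
  list(data = read.csv(path), dir = dir)
})

# The data survive; the scratch directory does not.
nrow(result$data)
dir.exists(result$dir)
```

The important design point is that all reading happens inside `work`, since nothing in the directory is available afterwards.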
