
R - Reading from a zip file and matching with values from a dataframe column

I'm trying to build one dataframe by reading two datasets, but the method I'm using is extremely slow: it can take as long as 10 hours to read and process 600 MB of data. I believe there must be a much faster way to do this, but I cannot see what is slowing the process down. Below is a reproducible example of the steps.

Required packages:

library(tidyverse)

The first set is a .csv file. A sample can be recreated with the following:

dir.create("folder/file", recursive = TRUE)  # create the directories used below

info <- data.frame(identification = c("a", "b", "c", "d", "e"), attr = 0:4)
info %>% write_csv("folder/info.csv")

The second is a zip file. A sample can be recreated with the following:

for (id in c("a", "b", "c", "d", "e")) {
  data.frame(var = 41:50, val = 31:40) %>%
    write_csv(paste0("folder/file/", id, "_df.csv"))
}

files2zip <- dir("folder/file/", full.names = TRUE)
zip(zipfile = "folder/testZip.zip", files = files2zip)  # create the archive where the reading code expects it
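
As a quick sanity check, you can list the paths stored inside the archive; these are the exact strings that unzip(files = ...) expects later:

unzip("folder/testZip.zip", list = TRUE)$Name
# e.g. "folder/file/a_df.csv" "folder/file/b_df.csv" ...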

The method I use is the following:

data1 <- read_csv("folder/info.csv")

read_from_zip <- function(identification) {
  # name of the file inside the archive -- must match the stored path verbatim
  fn <- paste0("folder/file/", identification, "_df.csv")
  # extract that single file from the zip, then read it
  zip_file <- "folder/testZip.zip"
  id_2_zip <- unzip(zip_file, files = fn)
  read_csv(id_2_zip)
}

df <- data1 %>% group_by(identification) %>% nest() %>%
  mutate(trj = map(identification, read_from_zip))  # one unzip() call per identification

df <- df %>% select(identification, trj) %>% unnest(trj)

I'd guess something like this would work:

tmpdir <- tempfile()
dir.create(tmpdir)

A convenience vector, if you like:

filesvec <- paste0("folder/file/", letters[1:5], "_df.csv")

Note that these names need to be "verbatim" as listed in the zipfile, including any leading directories (here, folder/file/). You can use junkpaths=TRUE with unzip(), or system('unzip -j ...'), to drop the leading paths. In the past, I've built this vector of filenames from a quick call to unzip(zipfile, list=TRUE) and grepping the output. That way, if you are careful, you will (a) always know before extraction whether a file is missing, and (b) not trigger an error from unzip() or a non-zero return code from system('unzip ...'). You might do:

filesvec <- unzip(zipfile, list=TRUE)$Name   # list=TRUE returns a data.frame; Name holds the stored paths
filesvec <- filesvec[ grepl("\\.csv$", filesvec) ]
# some logic to ensure you have some or all of what you need

And then one of the following:

unzip(zipfile, files=filesvec, exdir=tmpdir)
system(paste(c("unzip -d", shQuote(tmpdir), shQuote(zipfile), shQuote(filesvec)), collapse=" "))
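
Either way, the archive is opened and decompressed only once, rather than once per row of data1 as in the group_by()/map() approach above; that single change is where essentially all of the speedup comes from.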

From here, you can access the files with:

alldata <- sapply(file.path(tmpdir, filesvec), read.csv, simplify=FALSE)

where the names of the list are the filenames (including the leading path), and the contents should all be data.frames.
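
From there, a minimal sketch of assembling the single dataframe the question asks for, assuming the tidyverse is loaded as in the question and the filenames follow its <identification>_df.csv pattern (the basename()/sub() step is my addition):

combined <- bind_rows(alldata, .id = "path") %>%
  # recover the id from the filename, e.g. ".../folder/file/a_df.csv" -> "a"
  mutate(identification = sub("_df\\.csv$", "", basename(path))) %>%
  select(-path) %>%
  left_join(read_csv("folder/info.csv"), by = "identification")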

When done, whether you clean up the temp files or not depends on how fastidious you are about temp files; your OS might clean them up for you after some time. If you are tight on space or just cautious, you can do a cleanup with:

ign <- sapply(file.path(tmpdir, filesvec), unlink) 
unlink(tmpdir, recursive=TRUE) # remove the temp dir we created

(You could just use the second command, but in case you are using a different temp-directory method, I thought I'd be careful.)
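
If you wrap all of this in a function, a common idiom (a sketch under the same assumptions, not part of the code above) is to register the cleanup with on.exit() right after creating the temp directory, so it runs even if one of the reads fails:

read_all_from_zip <- function(zipfile, pattern = "\\.csv$") {
  tmpdir <- tempfile()
  dir.create(tmpdir)
  on.exit(unlink(tmpdir, recursive = TRUE), add = TRUE)  # cleanup runs even on error
  filesvec <- unzip(zipfile, list = TRUE)$Name
  filesvec <- filesvec[grepl(pattern, filesvec)]
  unzip(zipfile, files = filesvec, exdir = tmpdir)
  sapply(file.path(tmpdir, filesvec), read.csv, simplify = FALSE)
}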
