
Read specific file from tar.gz in R

I have a large tar.gz file (>2GB) from which I want to read a specific .dat file in R without unzipping the original tar.gz file.

I tried to follow this post as follows:

p35_data_path <- "~/P35_fullset.tar.gz" 
file.exists(p35_data_path) #TRUE

# Try to read in foldera/class1/mydata.dat from the archive
mydata <- read.table(unz(p35_data_path
                       , "foldera/class1/mydata.dat"))

When I run the above, read.table fails with this error:

Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
  cannot open zip file '~/P35_fullset.tar.gz'

The "~/P35_fullset.tar.gz" file exists, and the specific file foldera/class1/mydata.dat definitely exists inside it.

Could anyone please assist in rectifying this?

library(data.table)  # needed for as.data.table()

da <- untar(Tarfile, files = NULL, list = TRUE, exdir = ".", compressed = "gzip")  # list the files inside the tar archive

da <- as.data.table(da)  # store the listing as a data.table

Then filter the listing however you like. I filtered mine and saved the resulting file names in a column called Name:

g <- c(da$Name)  # collect the filtered file names

untar(Tarfile, files = g, list = FALSE, exdir = "exportRQA", compressed = "gzip")  # finally, extract only those files into exportRQA
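As a rough sketch of one possible filter, assuming you want every member whose name ends in .dat: the demo archive, the directory layout, and the names work, members, wanted and out_dir are made up for illustration and are not from the original post.

```r
# Build a small demo .tar.gz in a temporary directory (illustrative data only).
work <- tempfile("tar_filter_")
dir.create(file.path(work, "foldera", "class1"), recursive = TRUE)
dir.create(file.path(work, "foldera", "class2"), recursive = TRUE)
writeLines("1 2 3", file.path(work, "foldera", "class1", "mydata.dat"))
writeLines("4 5 6", file.path(work, "foldera", "class2", "other.dat"))
writeLines("notes", file.path(work, "foldera", "class1", "readme.txt"))
old_wd <- setwd(work)
tar("demo.tar.gz", files = "foldera", compression = "gzip", tar = "internal")
setwd(old_wd)

archive_path <- file.path(work, "demo.tar.gz")

# List the members without extracting anything.
members <- untar(archive_path, list = TRUE)

# Keep only the members whose names end in ".dat".
wanted <- members[grepl("\\.dat$", members)]

# Extract just those members into a separate directory.
out_dir <- file.path(work, "exportRQA")
dir.create(out_dir)
untar(archive_path, files = wanted, exdir = out_dir)

extracted <- list.files(out_dir, recursive = TRUE)
print(extracted)
```

Filtering the plain character vector returned by untar(list = TRUE) avoids the data.table dependency; any other condition (a folder prefix, a fixed set of names) can replace the grepl() call.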

With the archive package you can read a particular CSV file from inside an archive without extracting it first:

library(archive)
library(readr)
read_csv(archive_read("~/P35_fullset.tar.gz", file = 1), col_types = cols())

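(Adjust file = as appropriate: it can be the member's position in the archive or its path, e.g. file = "foldera/class1/mydata.dat".)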

You should be able to unpack the archive with base R's untar():

p35_data_path <- "~/P35_fullset.tar.gz" 
file.exists(p35_data_path) #TRUE

# Read in foldera/class1/mydata.dat from the .tar.gz file
untar(p35_data_path, "foldera/class1/mydata.dat")  # this extracts the file from archive
mydata <- read.table("foldera/class1/mydata.dat")  # so you can read it

By default the file is extracted into the current working directory, keeping its foldera/class1/ path. You can choose a different destination with the exdir argument. See the untar() documentation for more info.
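A minimal sketch of extracting into a scratch directory via exdir, so the working directory stays clean. It builds its own small demo archive so that it can run anywhere; the archive contents and the names work, member and out_dir are assumptions for illustration, not the asker's data.

```r
# Create a tiny demo .tar.gz that mimics the layout in the question.
work <- tempfile("tar_demo_")
dir.create(file.path(work, "foldera", "class1"), recursive = TRUE)
write.table(data.frame(x = 1:3, y = c(2.5, 3.5, 4.5)),
            file.path(work, "foldera", "class1", "mydata.dat"),
            row.names = FALSE)
old_wd <- setwd(work)
tar("demo.tar.gz", files = "foldera", compression = "gzip", tar = "internal")
setwd(old_wd)

archive_path <- file.path(work, "demo.tar.gz")
member <- "foldera/class1/mydata.dat"

# Extract only the wanted member into a temporary directory.
out_dir <- tempfile("extracted_")
dir.create(out_dir)
untar(archive_path, files = member, exdir = out_dir)

# Read the extracted copy, then delete it.
mydata <- read.table(file.path(out_dir, member), header = TRUE)
unlink(out_dir, recursive = TRUE)

print(mydata)
```

Only the requested member is written to disk, so even with a multi-gigabyte archive the temporary copy is just the size of that one file.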

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 