R: Read single file from within a tar.gz directory
Consider a tar.gz file of a directory which contains a lot of individual files.
From within R I can easily extract the names of the individual files with this command:
fileList <- untar("my_tar_dir.tar.gz", list = TRUE)
Using only R, is it possible to directly read/load a single one of those files into R (i.e. without first unpacking and writing the file to disk)?
It is possible, but I don't know of any clean implementation (one may exist). Below is some very basic R code that should work in many cases (e.g. file names, including the full path inside the archive, must be shorter than 100 characters). In a way, it just re-implements "untar" in an extremely crude way, but in such a way that it can point to the desired file inside a gzipped archive.
The first problem is that you should only read a gzipped file from the start. Using "seek()" to re-position the file pointer to the desired file is, unfortunately, erratic in a gzipped file.
ParseTGZ <- function(archname) {
  # open tgz archive
  tf <- gzfile(archname, open = "rb")
  on.exit(close(tf))
  fnames <- list()
  offset <- 0
  nfile <- 0
  while (TRUE) {
    # go to beginning of entry
    # never use "seek" to re-locate in a gzipped file!
    if (seek(tf) != offset) readBin(tf, what = "raw", n = offset - seek(tf))
    # read file name (first 100 bytes of the header, padded with '\0')
    fName <- rawToChar(readBin(tf, what = "raw", n = 100))
    if (nchar(fName) == 0) break
    nfile <- nfile + 1
    fnames <- c(fnames, fName)
    # the file data starts right after the 512-byte header
    attr(fnames[[nfile]], "offset") <- offset + 512
    # read size, first skip 24 bytes (file permissions etc)
    # again, we only use readBin, not seek()
    readBin(tf, what = "raw", n = 24)
    # file size is encoded as a length 12 octal string,
    # with the last character being '\0' (so 11 actual characters)
    sz <- readChar(tf, nchars = 11)
    # convert octal string to number of bytes
    sz <- sum(as.numeric(strsplit(sz, "")[[1]]) * 8^(10:0))
    attr(fnames[[nfile]], "size") <- sz
    # cat(sprintf('entry %s, %i bytes\n', fName, sz))
    # go to the next entry
    # don't forget entry header (=512)
    offset <- offset + 512 * (ceiling(sz / 512) + 1)
  }
  # return a named list of character strings with "offset" and "size" attributes
  names(fnames) <- unlist(fnames)
  return(fnames)
}
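To make the offset arithmetic in the parser concrete, here is a small worked example. It relies on the fact that a tar archive is laid out in 512-byte blocks: each entry has a 512-byte header followed by its data, padded up to a multiple of 512. The size string and the variable names below are made up for illustration.

```r
# Size field as it might appear in a tar header: 11 octal digits (made-up value)
size_field <- "00000001750"

# Same conversion as in the parser: each digit times a power of 8
digits <- as.numeric(strsplit(size_field, "")[[1]])
size_bytes <- sum(digits * 8^(10:0))   # 1*512 + 7*64 + 5*8 + 0*1

# Assume this entry's header starts at the very beginning of the archive
header_offset <- 0
data_offset <- header_offset + 512     # data follows the 512-byte header

# Data is padded to whole 512-byte blocks, plus one block for the header
next_header <- header_offset + 512 * (ceiling(size_bytes / 512) + 1)

cat(size_bytes, data_offset, next_header, "\n")
```

So a 1000-byte file occupies two data blocks, and the next entry's header begins three blocks after the current one.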
ParseTGZ() gives you the exact position and length of every file in the tar.gz archive. The next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your files fit into memory.
extractTGZ <- function(archfile, filename) {
  # this function returns a raw vector
  # containing the desired file
  fp <- ParseTGZ(archfile)
  offset <- attributes(fp[[filename]])$offset
  fsize <- attributes(fp[[filename]])$size
  gzf <- gzfile(archfile, open = "rb")
  on.exit(close(gzf))
  # jump to the byte position, don't use seek()
  # may be a bad idea on really large archives...
  readBin(gzf, what = "raw", n = offset)
  # now read the data into a raw vector
  result <- readBin(gzf, what = "raw", n = fsize)
  result
}
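The forward jump in extractTGZ() reads everything before the desired file into one raw vector, which is what makes it questionable on really large archives. One possible mitigation, sketched here under the assumption that discarding the data in fixed-size chunks is acceptable, is a small helper that advances the connection piece by piece. The name skip_bytes and its chunk parameter are hypothetical and not part of the original answer.

```r
# Hypothetical helper: advance a connection by n bytes by reading and
# discarding at most `chunk` bytes at a time (no seek() involved)
skip_bytes <- function(con, n, chunk = 65536L) {
  while (n > 0) {
    got <- length(readBin(con, what = "raw", n = min(n, chunk)))
    if (got == 0L) stop("reached end of stream before skipping all bytes")
    n <- n - got
  }
  invisible(NULL)
}

# Demo on a small gzip-compressed temporary file holding the bytes 0..255
tmp <- tempfile(fileext = ".gz")
gz <- gzfile(tmp, open = "wb")
writeBin(as.raw(0:255), gz)
close(gz)

gz <- gzfile(tmp, open = "rb")
skip_bytes(gz, 200, chunk = 64L)       # small chunk to exercise the loop
next_bytes <- readBin(gz, what = "raw", n = 3)
close(gz)
print(next_bytes)
```

Inside extractTGZ(), a call such as skip_bytes(gzf, offset) could stand in for the single large readBin() used to skip ahead.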
Now, finally:
ff <- rawConnection(extractTGZ("myarchive", "myfile"))
Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
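As a small illustration of what can be done with such a connection, the sketch below builds a raw vector by hand, standing in for the result of extractTGZ(), and parses it with read.csv() without touching the disk. The CSV content is invented for the example.

```r
# Stand-in for the raw vector returned by extractTGZ(); the CSV text is made up
raw_bytes <- charToRaw("id,value\n1,10.5\n2,20.25\n")

# Wrap the bytes in an in-memory connection and parse them as CSV
ff <- rawConnection(raw_bytes)
df <- read.csv(ff)
close(ff)   # the connection was opened by rawConnection(), so close it ourselves

print(df)
```

Other connection-based readers such as readLines() or scan() can be used on ff in the same way.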
One can read a csv from within an archive using library(archive) as follows (this should be a lot more elegant than the currently accepted answer; the package also supports all major archive formats, 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' & 'xz', and it works on all platforms):
library(archive)
library(readr)
read_csv(archive_read("my_tar_dir.tar.gz", file = 1), col_types = cols())
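If the position of the file inside the archive is not known in advance, it may be more convenient to select it by path. The sketch below assumes that archive() returns a listing with a path column and that archive_read() accepts a path for its file argument, as described in the package documentation. It builds a throwaway archive first so that it is self-contained; the directory and file names are made up, and the code is skipped when the packages are not installed.

```r
if (requireNamespace("archive", quietly = TRUE) &&
    requireNamespace("readr", quietly = TRUE)) {
  # Build a small demo archive (all names here are invented for illustration)
  work <- tempfile("tgz_demo_")
  dir.create(file.path(work, "demo_dir"), recursive = TRUE)
  write.csv(data.frame(x = 1:3, y = c("a", "b", "c")),
            file.path(work, "demo_dir", "first.csv"), row.names = FALSE)
  old_wd <- setwd(work)
  tar("demo.tar.gz", files = "demo_dir", compression = "gzip", tar = "internal")
  setwd(old_wd)
  tgz <- file.path(work, "demo.tar.gz")

  # List the entries, then pick the member by its path inside the archive
  entries <- archive::archive(tgz)
  member <- entries$path[grepl("first\\.csv$", entries$path)]

  # Read that member straight from the archive, without unpacking to disk
  df <- readr::read_csv(archive::archive_read(tgz, file = member),
                        col_types = readr::cols())
  print(df)
}
```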