
Quickly load a subset of rows from a data.frame saved with `saveRDS()`

With a large file (1GB) created by saving a large data.frame (or data.table), is it possible to very quickly load a small subset of rows from that file?

(Extra for clarity: I mean something as fast as mmap, i.e. the runtime should be approximately proportional to the amount of memory extracted, but constant in the size of the total dataset. "Skipping data" should have essentially zero cost. This can be very easy, or impossible, or something in between, depending on the serialization format.)

I hope that the R serialization format makes it easy to skip forward through the file to the relevant portions.

Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires uncompressing everything from the beginning?

 saveRDS(object, file = "", ascii = FALSE, version = NULL,
         compress = TRUE, refhook = NULL)

But I'm hoping binary (`ascii = FALSE`) and uncompressed (`compress = FALSE`) might allow something like this: use mmap on the file, then quickly skip to the rows and columns of interest?
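A minimal sketch of writing such a file (`df` standing in for the large data.frame):

 # write the object as an uncompressed binary RDS, so the bytes on disk
 # are the plain serialize() stream
 saveRDS(df, file = "data.rds", ascii = FALSE, compress = FALSE)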

I'm hoping it has already been done, or that there is another format (reasonably space-efficient) that allows this and is well supported in R.

I've used things like gdbm (from Python) and even implemented a custom system in Rcpp for a specific data structure, but I'm not satisfied with any of this.

After posting this, I worked a bit with the package ff (CRAN) and am very impressed with it (not much support for character vectors, though).
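For example, a minimal sketch of the kind of ff usage I mean (assuming a CSV source; `read.csv.ffdf` and ffdf indexing are the package's documented entry points):

library(ff)
# the data stays on disk in ff's own format instead of being loaded into RAM
ffdf <- read.csv.ffdf(file = "big.csv", header = TRUE)
# indexing pulls only the requested rows into memory, as a regular data.frame
shortdf <- ffdf[5:10, ]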

"Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires uncompressing everything from the beginning?"

Indeed. For a short explanation, let's take a dummy compression method as a starting point:

Given the input `AAAAVVBABBBC`, gzip would do something like: `4A2VBA3BC`

Obviously you can't extract all of the A's from the file without reading it all, as you can't tell whether there's an A at the end or not.
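Base R's `rle()` implements exactly this dummy run-length scheme, if you want to see it in action:

> rle(strsplit("AAAAVVBABBBC", "")[[1]])
Run Length Encoding
  lengths: int [1:6] 4 2 1 1 3 1
  values : chr [1:6] "A" "V" "B" "A" "B" "C"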

For the other question, "loading part of a saved file", I can't see a solution off the top of my head. Using write.csv and read.csv (or fwrite and fread from the data.table package) with the skip and nrows parameters could be an alternative.
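A sketch of that alternative (note that when `skip` jumps past the header line, fread assigns default column names unless you pass `col.names`):

library(data.table)
fwrite(df, "big.csv")
# read only rows 6 to 9 of the file; the header line is skipped along with rows 1-5
shortdf <- fread("big.csv", skip = 5, nrows = 4)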

In any case, using any function on a file that has already been read means loading the whole file into memory before filtering, which is no faster than reading the file and then subsetting in memory.

You may craft something in Rcpp, taking advantage of streams to read the data without loading it all in memory, but reading and parsing each entry before deciding whether it should be kept won't give you a real gain in throughput.
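A sketch of what such an Rcpp helper could look like (my illustration, not an existing function; it streams the file line by line and keeps only a line range):

# hedged sketch: stream with std::getline, never holding the whole file in memory
Rcpp::cppFunction('
CharacterVector read_lines_range(std::string path, int from, int to) {
  std::ifstream in(path.c_str());
  std::string line;
  std::vector<std::string> out;
  for (int i = 1; i <= to && std::getline(in, line); ++i)
    if (i >= from) out.push_back(line);
  return wrap(out);
}', includes = "#include <fstream>")
shortlines <- read_lines_range("big.csv", 6, 9)  # hypothetical file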

saveRDS will save a serialized version of the data, for example:

> myvector <- c("1","2","3")
> serialize(myvector,NULL)
 [1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 01 31 00 04 00 09 00 00 00 01 32 00 04 00 09 00 00
[47] 00 01 33

It is of course parsable, but that means reading it byte by byte according to the format.
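To make that concrete, here is a sketch that walks those exact bytes by hand (the stream is big-endian XDR; the offsets match the dump above):

con <- rawConnection(serialize(myvector, NULL))
readBin(con, "raw", n = 2)                           # magic bytes "X\n" (58 0a)
readBin(con, "integer", n = 3, endian = "big")       # format, writer and reader versions
readBin(con, "integer", n = 1, endian = "big")       # SEXP flags: 16 = character vector
n <- readBin(con, "integer", n = 1, endian = "big")  # vector length: 3
for (i in seq_len(n)) {
  readBin(con, "integer", n = 1, endian = "big")         # per-element CHARSXP flags
  len <- readBin(con, "integer", n = 1, endian = "big")  # length of this string
  print(rawToChar(readBin(con, "raw", n = len)))         # the string: "1", "2", "3"
}
close(con)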

On the other hand, you could write the data as CSV (or with write.table for more complex data) and use an external tool before reading it, something along these lines:

z <- tempfile()
write.table(df, z, row.names = FALSE)
shortdf <- read.table(text = system(paste("awk 'NR > 5 && NR < 10 { print }'", z), intern = TRUE))

You'll need a linux system with awk, which is able to parse millions of lines in a few milliseconds, or a Windows-compiled version of awk, obviously.

The main advantage is that awk can filter each line of data on a regex or some other condition.
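For instance, reusing the file written above, a pipe() connection with a hypothetical regex filter (keep the header plus every line starting with "99"):

# awk keeps line 1 (the header) and any line whose content starts with 99
shortdf <- read.table(pipe(paste("awk 'NR == 1 || /^99/'", z)), header = TRUE)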

As a complement, for the data.frame case: a data.frame is more or less a list of vectors (in the simple case), and this list is saved sequentially. So if we have a data frame like:

> str(ex)
'data.frame':   3 obs. of  2 variables:
 $ a: chr  "one" "five" "Whatever"
 $ b: num  1 2 3
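(For reference, a data frame with this structure can be built as below; `stringsAsFactors = FALSE` is assumed so that column `a` stays character.)

> ex <- data.frame(a = c("one", "five", "Whatever"), b = c(1, 2, 3), stringsAsFactors = FALSE)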

Its serialization is:

> serialize(ex,NULL)
  [1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 03 13 00 00 00 02 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 03 6f 6e 65 00 04 00 09 00
 [47] 00 00 04 66 69 76 65 00 04 00 09 00 00 00 08 57 68 61 74 65 76 65 72 00 00 00 0e 00 00 00 03 3f f0 00 00 00 00 00 00 40 00 00 00 00 00 00
 [93] 00 40 08 00 00 00 00 00 00 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00 10 00 00 00 02 00 04 00 09 00 00 00 01
[139] 61 00 04 00 09 00 00 00 01 62 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00 00 02 80 00 00
[185] 00 ff ff ff fd 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 0a 64 61 74 61
[231] 2e 66 72 61 6d 65 00 00 00 fe

Translated to ASCII, to give an idea:

X
    one five    Whatever?ð@@    names   a   b       row.names
ÿÿÿý    class   
data.frameþ

We have the header of the file, then the header of the list, then each vector composing the list. As we have no clue how much space a character vector will take, we can't skip to arbitrary data: we have to parse each header (the bytes just before the text data give its length). Even worse, to get at the corresponding integers we would have to go to the integer vector header, which can't be located without parsing each character header and summing their lengths.

So in my opinion, crafting something is possible, but it will probably not be much quicker than reading the whole object, and it will be brittle with respect to the save format (R already has 3 formats for saving objects).

Some reference: the serialization formats are documented in the "R Internals" manual.

The same serialize output, in ASCII format (more readable, to see how it is organized):

> write(rawToChar(serialize(ex,NULL,ascii=TRUE)),"")
A
2
197123
131840
787
2
16
3
262153
3
one
262153
4
five
262153
8
Whatever
14
3
1
2
3
1026
1
262153
5
names
16
2
262153
1
a
262153
1
b
1026
1
262153
9
row.names
13
2
NA
-3
1026
1
262153
5
class
16
1
262153
10
data.frame
254
