Load first N rows from an .RData file

I googled around but could not find an answer to my question. Functions like scan (base package) and fread (data.table package) do a very good job of reading just the first N lines from a user-specified .txt or .csv file. However, when it comes to .RData, load loads the entire file, and there is no way to specify how many values should be read from it.

I have .RData files of over 3 GB each, every one containing a single data.frame or data.table. I don't always need to load the entire file, just, say, the first 100 or 1,000 rows of the object. Is there a way to do this?

My guess is there isn't an out-of-the-box solution for this.

If we look at a sample ASCII-encoded, uncompressed RDS file, we see that it is stored in column-major order:

saveRDS(mtcars[1:5, 1:2], "testrds.rds", ascii = TRUE, compress = FALSE)

This yields the following file (with comments inserted by me):

A        ## ASCII file
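3        ## serialization format version
262146   ## R version that wrote the file (4.0.2, packed into one integer)
197888   ## minimum R version able to read it (3.5.0)
6        ## length of the native encoding name
CP1252   ## native encoding
787      ## type/flags of the data frame (a list with attributes)
2        ## length of that list: 2 columns
14       ## type code of the first column (14 = double vector)
5       ## This seems to indicate 5 items in this vector (column)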
21      ## first column starts here (but how would you know?)
21
22.8
21.4
18.7    ## first column ends here
14
5       ## Again, This seems to indicate 5 items in this vector (column)
6       ## second column starts here
6
4
6
8       ## second column ends here
1026
1
262153    # Attributes start here: names, row.names, class 
5
names                ## col names
16
2
262153
3
mpg                  ### first col name
262153
3
cyl                  ### second col name
1026
1
262153
9
row.names            ## 2nd attribute: row.names 
16
5
262153
9
Mazda\040RX4         ### first row name
262153
13
Mazda\040RX4\040Wag  ### second row name
262153
10
Datsun\040710        ### ...
262153
14
Hornet\0404\040Drive
262153
17
Hornet\040Sportabout ### last row name
1026
1
262153
5
class                ## 3rd attribute: class
16
1
262153
10
data.frame           ### value of class
254

As you can see with this simple RDS file, reading the first few rows of data still requires parsing the whole file, and would involve knowing which lines of the file to skip over. You'd also want more documentation of the RDS format than is in the R Internals doc.
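To make the column-major point concrete, here is a minimal sketch that writes a tiny data frame with easily recognizable values and checks where those values land in the ASCII file. The data frame df, its column names a and b, and the temporary path are all made up for illustration.

```r
# Hypothetical two-column data frame with values that cannot be confused
# with the type codes and lengths in the file
df <- data.frame(a = c(1001.5, 1002.5, 1003.5),
                 b = c(2001.5, 2002.5, 2003.5))

path <- tempfile(fileext = ".rds")
saveRDS(df, path, ascii = TRUE, compress = FALSE)

# In the ASCII format each value sits on its own line
file_lines <- readLines(path)

# Line numbers at which each column's values appear
pos_a <- match(c("1001.5", "1002.5", "1003.5"), file_lines)
pos_b <- match(c("2001.5", "2002.5", "2003.5"), file_lines)

# All of column a is written as one contiguous run, before any value of b
contiguous_a <- all(diff(pos_a) == 1)
a_before_b   <- max(pos_a) < min(pos_b)
```

Because each column is stored as one contiguous run, the first row of the data frame is spread across the file: one value near the start of every column's run.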

Based on this simple example, one could probably make some guesses and get a rough draft function working for RDS files you know are data frames. It would take a bit of work, though, and a lot more work if you wanted to make it robust enough to handle more complex data frames (e.g., with factor and Date columns). If you have RData files, they will have a similar but slightly more complex format, as they can hold multiple objects.

All in all, I think RDS and RData are poor choices for data you might want to partially load. You'd do better with a CSV or TSV, and then you could use the standard options you mention in your question (or vroom::vroom) to load only the data you want into memory.
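As a minimal sketch of the CSV route, the following uses only base R: it writes a data frame to a CSV once and then reads back just the first few rows with the nrows argument of read.csv. The temporary file path and the choice of mtcars as stand-in data are assumptions for illustration.

```r
# Stand-in for a large data set; the path is a throwaway temp file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)

# Read only the first 10 data rows; the first CSV column holds the row names
first_rows <- read.csv(csv_path, nrows = 10, row.names = 1)
```

With data.table, fread(csv_path, nrows = 10) should do the same job, typically faster on large files.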

What about this simple workaround?

my_data <- head(readRDS("my_data.RDS"), n = 1000)

Set the n parameter of head() to whatever you need. Note that readRDS still reads the entire file; only the first n rows are kept afterwards.

You could even write yourself a little function if you plan to do this a lot.

read_rds <- function(file, n) {
  # note file can either be a connection object or a character string containing a path
  return(head(readRDS(file), n))
} 
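Since the question is about .RData rather than .RDS, here is a comparable sketch for that case, assuming each file holds exactly one object as described in the question. The helper name head_rdata and the example object big are hypothetical. Like the readRDS version, this still reads the whole file; it just avoids leaving the full object in your workspace.

```r
# Hypothetical helper: load an .RData file into a scratch environment
# and return only the first n rows of the single object it contains
head_rdata <- function(file, n = 6L) {
  env <- new.env()
  obj_names <- load(file, envir = env)  # load() returns the object names
  stopifnot(length(obj_names) == 1L)    # assumption: one object per file
  head(env[[obj_names]], n)
}

# Small demonstration with a throwaway file
rdata_path <- tempfile(fileext = ".RData")
big <- mtcars
save(big, file = rdata_path)

first_5 <- head_rdata(rdata_path, n = 5)
```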

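Try read_lines_raw from the readr package. Note that it returns the first lines of the file as raw vectors, not the first rows of the stored object:

library(readr)
first_1000 <- read_lines_raw(rdata_filename, skip = 0, n_max = 1000)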
