

Querying out-of-memory 60 GB TSVs in R on the first column: which database/method?

I have 6 large TSV matrices of 60 GB each (uncompressed), containing 20 million rows x 501 columns: the first column is an index/integer that is basically the row number (so not even necessary), and the remaining 500 columns are numeric (floats with 4 decimals, e.g. 1.0301). All the TSVs have the same number of rows, and the rows correspond to each other across files.

I need to extract rows by row number.

I need to extract up to 5,000 contiguous rows or up to 500 non-contiguous rows, so not millions. Ideally the storage would also use some kind of compression to reduce the 60 GB size, so maybe not SQL? What would be the best way to do this?

  • One method I tried is to separate them into 100 gzipped files, index them using tabix, and then query them, but this is too slow for my needs (500 random rows took 90 seconds).

  • I read about the ff package, but have not found how to index by the first column.

  • Are there other ways?

Thanks so much.

I would use fread() from the data.table package.

Using the parameters skip and nrows you can control the starting line to read (skip) and the number of rows to read (nrows).
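A minimal sketch of this, assuming one of the files is named matrix1.tsv (a hypothetical name) and has no header line; the contiguous block here starts at row 100,000:

```r
library(data.table)

# Read a contiguous block of 5,000 rows starting at row 100,000.
# fread() scans past the skipped lines but never loads the full file.
block <- fread("matrix1.tsv",
               skip   = 99999,   # lines to skip before reading
               nrows  = 5000,    # rows to read from there
               header = FALSE,
               sep    = "\t")

# For non-contiguous rows, skip/nrows select only a single range,
# so one fread() call per requested row (or per run of consecutive
# rows) is needed:
rows_wanted <- c(17, 5024, 1203344)
picked <- rbindlist(lapply(rows_wanted, function(i) {
  fread("matrix1.tsv", skip = i - 1, nrows = 1,
        header = FALSE, sep = "\t")
}))
```

Note that fread() still has to scan through the skipped lines to find the starting offset, so deep skips are not free; for repeated random access, a binary format with fixed-width rows that supports direct seeking would be faster.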

If you want to explore the tidyverse approach, I recommend this solution: R: Read in random rows from file using fread or equivalent?
