Reading a 40 GB CSV file into R using bigmemory

Question

The title is pretty self-explanatory, but I will elaborate as follows. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at this problem. I am trying to solve it using the bigmemory package, but I have been running into difficulties.

Present constraints:

Using a Linux server with 16 GB of RAM
Size of CSV: 40 GB
Number of rows: 67,194,126,114

Challenges:

Need to be able to randomly sample smaller datasets (5-10 million rows) from a big.matrix or equivalent data structure.
Need to be able to remove any row that contains even a single instance of NULL while parsing into a big.matrix or equivalent data structure.

So far, results are not good. Evidently I am failing at something, or maybe I just don't understand the bigmemory documentation well enough. So, I thought I would ask here to see if anyone has used

Any tips or advice on this line of attack? Or should I change to something else? I apologize if this question is very similar to the previous one, but my data is about 20 times larger than in the previous questions. Thanks!

Answer 1

I don't know about bigmemory, but to satisfy your challenges you don't need to read the whole file in. Simply pipe it through some bash/awk/sed/python/whatever processing that does the steps you want, i.e. throw out NULL lines and randomly select N lines, and then read the result in.

Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; total = 1000000;}
                       !/NULL/{if (rand() < m/(total - NR + 1)) {
                                 print; m--;
                                 if (m == 0) exit;
                              }}\' filename'
        )) -> df

It wasn't obvious to me what you meant by NULL, so I took it literally, but it should be easy to modify the pattern to fit your needs.
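The acceptance test in the awk command (keep a line with probability "lines still needed" divided by "lines still remaining") is a form of selection sampling. The sketch below restates the same idea in plain R on indices rather than file lines; the function name `selection_sample` and its arguments are made up for illustration and are not part of the answer. One caveat worth keeping in mind: in awk, NR counts every line read, including lines skipped by the NULL filter, so the sample is only guaranteed to reach the requested size if the total reflects the lines that can actually be selected.

```r
# Selection sampling (illustrative sketch, hypothetical helper):
# visit items 1..N once and keep item i with probability
# (items still needed) / (items still remaining).
selection_sample <- function(N, m) {
    stopifnot(m <= N)
    picked <- integer(0)
    need <- m
    for (i in seq_len(N)) {
        # When need equals the number of remaining items the ratio is 1,
        # so every remaining item is taken and exactly m items are returned.
        if (runif(1) < need / (N - i + 1)) {
            picked <- c(picked, i)
            need <- need - 1
            if (need == 0) break
        }
    }
    picked
}

set.seed(42)
idx <- selection_sample(1000, 10)
```

The returned indices come out in increasing order, which is why this approach works in a single streaming pass over a file.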

Answer 2

This is a pure R solution to the challenge of sampling from a large text file; it has the additional merit of drawing a random sample of exactly n lines. It is not too inefficient, though lines are parsed to character vectors, which is relatively slow.

We start with a function signature, where we provide a file name, the size of the sample we want to draw, a seed for the random number generator (so that we can reproduce our random sample!), an indication of whether there's a header line, and then a "reader" function that we'll use to parse the sample into the object seen by R, including additional arguments ... that the reader function might need.

fsample <-
    function(fname, n, seed, header=FALSE, ..., reader=read.csv)
{

The function seeds the random number generator, opens a connection (closing it again when the function exits), and reads in the (optional) header line.

    set.seed(seed)
    con <- file(fname, open="r")
    on.exit(close(con))  # release the file handle when the function returns
    hdr <- if (header) {
        readLines(con, 1L)
    } else character()

The next step is to read in a chunk of n lines, initializing a counter of the total number of lines seen.

    buf <- readLines(con, n)
    n_tot <- length(buf)

Continue to read in chunks of n lines, stopping when there is no further input.

    repeat {
        txt <- readLines(con, n)
        if ((n_txt <- length(txt)) == 0L)
            break

For each chunk, draw a sample of n_keep lines, where the expected number of lines kept is proportional to the fraction of all lines seen so far that fall in the current chunk. This ensures that lines are sampled uniformly over the file. If there are no lines to keep, move to the next chunk.

        n_tot <- n_tot + n_txt
        n_keep <- rbinom(1, n_txt, n_txt / n_tot)
        if (n_keep == 0L)
            next

Choose the lines to keep and the lines to replace, and update the buffer.

        keep <- sample(n_txt, n_keep)
        drop <- sample(n, n_keep)
        buf[drop] <- txt[keep]
    }

When data input is done, we parse the buffer using the reader and return the result.

    reader(textConnection(c(hdr, buf)), header=header, ...)
}

The solution could be made more efficient, but a bit more complicated, by using readBin and searching for line breaks, as suggested by Simon Urbanek on the R-devel mailing list. Here's the full solution:

fsample <-
    function(fname, n, seed, header=FALSE, ..., reader = read.csv)
{
    set.seed(seed)
    con <- file(fname, open="r")
    on.exit(close(con))  # release the file handle when the function returns
    hdr <- if (header) {
        readLines(con, 1L)
    } else character()

    buf <- readLines(con, n)
    n_tot <- length(buf)

    repeat {
        txt <- readLines(con, n)
        if ((n_txt <- length(txt)) == 0L)
            break

        n_tot <- n_tot + n_txt
        n_keep <- rbinom(1, n_txt, n_txt / n_tot)
        if (n_keep == 0L)
            next

        keep <- sample(n_txt, n_keep)
        drop <- sample(n, n_keep)
        buf[drop] <- txt[keep]
    }

    reader(textConnection(c(hdr, buf)), header=header, ...)
}
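To connect this with the question's second challenge (dropping rows that contain NULL), here is a minimal sketch of a variant that filters each chunk before sampling. It is not the answer's function: the name `fsample_clean` and the `chunk` argument are hypothetical, "NULL" is assumed to appear as literal text in the file, and the binomial draw is swapped for a hypergeometric one (`rhyper`), which is the exact distribution of how many sampled lines fall in the new chunk when chunk sizes vary after filtering.

```r
# Hypothetical variant of fsample(): uniform sample of n lines, skipping
# any line that contains the literal text "NULL" (an assumption).
fsample_clean <- function(fname, n, seed, header = FALSE, ...,
                          reader = read.csv, chunk = 10000L) {
    set.seed(seed)
    con <- file(fname, open = "r")
    on.exit(close(con))                  # always release the file handle
    hdr <- if (header) readLines(con, 1L) else character()

    buf <- character()   # current sample, at most n lines
    n_seen <- 0          # clean lines seen so far (double, avoids overflow)
    repeat {
        raw <- readLines(con, chunk)
        if (length(raw) == 0L) break     # end of file
        txt <- raw[!grepl("NULL", raw, fixed = TRUE)]
        n_txt <- length(txt)
        if (n_txt == 0L) next            # whole chunk was filtered out

        if (length(buf) < n) {
            # Buffer not full: it holds every clean line seen so far, so a
            # draw from buffer + chunk is a uniform sample of all of them.
            pool <- c(buf, txt)
            buf <- pool[sample.int(length(pool), min(n, length(pool)))]
        } else {
            # How many of n lines drawn without replacement from
            # n_seen old + n_txt new lines come from the new chunk.
            k <- rhyper(1L, n_txt, n_seen, n)
            buf[sample.int(n, k)] <- txt[sample.int(n_txt, k)]
        }
        n_seen <- n_seen + n_txt
    }
    tc <- textConnection(c(hdr, buf))
    on.exit(close(tc), add = TRUE)
    reader(tc, header = header, ...)
}

# Demo on a small temporary file; the data is made up for illustration.
tmp <- tempfile(fileext = ".csv")
ids <- 1:1000
writeLines(c("id,val",
             sprintf("%d,%s", ids, ifelse(ids %% 10 == 0, "NULL", "1.5"))),
           tmp)
smp <- fsample_clean(tmp, n = 50, seed = 1, header = TRUE, chunk = 64L)
```

Because the chunk size is decoupled from the sample size here, memory use is bounded by `n + chunk` lines regardless of how large the file is.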
