
Process large file in chunks

I have a large file which has two numbers per line and is sorted by the second column. 我有一个大文件,每行有两个数字,并按第二列排序。 I make a dictionary of lists keyed on the first column. 我制作了第一列上列出的列表的字典。

My code looks like:

from collections import defaultdict

d = defaultdict(list)
for line in fin:  # note: "for line in fin.readline()" would iterate over the characters of a single line
    vals = line.split()
    d[vals[0]].append(vals[1])
process(d)

However, the input file is too large, so d will not fit into memory.

To get around this I can in principle read the file in chunks, but I need an overlap between the chunks so that process(d) won't miss anything.

In pseudocode I could do the following:

  1. Read 100 lines, building the dictionary d.
  2. Process the dictionary d.
  3. Delete everything from d that is not within 10 of the maximum second-column value seen so far.
  4. Repeat, making sure we never hold more than 100 lines' worth of data in d at any time.

Is there a nice way to do this in Python?

Update. More details of the problem: I will use d while reading in a second file of pairs, where I will output a pair depending on how many values in the list associated with the pair's first value in d are within 10. The second file is also sorted by the second column.

Fake data. Let's say we can fit 5 lines of data into memory and we need the overlap in values to be 5 as well.

1 1
2 1
1 6
7 6
1 16

So now d is {1:[1,6,16], 2:[1], 7:[6]}.

For the next chunk we only need to keep the last value (as 16 - 6 > 5). So we would set d to be {1:[16]} and continue reading the next 4 lines.
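Here is a sketch of the kind of loop I have in mind. The names process_in_chunks, max_lines, and window are my own, it assumes the values are integers, and it assumes process() can safely be called on overlapping dictionaries:

from collections import defaultdict

def process_in_chunks(fin, process, max_lines=5, window=5):
    """Build d at most max_lines lines at a time; after each process(d),
    drop values more than `window` below the largest second-column value
    seen so far (the file is sorted on that column)."""
    d = defaultdict(list)
    n_lines = 0       # lines currently held in d
    pending = False   # True if lines were added since the last process()
    for line in fin:
        key, val = line.split()
        val = int(val)
        d[key].append(val)
        max_seen = val          # second column is sorted, so this is the max
        n_lines += 1
        pending = True
        if n_lines >= max_lines:
            process(d)
            pending = False
            # trim: keep only values within `window` of the current maximum
            for k in list(d):
                d[k] = [v for v in d[k] if max_seen - v <= window]
                if not d[k]:
                    del d[k]
            # if nothing could be trimmed, d may briefly exceed max_lines
            n_lines = sum(len(vs) for vs in d.values())
    if pending:
        process(d)  # final partial chunk

On the fake data above, with max_lines=5 and window=5, the first process() sees {1:[1,6,16], 2:[1], 7:[6]} and the trim leaves {1:[16]}, as described.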

Have you tried out the Pandas library, and in particular reading your data into a DataFrame then using groupby on the first column?

Pandas will let you do a lot of bulk operations effectively across your data, and you can read it in lazily if you want to.
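A minimal sketch of what that could look like, assuming whitespace-separated columns, a file name of data.txt, and a chunk size of 100 (none of which are given in the question):

import pandas as pd

# read the file lazily, 100 rows at a time
for chunk in pd.read_csv("data.txt", sep=r"\s+", header=None,
                         names=["key", "value"], chunksize=100):
    # group each chunk's second-column values by the first column
    grouped = chunk.groupby("key")["value"].apply(list).to_dict()
    process(grouped)

Note this sketch does not handle the overlap between chunks; the trimming step from the question would still need to be applied between iterations.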

You don't need a defaultdict unless something strange is going on with the file, and you haven't mentioned what that is. Instead, use a list, which keeps your data in line order; that way you can process it using the appropriate slices, thus:

d = []
for line in fin:          # iterate over lines, not fin.readline()
    vals = line.split()
    d.append(vals)
    if len(d) == 100:     # a full chunk has been collected
        process(d)
        d = d[90:]        # keep the last 10 lines as overlap
process(d)                # handle the final partial chunk

You could do something like this:

n_process = 100   # chunk size in lines
n_overlap = 10    # lines of overlap carried into the next chunk
data_chunk = []
for line in fin:  # iterate over lines, not fin.readline()
    vals = line.split()
    data_chunk.append(vals)
    if len(data_chunk) == n_process:
        process(data_chunk)
        data_chunk = data_chunk[-n_overlap:]  # keep the overlap
process(data_chunk)  # process the final partial chunk

When using a dictionary, data can be overwritten if the same first-column number occurs multiple times in a data sample. Also note that you would need an OrderedDict if insertion order matters, since before Python 3.7 a plain dict did not preserve insertion order. However, in my opinion, needing OrderedDict is in most cases a sign of bad code design.
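A minimal illustration of the difference, using made-up pairs: a plain dict assignment loses the earlier value, while the question's defaultdict(list) appends instead:

from collections import defaultdict

pairs = [("1", "1"), ("1", "6")]

plain = {}
for k, v in pairs:
    plain[k] = v          # second occurrence of "1" overwrites the first
# plain == {'1': '6'}

d = defaultdict(list)
for k, v in pairs:
    d[k].append(v)        # values accumulate instead of being overwritten
# dict(d) == {'1': ['1', '6']}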

And by the way: we still don't know why you're trying to do it this way…
