
Process large file in chunks

I have a large file which has two numbers per line and is sorted by the second column. 我有一个大文件,每行有两个数字,并按第二列排序。 I make a dictionary of lists keyed on the first column. 我制作了第一列上列出的列表的字典。

My code looks like:

from collections import defaultdict

d = defaultdict(list)
for line in fin:  # note: "for line in fin.readline()" would iterate over the characters of a single line
    vals = line.split()
    d[vals[0]].append(vals[1])
process(d)

However, the input file is too large, so d will not fit into memory.

To get around this I can in principle read the file in chunks, but I need an overlap between the chunks so that process(d) won't miss anything.

In pseudocode I could do the following:

  1. Read 100 lines, building the dictionary d.
  2. Process the dictionary d.
  3. Delete everything from d that is not within 10 of the maximum second-column value seen so far.
  4. Repeat, making sure we never hold more than 100 lines' worth of data in d at any time.

Is there a nice way to do this in Python?

Update. More details of the problem: I will use d while reading in a second file of pairs, where I will output a pair depending on how many values in the list associated with the pair's first value in d are within 10. The second file is also sorted by the second column.

Fake data. Let's say we can fit 5 lines of data into memory and we need the overlap in values to be 5 as well.

1 1
2 1
1 6
7 6
1 16

So now d is {1:[1,6,16], 2:[1], 7:[6]}.

For the next chunk we only need to keep the last value (as 16 - 6 > 5). So we would set d to be {1:[16]} and continue reading the next 4 lines.
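Here is a sketch of the kind of loop I have in mind. The names process_in_chunks, max_lines, and window are my own, it assumes the values are integers, and it assumes process() can safely be called on overlapping dictionaries:

from collections import defaultdict

def process_in_chunks(fin, process, max_lines=5, window=5):
    """Build d at most max_lines lines at a time; after each process(d),
    drop values more than `window` below the largest second-column value
    seen so far (the file is sorted on that column)."""
    d = defaultdict(list)
    n_lines = 0       # lines currently held in d
    pending = False   # True if lines were added since the last process()
    for line in fin:
        key, val = line.split()
        val = int(val)
        d[key].append(val)
        max_seen = val          # second column is sorted, so this is the max
        n_lines += 1
        pending = True
        if n_lines >= max_lines:
            process(d)
            pending = False
            # trim: keep only values within `window` of the current maximum
            for k in list(d):
                d[k] = [v for v in d[k] if max_seen - v <= window]
                if not d[k]:
                    del d[k]
            # if nothing could be trimmed, d may briefly exceed max_lines
            n_lines = sum(len(vs) for vs in d.values())
    if pending:
        process(d)  # final partial chunk

On the fake data above, with max_lines=5 and window=5, the first process() sees {1:[1,6,16], 2:[1], 7:[6]} and the trim leaves {1:[16]}, as described.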

Have you tried out the Pandas library, and in particular reading your data into a DataFrame then using groupby on the first column?

Pandas will let you do a lot of bulk operations effectively across your data, and you can read it in lazily if you want to.
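A minimal sketch of what that could look like, assuming whitespace-separated columns, a file name of data.txt, and a chunk size of 100 (none of which are given in the question):

import pandas as pd

# read the file lazily, 100 rows at a time
for chunk in pd.read_csv("data.txt", sep=r"\s+", header=None,
                         names=["key", "value"], chunksize=100):
    # group each chunk's second-column values by the first column
    grouped = chunk.groupby("key")["value"].apply(list).to_dict()
    process(grouped)

Note this sketch does not handle the overlap between chunks; the trimming step from the question would still need to be applied between iterations.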

You don't need a defaultdict unless something strange is going on with the file, and you haven't mentioned what that is. Instead, use a list, which keeps your data in line order; that way you can process it using the appropriate slices, thus:

d = []
for line in fin:          # iterate over lines, not fin.readline()
    vals = line.split()
    d.append(vals)
    if len(d) == 100:     # a full chunk has been collected
        process(d)
        d = d[90:]        # keep the last 10 lines as overlap
process(d)                # handle the final partial chunk

You could do something like this:

n_process = 100   # chunk size in lines
n_overlap = 10    # lines of overlap carried into the next chunk
data_chunk = []
for line in fin:  # iterate over lines, not fin.readline()
    vals = line.split()
    data_chunk.append(vals)
    if len(data_chunk) == n_process:
        process(data_chunk)
        data_chunk = data_chunk[-n_overlap:]  # keep the overlap
process(data_chunk)  # process the final partial chunk

When using a dictionary, data can be overwritten if the same first-column number occurs multiple times in a data sample. Also note that you would need an OrderedDict if insertion order matters, since before Python 3.7 a plain dict did not preserve insertion order. However, in my opinion, needing OrderedDict is in most cases a sign of bad code design.
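A minimal illustration of the difference, using made-up pairs: a plain dict assignment loses the earlier value, while the question's defaultdict(list) appends instead:

from collections import defaultdict

pairs = [("1", "1"), ("1", "6")]

plain = {}
for k, v in pairs:
    plain[k] = v          # second occurrence of "1" overwrites the first
# plain == {'1': '6'}

d = defaultdict(list)
for k, v in pairs:
    d[k].append(v)        # values accumulate instead of being overwritten
# dict(d) == {'1': ['1', '6']}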

And by the way: we still don't know why you're trying to do it this way…
