Process large file in chunks
I have a large file which has two numbers per line and is sorted by the second column. I make a dictionary of lists keyed on the first column.
My code looks like:
from collections import defaultdict

d = defaultdict(list)
for line in fin:  # not fin.readline(), which would iterate over the characters of one line
    vals = line.split()
    d[vals[0]].append(vals[1])
process(d)
However, the input file is too large, so d will not fit into memory.
To get round this I can in principle read the file in chunks, but I need to make an overlap between the chunks so that process(d) won't miss anything.
In pseudocode I could do the following:

1. Read the first 100 lines of the file into d and call process(d).
2. Remove everything from d that is not within 10 of the max value seen so far.
3. Keep reading, making sure there are never more than 100 lines of data in d at any time.

Is there a nice way to do this in Python?
Update. More details of the problem: I will use d when reading in a second file of pairs, where I will output a pair depending on how many values in the list associated with its first value in d are within 10 of its second value. The second file is also sorted by the second column.
Fake data. Let's say we can fit 5 lines of data into memory and we need the overlap in values to be 5 as well.
1 1
2 1
1 6
7 6
1 16
So now d is {1:[1,6,16], 2:[1], 7:[6]}.
For the next chunk we only need to keep the last value (as 16-6 > 5), so we would set d to be {1:[16]} and continue reading the next 4 lines.
Have you tried the Pandas library, and in particular reading your data into a DataFrame and then using groupby on the first column?
Pandas will let you do a lot of bulk operations efficiently across your data, and you can read it in lazily if you want to.
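A sketch of that lazy approach, using read_csv's chunksize to get an iterator of DataFrames instead of loading everything at once. The sample data and column names are placeholders; in practice you would pass a file path rather than a StringIO:

```python
import io
import pandas as pd

# Sample data in the question's format (pairs sorted by the second column);
# a real run would use pd.read_csv("large_file.txt", ...) instead.
data = io.StringIO("1 1\n2 1\n1 6\n7 6\n1 16\n")

# chunksize makes read_csv yield DataFrames lazily, here at most 3 rows each.
per_chunk = []
for chunk in pd.read_csv(data, sep=r"\s+", header=None,
                         names=["key", "val"], chunksize=3):
    # groupby collects the second-column values per first-column key
    per_chunk.append(chunk.groupby("key")["val"].apply(list).to_dict())
```

Note that groupby here only sees one chunk at a time, so keys spanning a chunk boundary would still need the overlap handling described in the question.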
You don't need defaultdict unless something strange is going on with the file, but you haven't mentioned what that is. Instead, use a list, which keeps your data in line order; that way you can process it using the appropriate slices, thus:
d = []
for line in fin:  # not fin.readline(), which would iterate over one line's characters
    vals = line.split()
    d.append(vals)
    if not len(d) % 100:
        process(d)
        d = d[90:]  # keep the last 10 lines as the overlap
process(d)          # handle the final partial chunk
You could do something like this:
n_process = 100  # chunk size
n_overlap = 10   # lines shared between consecutive chunks

data_chunk = []
for line in fin:  # not fin.readline(), which would iterate over one line's characters
    vals = line.split()
    data_chunk.append(vals)
    if len(data_chunk) == n_process:
        process(data_chunk)
        data_chunk = data_chunk[-n_overlap:]
process(data_chunk)  # don't forget the final partial chunk
When using a dictionary, data can be overwritten if the same first-column number occurs more than once within a data sample. Also notice that you would need an OrderedDict, since a plain dict does not guarantee order (note: since Python 3.7, dict does preserve insertion order). However, in my opinion, OrderedDict is in most cases a sign of bad code design.
And by the way: we still don't know why you're trying to do it this way…