
Operating on a huge table: group of rows at a time using python

I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to load the whole table (~10 GB) into memory, which I cannot afford.

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
B   L   LL
C   Q   QQ
C   L   LL
C   LL  LL
C   Q   QQ
C   L   LL
C   LL  LL

The name column is sorted, and I only care about one name at a time. I want to apply the following criteria to the "change" column to filter each name:

  1. Check whether the number of "QQ" rows overwhelms the number of "LL" rows. Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ", then discard the "QQ" rows instead. (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ; both are fine.)

Resulting table:

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
C   L   LL
C   LL  LL
C   L   LL
C   LL  LL
  2. Compare "change" to "index": if no change occurs (e.g. LL in both columns), the row is not valid. Further, the remaining valid QQ or LL changes have to be continuous for >= 3 times. Therefore C only has 2 valid changes, and it will be filtered out. (See the sketch after the table below.)

Resulting table:

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
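
To make the two criteria concrete, this is roughly what I mean, written against one name's rows as a small pandas DataFrame (filter_one_name is just an illustrative helper of mine, and I am reading "continuous for >= 3 times" loosely as "at least 3 valid changes remain"):

import pandas as pd

def filter_one_name(group):
    # group: DataFrame with columns name, index, change for a single name
    n_qq = (group["change"] == "QQ").sum()
    n_ll = (group["change"] == "LL").sum()

    # Criterion 1: if one side leads by >= 2, drop the other side's rows.
    if n_qq - n_ll >= 2:
        group = group[group["change"] != "LL"]
    elif n_ll - n_qq >= 2:
        group = group[group["change"] != "QQ"]

    # Criterion 2: rows where index == change mean "no change" and are invalid.
    group = group[group["index"] != group["change"]]

    # Keep the name only if at least 3 valid changes remain.
    return group if len(group) >= 3 else None

With the sample data above, A comes back with its 3 QQ rows and C comes back as None.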

I wonder if there is a way to work on the table one name at a time and release the memory after each name. (And ideally apply the two criteria in one pass rather than step by step.) Any hint or suggestion will be appreciated!

Because the file is sorted by "name", you can read the file row by row:

def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # At least one row has change == index ("no change"),
            # so this whole "name" can be thrown out; return here.
            return

    for index in group_by:
        for change in group_by[index]:
            output_file.write("%s\t%s\t%s\n" % (name, index, change))

current_name = None
current_data = []

input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []

    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()

I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first, as that will quickly rule out entire "name"s.
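
If criterion #1 just means "whichever of QQ/LL leads by at least 2 wins, drop the loser's rows", one way to fill in that blank might be a small helper like the following (apply_step1 is a name I made up; it would be called where the step 1 comment sits, e.g. group_by = apply_step1(group_by)). This is only my reading of the question, not something verified against the real data:

def apply_step1(group_by):
    # Count QQ vs LL across the whole name and, when one side leads
    # by at least 2, drop the other side's changes entirely.
    all_changes = [c for changes in group_by.values() for c in changes]
    n_qq = all_changes.count("QQ")
    n_ll = all_changes.count("LL")
    if n_qq - n_ll >= 2:
        drop = "LL"
    elif n_ll - n_qq >= 2:
        drop = "QQ"
    else:
        return group_by
    return {i: [c for c in cs if c != drop]
            for i, cs in group_by.items()
            if any(c != drop for c in cs)}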

Since your file is sorted and you only seem to be operating on the sub-segments by name, perhaps just use Python's itertools.groupby and create a table for each name segment as you go:

from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        # Collect the columns for this name segment only.
        seg_data = {col: [] for col in header}
        for e in segment:
            for col, v in zip(header, e.split()):
                seg_data[col].append(v)

        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()

Prints:

A
  change index name
0     QQ     Q    A
1     QQ     Q    A
2     QQ     Q    A

B
  change index name
0     LL     L    B

C
  change index name
0     QQ     Q    C
1     LL     L    C
2     LL    LL    C
3     QQ     Q    C
4     LL     L    C
5     LL    LL    C

Then the largest piece of memory you will need is dictated by the largest contiguous group, not by the size of the file.

You can use roughly half the memory of that method by appending to the data frame row by row instead of building the intermediate dict:

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            # Build a one-row frame per line; DataFrame.append was removed
            # in pandas 2.0, so concatenate instead.
            df = pd.DataFrame({col: v for col, v in zip(header, e.split())}, index=[idx])
            seg_fram = pd.concat([seg_fram, df])

(might be slower though...)

If that does not work, consider using a disk database.
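
For example, here is a minimal sketch using Python's standard-library sqlite3 module (the database file name, input path, and table name are placeholders): load the rows once, then pull back one name at a time so only that group is ever held in memory.

import sqlite3

conn = sqlite3.connect("changes.db")   # on-disk database file (placeholder name)
conn.execute("CREATE TABLE IF NOT EXISTS changes (name TEXT, idx TEXT, change TEXT)")

with open("huge_table.tsv") as f:      # placeholder input path
    next(f)                            # skip the header line
    conn.executemany(
        "INSERT INTO changes VALUES (?, ?, ?)",
        (line.rstrip("\n").split("\t") for line in f),
    )
conn.execute("CREATE INDEX IF NOT EXISTS changes_name ON changes (name)")
conn.commit()

# Iterate one name at a time; only that group is held in memory.
for (name,) in conn.execute("SELECT DISTINCT name FROM changes"):
    rows = conn.execute(
        "SELECT idx, change FROM changes WHERE name = ?", (name,)
    ).fetchall()
    # ...apply the two filter criteria to `rows` here...

conn.close()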
