Operating on a huge table: one group of rows at a time using Python

Question

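I have a huge table file that looks like the following. To work on individual products (names), I tried pandas groupby, but it seems to load the whole table (~10 GB) into memory, which I cannot afford.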

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
B   L   LL
C   Q   QQ
C   L   LL
C   LL  LL
C   Q   QQ
C   L   LL
C   LL  LL

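The name column is well sorted, and I only care about one name at a time. I want to filter each name using the following criteria on the "change" column:

1. Check whether the number of "QQ" rows overwhelms the number of "LL" rows. Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ", then discard the rows with "QQ". (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ. They are both fine.)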

Resulting table:

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
C   L   LL
C   LL  LL
C   L   LL
C   LL  LL
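
One possible reading of this first criterion is sketched below. The names `filter_majority` and `min_lead` are made up for illustration, and dropping a name entirely when neither value leads by at least 2 is an assumption inferred from B disappearing from the table above.

```python
from collections import Counter


def filter_majority(rows, min_lead=2):
    """Keep only the dominant "change" value for one name.

    rows is a list of (index, change) tuples that all belong to one name.
    """
    counts = Counter(change for _, change in rows)
    lead = counts["QQ"] - counts["LL"]
    if lead >= min_lead:
        keep = "QQ"
    elif -lead >= min_lead:
        keep = "LL"
    else:
        # Assumption: without a clear winner the whole name is dropped.
        return []
    return [(index, change) for index, change in rows if change == keep]
```

Because the function only needs the rows of a single name, it can be called once per group while the file is streamed.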

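2. Compare "change" with "index": if no change occurs (e.g. LL in both columns), the row is not valid. Further, for the valid changes, the remaining QQ or LL has to occur >= 3 times in a row. C therefore has only 2 valid changes, so it is filtered out.

Resulting table: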

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
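
The second criterion could look roughly like the sketch below. It assumes that ">= 3 times in a row" means a run of at least three consecutive valid rows among the rows left after the first filter, which is only one interpretation. `filter_valid_runs` and `min_run` are hypothetical names.

```python
from itertools import groupby


def filter_valid_runs(rows, min_run=3):
    """Drop rows where index == change, then require a long enough run.

    rows is a list of (index, change) tuples for one name.
    """
    # A row is valid when the "change" value differs from the "index" value.
    valid_flags = [index != change for index, change in rows]
    # Length of the longest stretch of consecutive valid rows.
    longest = max(
        (sum(1 for _ in run) for is_valid, run in groupby(valid_flags) if is_valid),
        default=0,
    )
    if longest < min_run:
        return []
    return [row for row, ok in zip(rows, valid_flags) if ok]
```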

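I wonder whether there is a way to work through the table name by name and release the memory after each name (and without having to apply the two criteria step by step). Any hint or suggestion would be appreciated!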

Answer 1

Because the file is sorted by "name", you can read the file row by row:

def process_name(name, data, output_file):
    # Collect the "change" values seen for each "index" value.
    group_by = {}
    for index, change in data:
        group_by.setdefault(index, []).append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index, changes in group_by.items():
        # group_by[index] is a list, so test membership rather than equality
        if index in changes:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return

    # Write one output row per surviving (index, change) pair.
    for index, changes in group_by.items():
        for change in changes:
            output_file.write("%s\t%s\t%s\n" % (name, index, change))

current_name = None
current_data = []

# input_filename and output_filename are the paths to your files
with open(input_filename, "r") as input_file, \
        open(output_filename, "w") as output_file:
    header = input_file.readline()  # skip the header row
    for row in input_file:
        cols = row.strip().split("\t")
        name, index, change = cols[0], cols[1], cols[2]
        if name != current_name:
            # a new name starts: flush the rows buffered for the previous one
            if current_name is not None:
                process_name(current_name, current_data, output_file)
            current_name = name
            current_data = []

        current_data.append((index, change))

    # process what's left in the buffer
    if current_name is not None:
        process_name(current_name, current_data, output_file)

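I don't entirely understand the logic you explained in #1, so I left it blank. I also suspect you may want to do step 2 first, since that will quickly rule out entire "names".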

Answer 2

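Since your file is already sorted and you only seem to operate on the sub-segments by name, perhaps just use Python's groupby and create a table for each name segment as you go: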

from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    # groupby yields one lazy segment per run of consecutive lines sharing a name
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {col: [] for col in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)

        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()

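Prints (older pandas versions sort the columns alphabetically, as shown here; newer versions keep the header order):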

A
  change index name
0     QQ     Q    A
1     QQ     Q    A
2     QQ     Q    A

B
  change index name
0     LL     L    B

C
  change index name
0     QQ     Q    C
1     LL     L    C
2     LL    LL    C
3     QQ     Q    C
4     LL     L    C
5     LL    LL    C

Then the most memory you will need is dictated by the largest contiguous group, not by the size of the file.
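
If pandas is not needed for the per-name logic, the same idea can be wrapped in a small generator. This is a minimal sketch, not part of the original answer: `iter_name_groups` is a hypothetical helper that assumes whitespace-separated lines with the header already consumed.

```python
from itertools import groupby


def iter_name_groups(lines):
    """Yield (name, rows) for each run of consecutive lines sharing a name.

    rows is a list of (index, change) tuples. Only one group is held in
    memory at a time.
    """
    split_rows = (line.split() for line in lines if line.strip())
    for name, group in groupby(split_rows, key=lambda cols: cols[0]):
        yield name, [(cols[1], cols[2]) for cols in group]
```

With an open file `f`, one would call `next(f)` to skip the header and then loop over `iter_name_groups(f)`. Each list of rows becomes garbage as soon as the loop moves on to the next name.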

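You can use about half the memory of the dictionary-based approach by appending to the data frame row by row instead of building the intermediate dict: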

# uses the same imports as the snippet above (groupby, pandas as pd)
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            # build a one-row frame for this line
            df = pd.DataFrame(dict(zip(header, e.split())), index=[idx])
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            seg_fram = pd.concat([seg_fram, df])

(Might be slower...)

If that doesn't work, consider using an on-disk database.
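
As a rough illustration of the database route, the sketch below uses the standard-library `sqlite3` module. The table name `changes`, the column name `idx` and both helper functions are assumptions made for this example, not something the answer specifies.

```python
import sqlite3


def load_table(conn, lines):
    """Load whitespace-separated (name, index, change) lines into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changes (name TEXT, idx TEXT, change TEXT)"
    )
    # executemany consumes the generator lazily, one row at a time.
    conn.executemany(
        "INSERT INTO changes VALUES (?, ?, ?)",
        (line.split() for line in lines if line.strip()),
    )
    # The index lets per-name lookups avoid scanning the whole table.
    conn.execute("CREATE INDEX IF NOT EXISTS changes_name ON changes (name)")
    conn.commit()


def rows_for(conn, name):
    """Return the (index, change) rows of one name in file order."""
    cur = conn.execute(
        "SELECT idx, change FROM changes WHERE name = ? ORDER BY rowid",
        (name,),
    )
    return cur.fetchall()
```

Connecting with a file path such as `sqlite3.connect("table.db")` (a placeholder name) keeps the data on disk, so only the rows returned by `rows_for` occupy memory.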

Answer 1: 2, accepted, 2015-09-12 23:57:21
Answer 2: 1, 2015-09-13 02:20:28
