
Operating on a huge table: processing one group of rows at a time using Python

I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to load the whole table (~10 GB) into memory, which I cannot afford.

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
B   L   LL
C   Q   QQ
C   L   LL
C   LL  LL
C   Q   QQ
C   L   LL
C   LL  LL

The name column is well sorted and I will only care about one name at a time. I hope to use the following criteria on column "change" to filter each name:

  1. Check whether the number of "QQ" rows overwhelms the number of "LL" rows. Basically, if the count of rows containing "QQ" minus the count of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from then on; if "LL" overwhelms "QQ" in the same way, then discard the rows with "QQ". (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ, so both names pass this step.)

Resulting table:

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
C   L   LL
C   LL  LL
C   L   LL
C   LL  LL
  2. Compare "change" to "index": if no change occurs (e.g. LL in both columns), the row is not valid. Further, the remaining valid QQ or LL changes have to be continuous for at least 3 rows. Therefore C only has 2 valid changes and will be filtered out. (Both criteria are sketched in code below, after the resulting table.)

Resulting table:

name    index   change
A   Q   QQ
A   Q   QQ
A   Q   QQ
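
For concreteness, this is roughly how I picture the two criteria for a single name, written with pandas just to pin down the logic (only a sketch, assuming one name's rows already fit in memory):

import pandas as pd

def filter_one_name(df):
    # df holds the rows of one name, with columns: name, index, change
    n_qq = (df["change"] == "QQ").sum()
    n_ll = (df["change"] == "LL").sum()

    # Criterion 1: if one change type leads the other by >= 2, drop the other.
    if n_qq - n_ll >= 2:
        df = df[df["change"] != "LL"]
    elif n_ll - n_qq >= 2:
        df = df[df["change"] != "QQ"]

    # Criterion 2: a row is a valid change only if index != change, and the
    # valid changes must form a run of at least 3 consecutive rows.
    is_valid = df["index"] != df["change"]
    run_id = (is_valid != is_valid.shift()).cumsum()
    longest_run = is_valid.groupby(run_id).sum().max() if len(df) else 0
    if longest_run < 3:
        return df.iloc[0:0]   # the whole name is filtered out
    return df[is_valid]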

I wonder if there is a way to work on the table one name at a time and release the memory after each name (and ideally not have to apply the two criteria in two separate passes). Any hint or suggestion will be appreciated!

Because the file is sorted by "name", you can read the file row-by-row:

def process_name(name, data, output_file):
    # data is a list of (index, change) tuples for a single name
    group_by = {}
    for index, change in data:
        group_by.setdefault(index, []).append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index, changes in group_by.items():
        if index in changes:
            # At least one row has "no change" (index == change), so this
            # whole "name" can be thrown out; return here.
            return

    for index, changes in group_by.items():
        for change in changes:
            output_file.write("%s\t%s\t%s\n" % (name, index, change))

current_name = None
current_data = []

input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []

    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()

I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first as that will quickly rule out entire "name"s.
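
If step #1 simply means "whichever of QQ/LL leads the other by at least 2 wins, and the losing type is dropped", that blank could be filled with something like the following (my reading of the question, operating on the same list of (index, change) tuples):

def apply_step1(data):
    # data is the list of (index, change) tuples collected for one name
    qq = sum(1 for _, change in data if change == "QQ")
    ll = sum(1 for _, change in data if change == "LL")
    if qq - ll >= 2:
        return [(i, c) for i, c in data if c != "LL"]
    if ll - qq >= 2:
        return [(i, c) for i, c in data if c != "QQ"]
    return data

process_name could then call data = apply_step1(data) right at the top, before building group_by.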

Since your file is sorted and you only seem to be operating on the sub segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:

from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {col: [] for col in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)

        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()

Prints:

A
  change index name
0     QQ     Q    A
1     QQ     Q    A
2     QQ     Q    A

B
  change index name
0     LL     L    B

C
  change index name
0     QQ     Q    C
1     LL     L    C
2     LL    LL    C
3     QQ     Q    C
4     LL     L    C
5     LL    LL    C

Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.

You can use 1/2 the memory of that method by appending to the data frame row by row instead of building the intermediate dict:

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            row = pd.DataFrame({key: v for key, v in zip(header, e.split())}, index=[idx])
            # DataFrame.append was removed in pandas 2.0; concat does the same job
            seg_fram = pd.concat([seg_fram, row])

(might be slower though...)

If that does not work, consider using an on-disk database.
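
For example, the standard library's sqlite3 module can load the file once into an on-disk database and then pull back one name at a time. This is only a sketch: the database filename and table name are placeholders, and the "index" column is renamed to "idx" because INDEX is an SQL keyword.

import sqlite3

conn = sqlite3.connect("table.db")   # placeholder filename for the on-disk database
conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, idx TEXT, change TEXT)")

with open('/tmp/so.csv') as f:
    next(f)   # skip the header line
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)",
                     (line.split() for line in f if line.strip()))
conn.commit()

# Each name can now be processed on its own; only one group is in memory at a time.
for (name,) in conn.execute("SELECT DISTINCT name FROM records").fetchall():
    group = conn.execute("SELECT idx, change FROM records WHERE name = ?",
                         (name,)).fetchall()
    # ... apply the two filter criteria to the (idx, change) rows in group ...

conn.close()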
