Operating on a huge table: group of rows at a time using python
I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to load the whole table (~10G) into memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is sorted, and I only care about one name at a time. I hope to use the following criteria on the "change" column to filter each name:
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to work on the table name by name, releasing the memory after each name. (And without having to apply the two criteria step by step.) Any hint or suggestion would be appreciated!
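For reference, the name-at-a-time streaming can also be sketched with pandas' own chunked reader. This is only a sketch under assumptions not in the question: the file is tab-separated with the header shown, and `process` stands in for whatever per-name filtering you want to apply. The trick is to hold back the last (possibly incomplete) name of each chunk and prepend it to the next one.

```python
import os
import tempfile

import pandas as pd

def for_each_name(path, process, chunksize=100_000):
    """Stream a name-sorted TSV and call process(name, frame) once per name."""
    leftover = None
    for chunk in pd.read_csv(path, sep="\t", chunksize=chunksize):
        if leftover is not None:
            chunk = pd.concat([leftover, chunk], ignore_index=True)
        # The last name in a chunk may continue into the next chunk,
        # so hold those rows back until the name is known to be complete.
        last = chunk["name"].iloc[-1]
        done = chunk[chunk["name"] != last]
        leftover = chunk[chunk["name"] == last]
        for name, frame in done.groupby("name", sort=False):
            process(name, frame)
    if leftover is not None and len(leftover):
        for name, frame in leftover.groupby("name", sort=False):
            process(name, frame)

# Tiny demo on a made-up file shaped like the question's table.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write("name\tindex\tchange\n")
    for n, count in [("A", 3), ("B", 1), ("C", 6)]:
        f.write(("%s\tQ\tQQ\n" % n) * count)
    path = f.name

seen = []
for_each_name(path, lambda n, fr: seen.append((n, len(fr))), chunksize=2)
print(seen)  # [('A', 3), ('B', 1), ('C', 6)]
os.unlink(path)
```

Peak memory is then one chunk plus the largest carried-over group, not the whole file.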
Because the file is sorted by "name", you can read the file row-by-row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        group_by.setdefault(index, []).append(change)
    # do the step 1 filter logic here
    # do the step 2 filter logic here
    for index in group_by:
        # placeholder check; group_by[index] is the list of changes
        if index == group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return
    for index in group_by:
        output_file.write("%s\t%s\t%s\n" % (name, index, group_by[index]))

current_name = None
current_data = []
input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name, index, change = cols[0], cols[1], cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))
# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)
input_file.close()
output_file.close()
I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first, as that will quickly rule out entire "name"s.
Since your file is sorted and you only seem to be operating on the sub segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {key: [] for key in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)
        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.
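That memory bound follows from `itertools.groupby` being lazy: it yields one `(key, iterator)` pair at a time, so only the group currently being consumed is materialized. A toy run (sample rows made up for illustration):

```python
from itertools import groupby

rows = ["A\tQ\tQQ\n", "A\tQ\tQQ\n", "A\tQ\tQQ\n", "B\tL\tLL\n"]
sizes = {}
# groupby streams the input; each group's iterator is consumed
# before the next group is produced, so nothing else is buffered.
for name, group in groupby(rows, key=lambda line: line.split("\t")[0]):
    sizes[name] = sum(1 for _ in group)

print(sizes)  # {'A': 3, 'B': 1}
```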
You can use about half the memory of that method by appending to the data frame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({key: v for key, v in zip(header, e.split())},
                              index=[idx])
            seg_fram = seg_fram.append(df)
(might be slower though...)
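A caveat for current readers: `DataFrame.append` was deprecated and removed in pandas 2.0. The same one-frame-per-segment idea is now usually written with a single `DataFrame.from_records` (or `pd.concat`) call per segment, sketched here on made-up rows:

```python
from itertools import groupby

import pandas as pd

lines = ["A\tQ\tQQ\n", "A\tQ\tQQ\n", "B\tL\tLL\n"]
header = ["name", "index", "change"]
frames = {}
for name, segment in groupby(lines, key=lambda line: line.split("\t")[0]):
    # One from_records call per segment replaces the removed append().
    frames[name] = pd.DataFrame.from_records(
        (line.split() for line in segment), columns=header)

print({k: len(v) for k, v in frames.items()})  # {'A': 2, 'B': 1}
```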
If that does not work, consider using a disk database.
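A minimal sketch of the disk-database route with the standard library's sqlite3 (the table and column names here are illustrative assumptions; `idx` avoids quoting the SQL keyword INDEX):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file path for on-disk storage
conn.execute("CREATE TABLE changes (name TEXT, idx TEXT, change TEXT)")
rows = [("A", "Q", "QQ"), ("A", "Q", "QQ"), ("B", "L", "LL")]
conn.executemany("INSERT INTO changes VALUES (?, ?, ?)", rows)
# An index on name makes the per-name lookups cheap on a 10G table.
conn.execute("CREATE INDEX ix_name ON changes(name)")

# Pull one name at a time; only that group's rows come into memory.
group = conn.execute(
    "SELECT idx, change FROM changes WHERE name = ?", ("A",)).fetchall()
print(group)  # [('Q', 'QQ'), ('Q', 'QQ')]
```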