Operating on a huge table: group of rows at a time using python
I have a huge table file that looks like the one below. In order to process the individual products (names), I tried to use pandas groupby, but it seems to put the whole table (~10 GB) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is well sorted, and I only care about one name at a time. I would like to filter each name using the following conditions on the "change" column:
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I am wondering if there is a way to process the table one name at a time and release the memory after each name (and ideally without having to apply the two criteria one step at a time). Any hints or suggestions would be appreciated!
Because the file is sorted by "name", you can read the file row by row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)
    # do the step 1 filter logic here
    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return
    for index in group_by:
        output_file.write("%s\t%s\t%s\n" % (name, index, group_by[index]))

current_name = None
current_data = []
input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))
# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)
input_file.close()
output_file.close()
I don't entirely understand the logic you explained for step 1, so I left it blank. I also suspect you may want to do step 2 first, since it can quickly rule out a whole "name".
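To make the streaming pattern above concrete, here is a self-contained sanity check on a tiny in-memory sample (a sketch only: the question never specifies the real filter logic, so the "no change" check, the sample rows, and the join of changes into one column are all illustrative assumptions):

```python
from io import StringIO


def process_name(name, data, out):
    """Group one name's rows by index; drop the name if any 'no change' row exists."""
    group_by = {}
    for index, change in data:
        group_by.setdefault(index, []).append(change)
    for index, changes in group_by.items():
        if index in changes:          # at least one "no change" row
            return                    # discard this whole name
    for index, changes in group_by.items():
        out.write("%s\t%s\t%s\n" % (name, index, ",".join(changes)))


def stream(infile, out):
    next(infile)                      # skip the header line
    current_name, current_data = None, []
    for row in infile:
        name, index, change = row.split()
        if name != current_name:
            if current_name is not None:
                process_name(current_name, current_data, out)
            current_name, current_data = name, []
        current_data.append((index, change))
    if current_name is not None:      # flush the last buffered name
        process_name(current_name, current_data, out)


sample = "name\tindex\tchange\nA\tQ\tQQ\nA\tQ\tQQ\nB\tL\tL\n"
out = StringIO()
stream(StringIO(sample), out)
print(out.getvalue())
```

Only one name's rows are ever buffered at a time, so peak memory is bounded by the largest name, not the file size. In this sample, B has a row whose change equals its index, so the whole B group is discarded.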
Since your file is already sorted and you only seem to be operating on per-name sub-segments, you could simply use Python's itertools.groupby and build a DataFrame for each name segment:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {col: [] for col in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)
        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
This prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
The most memory you will ever hold is then determined by the largest contiguous group, not by the size of the file.
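That memory claim follows from how itertools.groupby works: it yields each contiguous run lazily, so only the run you materialize is ever in memory (a minimal illustration with made-up rows, not the answer's exact code; note groupby only groups consecutive lines, which is why the sorted input matters):

```python
from itertools import groupby

lines = ["A\tQ\tQQ", "A\tQ\tQQ", "B\tL\tLL", "C\tQ\tQQ", "C\tL\tLL"]

sizes = {}
for name, run in groupby(lines, key=lambda l: l.split("\t")[0]):
    rows = list(run)          # only this one run is held in memory
    sizes[name] = len(rows)

print(sizes)                  # {'A': 2, 'B': 1, 'C': 2}
```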
You could use roughly half that method's memory by appending to the DataFrame row by row instead of building the intermediate dictionary:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({col: v for col, v in zip(header, e.split())}, index=[idx])
            # DataFrame.append was removed in pandas 2.0; concat does the same
            seg_fram = pd.concat([seg_fram, df])
(It will probably be slower, though...)
If that doesn't work, consider using an on-disk database.
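For instance, the standard library's sqlite3 can hold the table on disk and hand back one name at a time, mirroring the per-name processing above (a hedged sketch: the table name, the column names, and the use of ":memory:" are all illustrative; point connect() at a file path so the data never has to fit in RAM):

```python
import sqlite3

# ":memory:" keeps this demo self-contained; use a file path such as
# "table.db" so the table lives on disk instead of in RAM.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, idx TEXT, change TEXT)")
rows = [("A", "Q", "QQ"), ("A", "Q", "QQ"), ("B", "L", "LL")]
con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Pull back one name at a time, mirroring the groupby-per-name pattern.
per_name = {}
for (name,) in con.execute("SELECT DISTINCT name FROM t ORDER BY name"):
    per_name[name] = con.execute(
        "SELECT idx, change FROM t WHERE name = ?", (name,)
    ).fetchall()
print(per_name)
```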