Operating on a huge table: group of rows at a time using python
I have a huge table file like the one below. To process the individual products (names) I tried to use pandas groupby, but it seems to put the whole table (~10G) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is well sorted, and I only care about one name at a time. I hope to use the following criteria on the "change" column to filter each name:
Result table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Result table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to process the table name by name and release the memory after each name (and without having to apply the two criteria step by step). Any hint or suggestion would be appreciated!
Because the file is sorted by "name", you can read the file row by row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return

    for index in group_by:
        for change in group_by[index]:
            output_file.write("%s\t%s\t%s\n" % (name, index, change))
current_name = None
current_data = []

input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()
I don't fully understand the logic you described in #1, so I left it blank. I also suspect you may want to do step 2 first, because that can quickly rule out a whole "name".
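If you would rather stay in pandas, the same streaming idea can be sketched with `read_csv`'s `chunksize` parameter: process fixed-size chunks, and carry the rows of the last (possibly incomplete) name over to the next chunk so a group is never split. This is a minimal sketch on inline sample data, not the answer's method; the chunk size and the per-group work (here just counting rows) are placeholders:

```python
import io
import pandas as pd

# Inline stand-in for the real (sorted-by-name) table file.
data = io.StringIO(
    "name\tindex\tchange\n"
    "A\tQ\tQQ\n"
    "A\tQ\tQQ\n"
    "B\tL\tLL\n"
    "C\tQ\tQQ\n"
    "C\tL\tLL\n"
)

results = []
leftover = pd.DataFrame()
# Read the file in fixed-size chunks instead of all at once.
for chunk in pd.read_csv(data, sep="\t", chunksize=2):
    # Prepend rows carried over from the previous chunk.
    chunk = pd.concat([leftover, chunk], ignore_index=True)
    last_name = chunk["name"].iloc[-1]
    # The last name may continue in the next chunk, so hold its rows back.
    leftover = chunk[chunk["name"] == last_name]
    ready = chunk[chunk["name"] != last_name]
    for name, group in ready.groupby("name"):
        results.append((name, len(group)))  # placeholder for real filtering

# Flush the final group once the file is exhausted.
if not leftover.empty:
    for name, group in leftover.groupby("name"):
        results.append((name, len(group)))

print(results)
```

Memory stays bounded by the chunk size plus one group, which matches the goal of never holding the full ~10G table.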
Since your file is already sorted and you only seem to operate on sub-segments by name, perhaps just use Python's groupby and create a table for each name segment:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {col: [] for col in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)
        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
The most memory you will then ever hold is determined by the largest contiguous group, not by the size of the file.
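One subtlety worth noting (my addition, not part of the answer): `itertools.groupby` hands out lazy sub-iterators that share the underlying file, so each segment must be consumed before advancing to the next key; a saved segment iterator is empty once `groupby` has moved past it. A small demonstration:

```python
from itertools import groupby

lines = ["A 1", "A 2", "B 3"]

# Consuming each segment while iterating works as expected:
consumed = [(k, list(seg)) for k, seg in groupby(lines, key=lambda s: s[0])]

# Saving the segment iterators for later does not -- groupby has
# already advanced past them by the time we read them:
saved = [(k, seg) for k, seg in groupby(lines, key=lambda s: s[0])]
stale = [(k, list(seg)) for k, seg in saved]

print(consumed)  # [('A', ['A 1', 'A 2']), ('B', ['B 3'])]
print(stale)     # [('A', []), ('B', [])]
```

The answer's loop is safe because it reads every line of a segment before the outer loop continues.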
You can use roughly half the memory of that method by appending to the data frame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({col: v for col, v in zip(header, e.split())}, index=[idx])
            # DataFrame.append was removed in pandas 2.0; concat does the same job
            seg_fram = pd.concat([seg_fram, df])
(It will probably be slower, though...)
If that doesn't work, consider using an on-disk database.
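As a rough sketch of that on-disk route, the rows could be loaded once into SQLite and each name pulled out with a query, so only one group is ever in memory at a time. The table and column names below are made up for illustration (`idx` instead of `index`, which is a reserved word in SQL), and `:memory:` stands in for a real database file:

```python
import sqlite3

# Sample rows standing in for the real table file.
rows = [
    ("A", "Q", "QQ"), ("A", "Q", "QQ"),
    ("B", "L", "LL"),
    ("C", "Q", "QQ"), ("C", "L", "LL"),
]

# ":memory:" only for this demo; a real run would pass a file path
# so the data lives on disk rather than in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, idx TEXT, change TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

counts = {}
# Fetch one name's group at a time; SQLite reads it from storage on demand.
for (name,) in conn.execute("SELECT DISTINCT name FROM t ORDER BY name"):
    group = conn.execute(
        "SELECT idx, change FROM t WHERE name = ?", (name,)
    ).fetchall()
    counts[name] = len(group)  # placeholder for the real filter logic

conn.close()
print(counts)
```

An index on `name` (`CREATE INDEX ... ON t(name)`) would keep the per-group query fast on a 10G table.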