How to read and process a file in Python that is too big for memory?
I have a CSV file that looks like this:
1,2,0.2
1,3,0.4
2,1,0.5
2,3,0.8
3,1,0.1
3,2,0.6
The first column corresponds to user_a, the second to user_b, and the third to score. For every user_a, I want to find the user_b value that maximizes the score. For this example, the output should look like this (output in the form of a dictionary is preferable but not required):
1 3 0.4
2 3 0.8
3 2 0.6
The problem is that the file is very big (millions of rows), and I am trying to find a way to do this without running out of memory. Because of the environment setup, I cannot use Pandas, Dask, or other dataframe packages.
I wrote code to read the large file line by line:
def read_large_data(file_name, sep=","):
    for line in open(file_name, "r"):
        a, b, s = line.rstrip().split(sep)
        yield a, b, s
And code to find the max score:
def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    read_file_gen = read_large_data(file_name, sep)
    while True:
        try:
            a, b, s = next(read_file_gen)
            if a not in result["User_A"]:
                result["User_A"].append(a)
                result["User_B"].append(b)
                result["Score"].append(s)
            else:
                ind = result["User_A"].index(a)
                if s > result["Score"][ind]:
                    result["User_B"][ind] = b
                    result["Score"][ind] = s
        except StopIteration:
            break
    return result
I used a generator (yield) to keep the memory needed for the computation low, but I still get an out-of-memory error. Any advice on how to reduce memory consumption would be highly appreciated.
In a comment you say: "Yes, correct, the real data is ordered pairs and sorted". So why can't you just do the following:
import csv
from itertools import groupby
from operator import itemgetter

def max_key(row):
    # Compare scores as numbers, not as strings.
    return float(row[2])

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    with open(file_name, "r") as file:
        reader = csv.reader(file, delimiter=sep)
        # The rows are sorted by user_a, so groupby yields one group per user.
        for user_a, rows in groupby(reader, key=itemgetter(0)):
            _, user_b, score = max(rows, key=max_key)
            result["User_A"].append(user_a)
            result["User_B"].append(user_b)
            result["Score"].append(score)
    return result
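If the file were not sorted by user_a, a single pass with a dictionary keyed on user_a should also work. Memory then grows with the number of distinct users rather than with the number of rows, and the lookup per row is constant time instead of a list scan. Below is a minimal sketch of that idea, not a tested drop-in: the function name find_max_score_unsorted and the {user_a: (user_b, score)} result shape are my own assumptions, chosen because the question says a dictionary is preferable. Scores are converted to float before comparing, since comparing the raw strings would order them lexicographically.

```python
import csv


def find_max_score_unsorted(file_name, sep=","):
    """Return {user_a: (user_b, score)} with the best-scoring row per user_a.

    Hypothetical helper for illustration; it does not rely on the input order.
    """
    best = {}
    with open(file_name, "r", newline="") as file:
        for row in csv.reader(file, delimiter=sep):
            if not row:
                continue  # skip blank lines
            user_a, user_b, score = row[0], row[1], float(row[2])
            # Keep this row if user_a is new or the score beats the stored one.
            if user_a not in best or score > best[user_a][1]:
                best[user_a] = (user_b, score)
    return best
```

On ties this keeps the first row seen for a user, because the comparison is strict.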