
How to read and process a file in Python that is too big for memory?

I have a CSV file that looks like:

1,2,0.2
1,3,0.4
2,1,0.5
2,3,0.8
3,1,0.1
3,2,0.6

The first column corresponds to user_a, the second to user_b, and the third to score. For every user_a, I want to find the user_b value that maximizes the score. For this example the output should look like (output as a dictionary is preferable but not required):

1 3 0.4
2 3 0.8
3 2 0.6

The problem is that the file is very big (millions of rows), and I am trying to find a way to do it without an out-of-memory error. Because of the environment setup I cannot use Pandas, Dask, or other packages with dataframes.

I created code to read the large file line by line:

def read_large_data(file_name, sep=","):
    # Open inside a with-block so the file handle is closed when the
    # generator is exhausted, and yield one parsed row at a time.
    with open(file_name, "r") as file:
        for line in file:
            a, b, s = line.rstrip().split(sep)
            yield a, b, s

And the code for finding the max score:

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    read_file_gen = read_large_data(file_name, sep)
    while True:
        try:
            a, b, s = next(read_file_gen)
            if a not in result["User_A"]:
                result["User_A"].append(a)
                result["User_B"].append(b)
                result["Score"].append(s)
            else:
                ind = result["User_A"].index(a)
                if s > result["Score"][ind]:
                    result["User_B"][ind] = b
                    result["Score"][ind] = s
        except StopIteration:
            break
    return result

I used a generator (yield) to limit the memory needed for the computation, but I still get an out-of-memory error. Any advice on how to reduce memory consumption would be highly appreciated.
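First, a note on your own code: the three parallel lists force a linear result["User_A"].index(a) scan for every row, and s > result["Score"][ind] compares the scores as strings, not numbers. Both problems go away if you keep a single best (user_b, score) pair per user in a plain dict. A minimal sketch (the name best is mine, not from your code):

```python
def find_max_score(file_name, sep=","):
    # Map each user_a to the best (user_b, score) pair seen so far.
    best = {}
    with open(file_name, "r") as file:
        for line in file:
            a, b, s = line.rstrip().split(sep)
            score = float(s)  # compare numerically, not lexicographically
            if a not in best or score > best[a][1]:
                best[a] = (b, score)
    return best
```

Memory here is proportional to the number of distinct users rather than the number of rows, and each lookup is O(1) instead of the O(n) cost of list.index.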

In a comment you say "Yes, correct, the real data is ordered pairs and sorted". So why can't you just do the following:

import csv
from itertools import groupby
from operator import itemgetter

def max_key(row): return float(row[2])

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    with open(file_name, "r") as file:
        reader = csv.reader(file, delimiter=sep)
        for user_a, rows in groupby(reader, key=itemgetter(0)):
            _, user_b, score = max(rows, key=max_key)
            result["User_A"].append(user_a)
            result["User_B"].append(user_b)
            result["Score"].append(score)
    return result
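Since the file is sorted by user_a, itertools.groupby yields one group of rows per user, and max only ever holds one group in memory at a time. Running it on the sample data from the question (written to a temporary file here for the demonstration):

```python
import csv
import tempfile
from itertools import groupby
from operator import itemgetter

def max_key(row):
    return float(row[2])

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    with open(file_name, "r") as file:
        reader = csv.reader(file, delimiter=sep)
        # groupby relies on the file being sorted by the first column.
        for user_a, rows in groupby(reader, key=itemgetter(0)):
            _, user_b, score = max(rows, key=max_key)
            result["User_A"].append(user_a)
            result["User_B"].append(user_b)
            result["Score"].append(score)
    return result

# Sample data from the question, written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("1,2,0.2\n1,3,0.4\n2,1,0.5\n2,3,0.8\n3,1,0.1\n3,2,0.6\n")
    path = f.name

result = find_max_score(path)
print(result)
# {'User_A': ['1', '2', '3'], 'User_B': ['3', '3', '2'], 'Score': ['0.4', '0.8', '0.6']}
```

Note that the values stay as strings, exactly as they appear in the file; max_key converts to float only for the comparison.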
