
How to read and process a file in Python that is too big for memory?

I have a CSV file that looks like:

1,2,0.2
1,3,0.4
2,1,0.5
2,3,0.8
3,1,0.1
3,2,0.6

The first column corresponds to user_a, the second to user_b, and the third to score. For every user_a, I want to find the user_b value that maximizes the score. For this example the output should look like (output as a dictionary is preferable but not required):

1 3 0.4
2 3 0.8
3 2 0.6

The problem is that the file is very big (millions of rows), and I am trying to find a way to do it without an out-of-memory error. Because of the environment setup I cannot use Pandas, Dask, or other packages with dataframes.

I created code to read the large file line by line:

def read_large_data(file_name, sep=","):
    # Open inside a with-block so the file handle is closed when the
    # generator is exhausted, and yield one parsed row at a time.
    with open(file_name, "r") as file:
        for line in file:
            a, b, s = line.rstrip().split(sep)
            yield a, b, s

And the code for finding the max score:

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    read_file_gen = read_large_data(file_name, sep)
    while True:
        try:
            a, b, s = next(read_file_gen)
            if a not in result["User_A"]:
                result["User_A"].append(a)
                result["User_B"].append(b)
                result["Score"].append(s)
            else:
                ind = result["User_A"].index(a)
                if s > result["Score"][ind]:
                    result["User_B"][ind] = b
                    result["Score"][ind] = s
        except StopIteration:
            break
    return result

I used a generator (yield) to limit the memory needed for the computation, but I still get an out-of-memory error. Any advice on how to reduce memory consumption would be highly appreciated.
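First, a note on your own code: the three parallel lists force a linear result["User_A"].index(a) scan for every row, and s > result["Score"][ind] compares the scores as strings, not numbers. Both problems go away if you keep a single best (user_b, score) pair per user in a plain dict. A minimal sketch (the name best is mine, not from your code):

```python
def find_max_score(file_name, sep=","):
    # Map each user_a to the best (user_b, score) pair seen so far.
    best = {}
    with open(file_name, "r") as file:
        for line in file:
            a, b, s = line.rstrip().split(sep)
            score = float(s)  # compare numerically, not lexicographically
            if a not in best or score > best[a][1]:
                best[a] = (b, score)
    return best
```

Memory here is proportional to the number of distinct users rather than the number of rows, and each lookup is O(1) instead of the O(n) cost of list.index.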

In a comment you say "Yes, correct, the real data is ordered pairs and sorted". So why can't you just do the following:

import csv
from itertools import groupby
from operator import itemgetter

def max_key(row): return float(row[2])

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    with open(file_name, "r") as file:
        reader = csv.reader(file, delimiter=sep)
        for user_a, rows in groupby(reader, key=itemgetter(0)):
            _, user_b, score = max(rows, key=max_key)
            result["User_A"].append(user_a)
            result["User_B"].append(user_b)
            result["Score"].append(score)
    return result
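Since the file is sorted by user_a, itertools.groupby yields one group of rows per user, and max only ever holds one group in memory at a time. Running it on the sample data from the question (written to a temporary file here for the demonstration):

```python
import csv
import tempfile
from itertools import groupby
from operator import itemgetter

def max_key(row):
    return float(row[2])

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    with open(file_name, "r") as file:
        reader = csv.reader(file, delimiter=sep)
        # groupby relies on the file being sorted by the first column.
        for user_a, rows in groupby(reader, key=itemgetter(0)):
            _, user_b, score = max(rows, key=max_key)
            result["User_A"].append(user_a)
            result["User_B"].append(user_b)
            result["Score"].append(score)
    return result

# Sample data from the question, written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("1,2,0.2\n1,3,0.4\n2,1,0.5\n2,3,0.8\n3,1,0.1\n3,2,0.6\n")
    path = f.name

result = find_max_score(path)
print(result)
# {'User_A': ['1', '2', '3'], 'User_B': ['3', '3', '2'], 'Score': ['0.4', '0.8', '0.6']}
```

Note that the values stay as strings, exactly as they appear in the file; max_key converts to float only for the comparison.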
