
Big files and python

I have a large CSV file (300+ GB) with data (time, OS, ID). What should I do to count the IDs that occur more than once in this file? Which algorithm will not overflow memory?

A simple for loop, reading the file line by line, should do it:

result_set = set()
with open(filename, "r") as input_file:
    for line in input_file:
        # Expected format: something, something, important
        fields = line.split(",")
        result_set.add(fields[-1].strip())
print(result_set)
# If the file is
# ---
# random, random, important1
# random, random, important2
# ---
# this prints:
# {'important2', 'important1'}

Unlike readlines(), this does not load the whole file into memory. It will take its sweet time, but it won't crash.
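
Note that the snippet above only collects the distinct IDs; it does not tell you which ones occur more than once, which is what the question asks for. Below is a minimal sketch of that using collections.Counter; like the answer above, it assumes filename is defined and that the ID is the last comma-separated field.

from collections import Counter

counts = Counter()
with open(filename, "r") as input_file:
    for line in input_file:
        # Count each ID, assumed to be the last comma-separated field
        counts[line.split(",")[-1].strip()] += 1

# IDs that occur more than once
duplicates = [id_ for id_, n in counts.items() if n > 1]
print(duplicates)

Like the set, the Counter grows with the number of distinct IDs rather than with the file size, so it streams through the 300+ GB file without loading it into memory; it only becomes a problem if the distinct IDs themselves do not fit in RAM.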
