
Big files and python

I have a large CSV file (300+ GB) with data (time, OS, ID). What should I do to count the IDs that occur more than once in this file? Which algorithm will not overflow memory?

A simple for loop, reading the file line by line, should do it:

result_set = set()
with open(filename, "r") as input_file:
    for line in input_file:
        # Expected format: something, something, important
        fields = line.split(",")
        result_set.add(fields[-1].strip())
print(result_set)
# If the file is
# ---
# random, random, important1
# random, random, important2
# ---
# this prints:
# {'important2', 'important1'}

Unlike readlines(), this does not load the whole file into memory. It will take its sweet time, but it won't crash.
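
Note that the snippet above only collects the distinct IDs; it does not tell you which ones occur more than once, which is what the question asks for. Below is a minimal sketch of that using collections.Counter; like the answer above, it assumes filename is defined and that the ID is the last comma-separated field.

from collections import Counter

counts = Counter()
with open(filename, "r") as input_file:
    for line in input_file:
        # Count each ID, assumed to be the last comma-separated field
        counts[line.split(",")[-1].strip()] += 1

# IDs that occur more than once
duplicates = [id_ for id_, n in counts.items() if n > 1]
print(duplicates)

Like the set, the Counter grows with the number of distinct IDs rather than with the file size, so it streams through the 300+ GB file without loading it into memory; it only becomes a problem if the distinct IDs themselves do not fit in RAM.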
