
Count of unique column values from large CSV file using Python or PHP

I have a 217 GB CSV file. How can I get the count of unique column values using a Python or PHP script without it timing out?

Not sure what you mean by timeout; for big files like this, it will always take a long time.

tokens = {}
with open("your.csv") as infile:
    for line in infile:
        columns = line.split(',')
        # Where idx is your desired column index
        if columns[idx] not in tokens:
            tokens[columns[idx]] = 1   # first occurrence counts as 1
        else:
            tokens[columns[idx]] += 1

print(tokens)

This loads the file line by line, so your computer doesn't crash from loading the whole 217 GB into RAM. You can try this first to see if the dictionary fits in your computer's memory. Otherwise, you might want to consider splitting the file into smaller chunks in a divide-and-conquer approach.
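The same line-by-line approach can be written with the `csv` module and `collections.Counter`, which also handles quoted fields that contain commas (a plain `line.split(',')` would mis-split those). A minimal sketch; the path and column index are placeholders:

```python
import csv
from collections import Counter

def count_column_values(path, idx):
    """Stream the file row by row and tally the values in column idx."""
    counts = Counter()
    with open(path, newline="") as infile:
        for row in csv.reader(infile):
            if len(row) > idx:  # skip short or malformed rows
                counts[row[idx]] += 1
    return counts
```

`Counter` is just a dictionary subclass, so memory use is the same as the hand-rolled version: proportional to the number of distinct values, not the file size.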

You could try to increase the field_size_limit :

import csv
csv.field_size_limit(1000000000)

with open('doc.csv', newline='') as f:
    r = csv.reader(f)
    for row in r:
        print(row)  # do the processing
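If "count of unique column values" means the number of distinct values rather than a per-value tally, a set is enough, and it stores each value only once. A sketch under that assumption, with hypothetical path and column index:

```python
import csv

def distinct_count(path, idx):
    """Return how many distinct values appear in column idx."""
    seen = set()
    with open(path, newline="") as infile:
        for row in csv.reader(infile):
            if len(row) > idx:  # skip short or malformed rows
                seen.add(row[idx])
    return len(seen)
```

This still reads the file once, line by line, so it works on files far larger than RAM as long as the set of distinct values itself fits in memory.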
