
Count of unique column values from large CSV file using Python or PHP

I have a 217 GB CSV file. How can I get the count of unique column values using a Python or PHP script without it timing out?

Not sure what you mean by timeout; for big files like this, it will always take a long time.

tokens = {}
with open("your.csv") as infile:
    for line in infile:
        columns = line.split(',')
        # Where idx is your desired column index
        if columns[idx] not in tokens:
            tokens[columns[idx]] = 1   # first occurrence counts as 1
        else:
            tokens[columns[idx]] += 1

print(tokens)

This loads the file line by line, so your computer doesn't crash from loading the whole 217 GB into RAM. You can try this first to see if the dictionary fits in your computer's memory. Otherwise, you might want to consider splitting the file into smaller chunks in a divide-and-conquer approach.
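The same line-by-line approach can be written with the `csv` module and `collections.Counter`, which also handles quoted fields that contain commas (a plain `line.split(',')` would mis-split those). A minimal sketch; the path and column index are placeholders:

```python
import csv
from collections import Counter

def count_column_values(path, idx):
    """Stream the file row by row and tally the values in column idx."""
    counts = Counter()
    with open(path, newline="") as infile:
        for row in csv.reader(infile):
            if len(row) > idx:  # skip short or malformed rows
                counts[row[idx]] += 1
    return counts
```

`Counter` is just a dictionary subclass, so memory use is the same as the hand-rolled version: proportional to the number of distinct values, not the file size.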

You could try to increase the field_size_limit :

import csv
csv.field_size_limit(1000000000)

with open('doc.csv', newline='') as f:
    r = csv.reader(f)
    for row in r:
        print(row)  # do the processing
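If "count of unique column values" means the number of distinct values rather than a per-value tally, a set is enough, and it stores each value only once. A sketch under that assumption, with hypothetical path and column index:

```python
import csv

def distinct_count(path, idx):
    """Return how many distinct values appear in column idx."""
    seen = set()
    with open(path, newline="") as infile:
        for row in csv.reader(infile):
            if len(row) > idx:  # skip short or malformed rows
                seen.add(row[idx])
    return len(seen)
```

This still reads the file once, line by line, so it works on files far larger than RAM as long as the set of distinct values itself fits in memory.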
