Efficiently reading data from a huge CSV file

Question

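I am trying to process a huge CSV file (larger than 20G), but the process was killed when the whole CSV file was read into memory. To avoid this problem, I am trying to read the second column line by line.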

For example, the second column contains data such as

xxx, computer is good

xxx, build algorithm

import collections

wordcount = collections.Counter()
with open('desc.csv', 'rb') as infile:
    for line in infile:
        wordcount.update(line.split())

My code works on all of the columns. How can I read only the second column without using the CSV reader?

Answer 1

As far as I know, calling csv.reader(infile) opens and reads the whole file... which is where the problem lies.

You can read line by line and parse manually:

X = set()  # unique words seen so far

with open('desc.csv', 'r') as infile:
    for line in infile:
        # Split on comma first
        cols = [x.strip() for x in line.split(',')]

        # Skip lines that have no second column
        if len(cols) < 2:
            continue

        # Grab 2nd "column"
        col2 = cols[1]

        # Split on whitespace and record each unique word
        for word in col2.split():
            X.add(word)

for w in X:
    print(w)

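This keeps only a small chunk of the file (one line) in memory at any given time. However, the variable X may still grow very large, in which case the program could fail because of memory limits. It depends on how many unique words are in your "vocabulary" list.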
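A minimal sketch of the same line-by-line idea, written as a generator so that only one line is held at a time and feeding a Counter as in the question. The helper names iter_second_column and count_words are hypothetical, and the plain split(',') assumes no field contains a quoted comma.

```python
import collections
import io


def iter_second_column(lines):
    """Yield the second comma-separated field of each line, one line at a time."""
    for line in lines:
        # maxsplit=2 stops after the second comma; the rest of the line is not split
        parts = line.rstrip("\r\n").split(",", 2)
        if len(parts) > 1:  # skip lines without a second column
            yield parts[1].strip()


def count_words(lines):
    """Count whitespace-separated words found in the second column."""
    counts = collections.Counter()
    for field in iter_second_column(lines):
        counts.update(field.split())
    return counts


# Small in-memory stand-in for the real file; with a real file you would pass
# the object returned by open('desc.csv', newline='') instead.
sample = io.StringIO("xxx, computer is good\nxxx, build algorithm\n")
wordcount = count_words(sample)
print(wordcount.most_common(3))
```

Memory use here is bounded by the number of distinct words rather than by the file size, which is the same caveat as for X above.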

Answer 2

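The code in your question appears to read the 20G file, split each line into whitespace-delimited tokens, and then build a counter that keeps a count of every unique token. I would guess that is where your memory is going.

According to the manual, csv.reader is an iterator:

"a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called" (the method is named __next__() in Python 3)

So it is fine to iterate over a huge file using csv.reader.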

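import collections
import csv

wordcount = collections.Counter()

# newline='' lets the csv module handle line endings itself (Python 3)
with open('desc.csv', newline='') as infile:
    for row in csv.reader(infile):
        # count words in strings from second column
        if len(row) > 1:
            wordcount.update(row[1].split())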
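One way to check that csv.reader pulls its input lazily is to wrap the input in a generator that records every line it hands out. This is only a sketch: logged_lines and the sample rows are made up for illustration. It also shows a quoted comma staying inside one field, which a plain split(',') would not handle.

```python
import csv


def logged_lines(lines, log):
    """Yield lines unchanged, recording each one as it is requested."""
    for line in lines:
        log.append(line)
        yield line


# Hypothetical sample rows; the second one has a comma inside a quoted field.
data = ['id,desc\n', '1,"fast, cheap computer"\n', '2,build algorithm\n']
consumed = []
reader = csv.reader(logged_lines(data, consumed))

before = len(consumed)       # lines requested before any row is read
first = next(reader)         # the reader asks for input only when a row is needed
after_first = len(consumed)  # lines requested after one row
second = next(reader)        # the quoted comma does not split the field

print(before, after_first, first, second)
```

A real file object still reads from disk in buffered blocks, but that buffer has a fixed size and does not grow with the file.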

Answer 1
1 2016-10-14 18:11:48

Answer 2
1 2017-08-16 01:18:23
