
Read data from a huge CSV file efficiently

I was trying to process my huge CSV file (more than 20 GB), but the process was killed when reading the whole CSV file into memory. To avoid this issue, I am trying to read the second column line by line.

For example, the 2nd column contains data like:

  1. xxx, computer is good
  2. xxx, build algorithm

    import collections

    wordcount = collections.Counter()
    with open('desc.csv', 'rb') as infile:
        for line in infile:
            wordcount.update(line.split())

My code works on whole lines; how can I read only the second column, without using the CSV reader?

As far as I know, calling csv.reader(infile) opens and reads the whole file... which is where your problem lies.

You can just read line-by-line and parse manually:

X = []

with open('desc.csv', 'r') as infile:
    for line in infile:
        # Split on comma first
        cols = [x.strip() for x in line.split(',')]

        # Grab 2nd "column"
        col2 = cols[1]

        # Split on whitespace and collect unique words
        for word in col2.split():
            if word not in X:
                X.append(word)

for w in X:
    print(w)

That keeps only a small chunk of the file in memory at any given time (one line). However, you may still run into problems with the list X growing so large that the program errors out due to memory limits. It depends on how many unique words are in your "vocabulary".
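As a side note, if you only need the unique words (not counts), a set makes each membership check O(1) instead of scanning a growing list. A minimal sketch of the same idea; the sample rows written to desc.csv below are made up for illustration:

```python
# Create a tiny stand-in for desc.csv (hypothetical sample rows)
with open('desc.csv', 'w') as f:
    f.write('1, computer is good\n')
    f.write('2, build algorithm\n')

unique_words = set()
with open('desc.csv', 'r') as infile:
    for line in infile:
        # Take the 2nd comma-separated column, then split it on whitespace
        cols = [x.strip() for x in line.split(',')]
        unique_words.update(cols[1].split())

for w in sorted(unique_words):
    print(w)
```

This still grows with the vocabulary size, but each lookup is constant-time rather than a linear scan.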

It looks like the code in your question reads the 20 GB file, splits each line into space-separated tokens, then creates a counter that keeps a count of every unique token. I'd say that is where your memory is going.

From the manual, csv.reader is an iterator:

a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

so it is fine to iterate through a huge file using csv.reader:

import collections
import csv

wordcount = collections.Counter()

with open('desc.csv', 'r', newline='') as infile:
    for row in csv.reader(infile):
        # count words in strings from second column
        wordcount.update(row[1].split())
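Once counting finishes, Counter exposes most_common() to report the highest-frequency words. A small self-contained sketch; the sample rows fed through io.StringIO are made up, standing in for desc.csv:

```python
import collections
import csv
import io

# Hypothetical sample rows standing in for desc.csv
sample = io.StringIO('1,computer is good\n2,build good algorithm\n')

wordcount = collections.Counter()
for row in csv.reader(sample):
    # count words in strings from the second column
    wordcount.update(row[1].split())

print(wordcount.most_common(3))  # highest counts first
```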
