简体   繁体   中英

read data from a huge CSV file efficiently

I was trying to process my huge CSV file (more than 20G), but the process was killed when reading the whole CSV file into memory. To avoid this issue, I am trying to read the second column line by line.

For example, the 2nd column contains data like

  1. xxx, computer is good
  2. xxx, build algorithm

     import collections wordcount = collections.Counter() with open('desc.csv', 'rb') as infile: for line in infile: wordcount.update(line.split()) 

My code is working for the whole columns, how to only read the second column without using CSV reader?

As far as I know, calling csv.reader(infile) opens and reads the whole file...which is where your problem lies.

You can just read line-by-line and parse manually:

X=[]

with open('desc.csv', 'r') as infile:    
   for line in infile:
      # Split on comma first
      cols = [x.strip() for x in line.split(',')]

      # Grab 2nd "column"
      col2 = cols[1]

      # Split on spaces
      words = [x.strip() for x in col2.split(' ')]
      for word in words:     
         if word not in X:
            X.append(word)

for w in X:
   print w

That will keep a smaller chunk of the file in memory at a given time (one line). However, you may still potentially have problems with variable X increasing to quite a large size, such that the program will error out due to memory limits. Depends how many unique words are in your "vocabulary" list

It looks like the code in your question is reading the 20G file and splitting each line into space separated tokens then creating a counter that keeps a count of every unique token. I'd say that is where your memory is going.

From the manual csv.reader is an iterator

a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

so it is fine to iterate through a huge file using csv.reader .

import collections

wordcount = collections.Counter()

with open('desc.csv', 'rb') as infile:
    for row in csv.reader(infile):
        # count words in strings from second column
        wordcount.update(row[1].split())

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM