
How to sample a very big CSV file (6GB)

There is a big CSV file (with the first line as a header). I want to split it into 100 pieces (by line_num % 100, for example). How can I do that efficiently under a main-memory constraint?

Separate the file into 100 smaller ones: every 1st line of each group of 100 goes to sub-file 1, every 2nd line to sub-file 2, ..., every 100th line to sub-file 100. That gives 100 files of roughly 60 MB each.

I do not want just 100 lines, or a single sample that is 1/100 of the file.

I tried to do it like this:

fi = [open('split_data/%d.csv' % i, 'w') for i in range(100)]
i = 0
with open('data/train.csv') as fin:
    first = fin.readline()       # skip the header line
    for line in fin:
        fi[i % 100].write(line)  # line number mod 100 selects the output file
        i = i + 1
for i in range(100):
    fi[i].close()

But the file seems too big to process with limited memory. How should I deal with it? I want to do it in a single pass.

(My code actually works, but it consumes too much time and I mistakenly thought it had crashed, sorry about that.)
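
If the snippet above is slow rather than memory-bound, one possible tweak is to give each output file a larger write buffer so fewer small writes hit the disk. This is only a sketch, assuming the bottleneck is the many line-by-line writes to 100 handles; the 1 MB buffer size and the paths are illustrative:

BUF = 1 << 20  # 1 MB write buffer per output file (~100 MB in total) -- illustrative value
fi = [open('split_data/%d.csv' % i, 'w', BUF) for i in range(100)]
with open('data/train.csv') as fin:
    fin.readline()                  # skip the header
    for i, line in enumerate(fin):
        fi[i % 100].write(line)     # same modulus scheme as above
for f in fi:
    f.close()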

To split a file into 100 parts as stated in the comments ("I want to split the file into 100 parts in a modulus'ing way, i.e. range(200) --> [0, 100]; [1, 101]; [2, 102]; ... and yes, separate a big one into a hundred smaller files"):

import csv

files = [open('part_{}'.format(n), 'wb') for n in xrange(100)]  # one output file per remainder
csvouts = [csv.writer(f) for f in files]
with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    next(csvin, None) # Skip header
    for rowno, row in enumerate(csvin):
        csvouts[rowno % 100].writerow(row)  # row number mod 100 picks the writer

for f in files:
    f.close()

You can islice over the file with a step instead of modulus'ing the line number, e.g.:

import csv
from itertools import islice

with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    # Skip header, and then return every 100th until file ends
    for line in islice(csvin, 1, None, 100):
        print line  # do something with line

Example:

r = xrange(1000)
res = list(islice(r, 1, None, 100))
# [1, 101, 201, 301, 401, 501, 601, 701, 801, 901]
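
This gives you one of the 100 interleaved sub-samples. If you want the k-th sub-sample of the question's modulus scheme rather than just the first one, you can shift the start of the slice. A small sketch (the helper name and the k parameter are just illustrative), at the cost of one pass over the file per sub-sample:

import csv
from itertools import islice

def every_100th(path, k):
    # yield data rows k, k+100, k+200, ... (0-based, header excluded)
    with open(path) as fin:
        csvin = csv.reader(fin)
        # index 0 is the header, so data row k sits at index k + 1
        for row in islice(csvin, k + 1, None, 100):
            yield row

# e.g. the rows that would land in sub-file 3
for row in every_100th('yourcsv', 3):
    print row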

Based on @Jon Clements' answer, I would also benchmark this variation:

import csv
from itertools import islice

with open('in.csv') as fin:
  first = fin.readline() # discard the header
  csvin = csv.reader( islice(fin, None, None, 100) )  # this line is the only difference
  for row in csvin:
    print row # do something with row

If you only want 100 sample rows, you can use this approach, which makes just 100 reads at equally spaced locations within the file. This should work well for CSV files whose line lengths are essentially uniform.

import csv
import os

def sample100(path):
  with open(path) as fin:
    end = os.fstat(fin.fileno()).st_size
    fin.readline()              # skip the first line
    start = fin.tell()
    step = (end - start) / 100
    offset = start
    while offset < end:
      fin.seek(offset)
      fin.readline()            # this might not be a complete line
      if fin.tell() < end:
        yield fin.readline()    # this is a complete non-empty line
      else:
        break                   # not really necessary...
      offset = offset + step

for row in csv.reader( sample100('in.csv') ):
  print row  # do something with each sampled row

I think you can just open the same file 10 times and then read from each handle independently, effectively splitting it into sub-files without actually creating them.

Unfortunately, this requires knowing in advance how many rows there are in the file, and that requires reading the whole thing once to count them. On the other hand, this should be relatively quick since no other processing takes place.

To illustrate and test this approach, I created a much smaller CSV test file with only one item per row, which looked something like this (the first line is the header row and is not counted):

line_no
1
2
3
4
5
...
9995
9996
9997
9998
9999
10000

Here's the code and sample output:

import csv

# count number of rows in csv file
# (this requires reading the whole file)
file_name = 'mycsvfile.csv'
with open(file_name, 'rb') as csv_file:
    for num_rows, _ in enumerate(csv.reader(csv_file)): pass
rows_per_section = num_rows // 10

print 'number of rows: {:,d}'.format(num_rows)
print 'rows per section: {:,d}'.format(rows_per_section)

csv_files = [open(file_name, 'rb') for _ in xrange(10)]
csv_readers = [csv.reader(f) for f in csv_files]
map(next, csv_readers)  # skip header

# position each file handle at its starting position in file
for i in xrange(10):
    for j in xrange(i * rows_per_section):
        try:
            next(csv_readers[i])
        except StopIteration:
            pass

# read rows from each of the sections
for i in xrange(rows_per_section):
    # elements are one row from each section
    rows = [next(r) for r in csv_readers]
    print rows  # show what was read

# clean up
for i in xrange(10):
    csv_files[i].close()

Output:

number of rows: 10,000
rows per section: 1,000
[['1'], ['1001'], ['2001'], ['3001'], ['4001'], ['5001'], ['6001'], ['7001'], ['8001'], ['9001']]
[['2'], ['1002'], ['2002'], ['3002'], ['4002'], ['5002'], ['6002'], ['7002'], ['8002'], ['9002']]
...
[['998'], ['1998'], ['2998'], ['3998'], ['4998'], ['5998'], ['6998'], ['7998'], ['8998'], ['9998']]
[['999'], ['1999'], ['2999'], ['3999'], ['4999'], ['5999'], ['6999'], ['7999'], ['8999'], ['9999']]
[['1000'], ['2000'], ['3000'], ['4000'], ['5000'], ['6000'], ['7000'], ['8000'], ['9000'], ['10000']]
