There is a big CSV file (with the first line as a header). I want to split it into 100 pieces (by line_num % 100, for example). How can I do that efficiently under a main-memory constraint?
That is, separate the file into 100 smaller ones: every line with index % 100 == 0 goes to sub-file 0, every line with index % 100 == 1 to sub-file 1, ..., every line with index % 100 == 99 to sub-file 99, giving 100 files of about 600 MB each.
Not 100 lines, and not a 1/100-size sample.
I tried to execute it like this:
fi = [open('split_data/%d.csv' % i, 'w') for i in range(100)]
i = 0
with open('data/train.csv') as fin:
    first = fin.readline()  # skip the header
    for line in fin:
        fi[i % 100].write(line)
        i = i + 1
for i in range(100):
    fi[i].close()
But the file is too big to process with limited memory. How should I deal with it? I want to do it in one pass.
(My code works, but it consumes a lot of time and I mistakenly thought it had crashed; sorry about that.)
To split a file into 100 parts as stated in the comments (I want to split the file into 100 parts by taking the row number modulo 100, i.e. range(200) --> [0, 100]; [1, 101]; [2, 102]; and yes, separate one big file into hundreds of smaller files):
import csv

files = [open('part_{}'.format(n), 'wb') for n in xrange(100)]
csvouts = [csv.writer(f) for f in files]
with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    next(csvin, None)  # skip the header
    for rowno, row in enumerate(csvin):
        csvouts[rowno % 100].writerow(row)
for f in files:
    f.close()
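For what it's worth, here is a Python 3 sketch of the same round-robin split. The function name, output file layout, and `n_parts` parameter are my own choices, not from the answer above; note that in Python 3 the CSV files should be opened in text mode with `newline=''`:

```python
import csv
import os

def split_round_robin(src, n_parts=100, out_dir="."):
    """Split src into n_parts CSV files, sending data row i to part i % n_parts.
    Only one row is held in memory at a time."""
    outs = [open(os.path.join(out_dir, "part_%d.csv" % i), "w", newline="")
            for i in range(n_parts)]
    writers = [csv.writer(f) for f in outs]
    try:
        with open(src, newline="") as fin:
            reader = csv.reader(fin)
            next(reader, None)  # skip the header row
            for rowno, row in enumerate(reader):
                writers[rowno % n_parts].writerow(row)
    finally:
        for f in outs:
            f.close()
```

Memory stays constant regardless of file size, since only one input row and 100 open file handles are held at a time.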
You can islice over the file with a step instead of taking the line number modulo 100, e.g.:
import csv
from itertools import islice

with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    # Skip the header, then return every 100th row until the file ends
    for line in islice(csvin, 1, None, 100):
        # do something with line
Example:
r = xrange(1000)
res = list(islice(r, 1, None, 100))
# [1, 101, 201, 301, 401, 501, 601, 701, 801, 901]
Based on @Jon Clements' answer, I would also benchmark this variation:
import csv
from itertools import islice

with open('in.csv') as fin:
    first = fin.readline()  # discard the header
    csvin = csv.reader(islice(fin, None, None, 100))  # this line is the only difference
    for row in csvin:
        print row  # do something with row
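The same islice-with-a-step trick can also produce one of the 100 sub-files per pass. A hedged Python 3 sketch (`extract_part` and its parameters are illustrative, not from the answers above):

```python
import csv
from itertools import islice

def extract_part(src, dst, part, n_parts=100):
    """Write every n_parts-th data row of src (starting at row `part`) to dst.
    One full pass over the input per part; memory use stays O(1)."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        next(reader, None)  # skip the header
        writer = csv.writer(fout)
        # islice start=part, step=n_parts picks rows part, part+n_parts, ...
        for row in islice(reader, part, None, n_parts):
            writer.writerow(row)
```

This trades one pass per sub-file for not needing 100 file handles open at once.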
If you only want 100 samples, you can use this idea, which just makes 100 reads at equally spaced offsets within the file. It should work well for CSV files whose line lengths are essentially uniform.
import csv
import os

def sample100(path):
    with open(path) as fin:
        end = os.fstat(fin.fileno()).st_size
        fin.readline()  # skip the first line (header)
        start = fin.tell()
        step = (end - start) / 100
        offset = start
        while offset < end:
            fin.seek(offset)
            fin.readline()  # this might not be a complete line
            if fin.tell() < end:
                yield fin.readline()  # this is a complete, non-empty line
            else:
                break  # not really necessary...
            offset = offset + step

for row in csv.reader(sample100('in.csv')):
    # do something with row
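A Python 3 variant of the same seek-and-discard idea, as a sketch (the name `sample_n` is my own; binary mode is used because text-mode seek to arbitrary offsets is not guaranteed, and integer division replaces Python 2's `/`):

```python
import os

def sample_n(path, n=100):
    """Yield up to n lines taken at roughly equally spaced byte offsets.
    Assumes line lengths are roughly uniform; duplicates are possible if
    the step is smaller than a line."""
    with open(path, "rb") as fin:
        end = os.fstat(fin.fileno()).st_size
        fin.readline()                     # skip the header line
        start = fin.tell()
        step = max((end - start) // n, 1)  # integer byte step
        for offset in range(start, end, step):
            fin.seek(offset)
            fin.readline()                 # discard a possibly partial line
            if fin.tell() < end:
                yield fin.readline().decode()  # a complete line
```

Because it reads only ~2 lines per sample, the cost is independent of the file size.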
I think you can just open the same file ten times and then read each handle independently, effectively splitting it into sub-files without actually creating them.
Unfortunately, this requires knowing in advance how many rows the file has, and that requires reading the whole thing once to count them. On the other hand, the counting pass should be relatively quick since no other processing takes place.
To illustrate and test this approach, I created a simpler (only one item per row) and much smaller CSV test file that looked something like this (the first line is the header row and is not counted):
line_no
1
2
3
4
5
...
9995
9996
9997
9998
9999
10000
Here's the code and sample output:
import csv

# count the number of rows in the csv file
# (this requires reading the whole file once)
file_name = 'mycsvfile.csv'
with open(file_name, 'rb') as csv_file:
    for num_rows, _ in enumerate(csv.reader(csv_file)): pass
rows_per_section = num_rows // 10

print 'number of rows: {:,d}'.format(num_rows)
print 'rows per section: {:,d}'.format(rows_per_section)

csv_files = [open(file_name, 'rb') for _ in xrange(10)]
csv_readers = [csv.reader(f) for f in csv_files]
map(next, csv_readers)  # skip header row in each

# position each file handle at its starting position in the file
for i in xrange(10):
    for j in xrange(i * rows_per_section):
        try:
            next(csv_readers[i])
        except StopIteration:
            pass

# read rows from each of the sections
for i in xrange(rows_per_section):
    # elements are one row from each section
    rows = [next(r) for r in csv_readers]
    print rows  # show what was read

# clean up
for i in xrange(10):
    csv_files[i].close()
Output:
number of rows: 10,000
rows per section: 1,000
[['1'], ['1001'], ['2001'], ['3001'], ['4001'], ['5001'], ['6001'], ['7001'], ['8001'], ['9001']]
[['2'], ['1002'], ['2002'], ['3002'], ['4002'], ['5002'], ['6002'], ['7002'], ['8002'], ['9002']]
...
[['998'], ['1998'], ['2998'], ['3998'], ['4998'], ['5998'], ['6998'], ['7998'], ['8998'], ['9998']]
[['999'], ['1999'], ['2999'], ['3999'], ['4999'], ['5999'], ['6999'], ['7999'], ['8999'], ['9999']]
[['1000'], ['2000'], ['3000'], ['4000'], ['5000'], ['6000'], ['7000'], ['8000'], ['9000'], ['10000']]
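The handle-positioning step above can be packaged as a small Python 3 helper; this is a sketch (`open_sections` is a hypothetical name, and the caller must still count the rows and close the files):

```python
import csv

def open_sections(path, n_sections, num_rows):
    """Open `path` n_sections times and advance handle i to data row
    i * (num_rows // n_sections). Returns (files, readers); caller closes."""
    per = num_rows // n_sections
    files, readers = [], []
    for i in range(n_sections):
        f = open(path, newline="")
        r = csv.reader(f)
        next(r, None)                 # skip the header row
        for _ in range(i * per):      # advance to this section's first row
            next(r, None)
        files.append(f)
        readers.append(r)
    return files, readers
```

Each reader then yields its section's rows in order, so `[next(r) for r in readers]` produces one row from each section per call, as in the output above.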