
How can I split csv files in python?

Because of a memory error, I have to split my CSV files. I did some research and found this code from a Stack Overflow user, Aziz Alto:

csvfile = open('#', 'r').readlines()
filename = 1
for i in range(len(csvfile)):
    if i % 10000000 == 0:
        open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+10000000])
        filename += 1

It works well, but for the second file the code does not add the header, which is very important for me. My question is: how can I add the header to the second file?

import pandas as pd
rows = pd.read_csv("csvfile.csv", chunksize=5000000)
for i, chunk in enumerate(rows):
    chunk.to_csv('out{}.csv'.format(i))  # i is the chunk number for each iteration

With chunksize you specify how many rows you want per output file; in Excel you can have up to 1,048,576 rows. This will save each file with 5,000,000 rows, and with the header.
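One thing worth noting: by default to_csv also writes the DataFrame's row index as an extra unnamed first column, so the split files gain a column the original did not have. A minimal variant with index=False (same assumed csvfile.csv name as above):

import pandas as pd

# same idea as above; index=False keeps pandas from adding an extra
# index column, and the header is written to every output file by default
for i, chunk in enumerate(pd.read_csv("csvfile.csv", chunksize=5000000)):
    chunk.to_csv('out{}.csv'.format(i), index=False)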

Hope this helps!

From the 2nd file up to the last file, you always have to add the 1st line of your original file (the one containing the header):

# this loads the first file fully into memory
with open('#', 'r') as f:
    csvfile = f.readlines()

linesPerFile = 1000000
filename = 1
# this is better than your former loop: it advances linesPerFile lines at a time
# instead of incrementing by 1 and only writing on every millionth iteration
for i in range(0,len(csvfile),linesPerFile):
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1: # this is the second or a later file, so we need to
            f.write(csvfile[0]) # write the header line again
        f.writelines(csvfile[i:i+linesPerFile])
    filename += 1
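Since the original problem was a memory error, readlines() may itself be too much for a very big file. Here is a rough sketch of the same split done streaming, so only one line is held in memory at a time; 'big.csv' and the output names are just placeholders:

linesPerFile = 1000000
header = None
out = None
filename = 1
lineno = 0

with open('big.csv', 'r') as f:          # 'big.csv' is a placeholder name
    for line in f:
        if header is None:               # remember the header (first line)
            header = line
        if lineno % linesPerFile == 0:   # time to start a new part file
            if out:
                out.close()
            out = open(str(filename) + '.csv', 'w')
            if filename > 1:             # repeat the header from the 2nd file on
                out.write(header)
            filename += 1
        out.write(line)
        lineno += 1

if out:
    out.close()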

Fast csv file splitting

If you have a very big file and you have to try different partitions (say, to find the best way to split it), the above solutions are too slow.

Another way to solve this (and a very fast one) is to create an index file by record number. It takes about six minutes to create an index file for a csv file of 6,867,839 rows and 9 GB, and an additional 2 minutes for joblib to store it on disk.

This method is particularly worthwhile if you are dealing with huge files, like 3 GB or more.
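Stripped down, the idea is simply to record where every record starts, so that any row can later be reached with a single seek. A minimal sketch of that idea using plain byte offsets (readline() plus tell(), which report correct positions on a file opened in binary mode); the full per-block index built below stores block/offset pairs instead:

# build a list of byte offsets, one per record, so any row can be
# reached later with f.seek(offsets[n]) followed by f.readline()
offsets = []
with open('filename.csv', 'rb') as f:     # 'filename.csv' is a placeholder
    while True:
        pos = f.tell()                    # offset where the next record starts
        line = f.readline()
        if not line:                      # end of file
            break
        offsets.append(pos)

# example: jump straight to record 3 (0-based) without reading anything else
with open('filename.csv', 'rb') as f:
    f.seek(offsets[3])
    print(f.readline())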

Here's the code for creating the index file:

# Usage:
#
#   creaidx.py filename.csv
#
# Indexes a csv file by record number. This can be used to
# access any record directly or to split the file without
# needing to read it all. The index file is joblib-stored as
# filename.index
#
# filename.csv is the file to create the index for

import os,sys,joblib

BLKSIZE=512   # bytes per block; index entries store [record number, block, offset within block]

def checkopen(s,m='r',bz=None):
    if os.access(s,os.F_OK):
        if bz==None:
            return open(s,m)     # returns open file
        else:
            return open(s,m,bz)  # returns open file with buffer size
    else:
        return None

# scan the current block for b'\r' (the code assumes CRLF line endings) and
# append [record number, block number, offset of the record start] to idx
def get_blk():
    global ix,off,blk,buff
    while True:            # dealing with special cases
        if ix==0:
            n=0
            break
        if buff[0]==b'\r':
            n=2
            off=0
            break
        if off==BLKSIZE-2:
            n=0
            off=0
            break
        if off==BLKSIZE-1:
            n=0
            off=1
            break
        n=2
        off=buff.find(b'\r')
        break
    while (off>=0 and off<BLKSIZE-2):
        idx.append([ix,blk,off+n]) 
#        g.write('{},{},{}\n'.format(ix,blk,off+n)) 
        print(ix,end='\r')
        n=2
        ix+=1
        off= buff.find(b'\r',off+2)

# read the file block by block, building the index; the final entry's
# offset is set to -1 as an end-of-index marker
def crea_idx():
    global buff,blk
    buff=f.read(BLKSIZE)
    while len(buff)==BLKSIZE:
        get_blk()
        buff=f.read(BLKSIZE)
        blk+=1        
    get_blk()
    idx[-1][2]=-1 
    return

if len(sys.argv)==1:
    sys.exit("Need to provide a csv filename!")
ix=0
blk=0
off=0
idx=[]
buff=b'0'
s=sys.argv[1]
f=checkopen(s,'rb')
idxfile=s.replace('.csv','.index')
if checkopen(idxfile)==None:
    with open(idxfile,'w') as g:
        crea_idx()
        joblib.dump(idx,idxfile)
else:
    if os.path.getctime(idxfile)<os.path.getctime(s):
        with open(idxfile,'w') as g:
            crea_idx()
            joblib.dump(idx,idxfile)
f.close()
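The script is run from the command line as python creaidx.py filename.csv (see the usage comment at the top); the resulting index can then be loaded back in the splitting program with joblib, along these lines:

import joblib

idx = joblib.load('filename.index')   # list of [record number, block, offset] entries
print(len(idx) - 1)                   # number of records including the header;
                                      # the last entry is only the -1 end marker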

Let's use a toy example:

strings,numbers,colors
string1,1,blue
string2,2,red
string3,3,green
string4,4,yellow

The index file will be:

   [[0, 0, 0], 
    [1, 0, 24], 
    [2, 0, 40], 
    [3, 0, 55], 
    [4, 0, 72], 
    [5, 0, -1]]

Each entry is [record number, block number, offset of the record start within the block]; for instance, the header line "strings,numbers,colors\r\n" is 24 bytes long, which is why record 1 starts at offset 24. Note the -1 in the last index element: it indicates the end of the index file in the case of sequential access. You can use a tool like this to access any individual row of the csv file:

# fetch record n; relies on the globals idx (the loaded index), f (the csv
# file opened in binary mode) and BLKSIZE already being set
def get_rec(n=1,binary=False):
    n=1 if n<0 else n+1
    s=b'' if binary else '' 
    if len(idx)==0:return ''
    if idx[n-1][2]==-1:return ''
    f.seek(idx[n-1][1]*BLKSIZE+idx[n-1][2])
    buff=f.read(BLKSIZE)
    x=buff.find(b'\r')
    while x==-1:
        s=s+buff if binary else s+buff.decode()
        buff=f.read(BLKSIZE)
        x=buff.find(b'\r')
    return s+buff[:x]+b'\r\n' if binary else s+buff[:x].decode()

The first field of the index record is not strictly necessary; it is kept there for debugging purposes. As a side note, if you substitute any field of the csv record for this field and sort the index file by it, then you effectively have the csv file sorted by that field whenever you use the index to access the csv file.
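Note that get_rec relies on the globals idx, f and BLKSIZE already being set, so before using it you load the index and open the csv file in binary mode; something like this (file names are placeholders, records refer to the toy example above):

import joblib

BLKSIZE = 512
idx = joblib.load('filename.index')    # the index built by creaidx.py
f = open('filename.csv', 'rb')         # get_rec seeks into this file

print(get_rec(0))   # record 0, the header line: strings,numbers,colors
print(get_rec(2))   # record 2: string2,2,red in the toy example
f.close()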

Now, once your index file has been created, you just call the following program with two command line parameters: the filename (the one whose index was already created) and a number between 1 and 100, which is the percentage at which the file will be split:

import sys
import time
import joblib
from common import Drv,checkopen  # the author's helper module (Drv looks like a
                                  # path prefix); checkopen and get_rec are the
                                  # functions shown above

start_time = time.time()
BLKSIZE=512
WSIZE=1048576 # pow(2,20), 1 MB, for faster reading/writing
ix=0
blk=0
off=0
idx=[]
buff=b'0'
if len(sys.argv)<3:
    sys.exit('Argument missing!')
s=Drv+sys.argv[1]
if sys.argv[2].isnumeric():
    pct=int(sys.argv[2])/100
else:
    sys.exit('Bad percentage: '+sys.argv[2])

f=checkopen(s,'rb')
idxfile=s.replace('.csv','.index')
if checkopen(idxfile):
    print('Loading index...')
    idx=joblib.load(idxfile)
    print('Done loading index.')
else:
    sys.exit(idxfile+' does not exist.')
head=get_rec(0,True)
n=int(pct*(len(idx)-2))
off=idx[n+1][1]*BLKSIZE+idx[n+1][2]-len(head)-1
num=off//WSIZE
res=off%WSIZE
sout=s.replace('.csv','.part1.csv')
i=0
with open(sout,'wb') as g:
    g.write(head)
    f.seek(idx[1][1]*BLKSIZE+idx[1][2])
    for x in range(num):
        print(i,end='\r')
        i+=1
        buff=f.read(WSIZE)
        g.write(buff)
    buff=f.read(res)
    g.write(buff)
print()
i=0    
sout=s.replace('.csv','.part2.csv')    
with open(sout,'wb') as g:
    g.write(head)
    f.seek(idx[n+1][1]*BLKSIZE+idx[n+1][2])
    buff=f.read(WSIZE)
    while len(buff)==WSIZE:
        g.write(buff)
        print(i,end='\r')
        i+=1
        buff=f.read(WSIZE)
    g.write(buff)
    
end_time = time.time()

The files are created using blocks of 1,048,576 bytes. You can play with that figure to make file creation faster or to tailor it to machines with fewer memory resources.

The file is split into only two partitions, each of them having the header of the original file. It is not too difficult to change the code to make it split files into more than two partitions, as sketched below.
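A rough sketch of that generalization (my own variation, not the author's code): the index already gives the byte position of every record, so the cut points for N parts can be computed directly from it, writing the header to each part. It assumes idx, f, BLKSIZE, WSIZE and get_rec set up exactly as in the two-partition program above:

def split_n(s, nparts):
    # s is the csv filename; idx, f, BLKSIZE, WSIZE and get_rec are assumed
    # to be set up as in the two-partition program above
    head = get_rec(0, True)
    nrecs = len(idx) - 2                     # data records, excluding header and end marker
    bounds = [1 + (nrecs * k) // nparts for k in range(nparts)]  # first record of each part
    for k in range(nparts):
        start = idx[bounds[k]][1] * BLKSIZE + idx[bounds[k]][2]
        end = None                           # last part runs to the end of the file
        if k < nparts - 1:
            end = idx[bounds[k + 1]][1] * BLKSIZE + idx[bounds[k + 1]][2]
        with open(s.replace('.csv', '.part{}.csv'.format(k + 1)), 'wb') as g:
            g.write(head)
            f.seek(start)
            remaining = None if end is None else end - start
            while True:
                buff = f.read(WSIZE if remaining is None else min(WSIZE, remaining))
                if not buff:
                    break
                g.write(buff)
                if remaining is not None:
                    remaining -= len(buff)
                    if remaining <= 0:
                        break

For example, split_n(s, 4) would produce filename.part1.csv through filename.part4.csv, each starting with the original header.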

Finally, to put things in perspective: to split a csv file of 6,867,839 rows and 9 GB at 50%, it took me roughly 6 minutes to create the index file, another 2 minutes for joblib to store it on disk, and 3 additional minutes to split the file.
