
Read multiple CSV files in Pandas in chunks

How do I import and read multiple CSV files in chunks when the total size of all the CSV files is around 20 GB?

I don't want to use Spark, because I want to use a model from scikit-learn, so I want the solution in Pandas itself.

My code is:

import glob
import os
import pandas as pd

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",") for f in allFiles))
df.reset_index(drop=True, inplace=True)

But this fails, because the total size of all the CSVs in my path is 17 GB.

I want to read it in chunks, but I get an error if I try it like this:

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",", chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)

The error I get is this:

"cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"

Can someone help?

To read a large CSV file, you can use chunksize, but in that case you have to consume it as an iterator, like this:

for df in pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000):
    process(df)  # process() stands for whatever per-chunk work you need

You have to concat or append each chunk to build the full result.
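For example, a minimal sketch of collecting the chunks in a list and concatenating once at the end (the filter on a hypothetical column 'x' just stands in for whatever per-chunk processing you need):

import pandas as pd

chunks = []
for df in pd.read_csv('file.csv', sep=',', chunksize=10000):
    # Hypothetical per-chunk step: keep only rows where column 'x' is positive.
    chunks.append(df[df['x'] > 0])
result = pd.concat(chunks, ignore_index=True)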

Or you could do it like this:

reader = pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000)
for chunk in reader:
    process(chunk)

To read multiple files, for example:

import pandas as pd

listfile = ['file1.csv', 'file2.csv']
chunks = []

def process(d):
    # Filter/transform the chunk here, then collect it.
    chunks.append(d)

for f in listfile:
    for df in pd.read_csv(f, sep=',', iterator=True, chunksize=10000):
        process(df)

# Appending DataFrames one by one is slow; concat the collected chunks once.
dfx = pd.concat(chunks, ignore_index=True)

Once you have a lot of files, you could use Dask or Pool from the multiprocessing library to launch many reading processes in parallel.
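For example, a minimal sketch with multiprocessing.Pool, assuming each individual file fits in memory on its own (the read_one helper and the 'your_path' directory are hypothetical names):

import glob
import os
from multiprocessing import Pool

import pandas as pd

def read_one(path):
    # Each worker process reads one whole file.
    return pd.read_csv(path, sep=',')

if __name__ == '__main__':
    files = glob.glob(os.path.join('your_path', '*.csv'))
    with Pool(processes=4) as pool:  # 4 workers; tune to your CPU count
        frames = pool.map(read_one, files)
    df = pd.concat(frames, ignore_index=True)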

Either way, you need either enough memory or enough time.

This is an interesting question. I haven't tried this, but I think the code would look something like the script below.

import pandas as pd
import glob
import os

#os.chdir("C:\\your_path\\")
filelist = glob.glob("C:\\your_path\\*.csv")
dfList = []
for filename in filelist:
    print(filename)
    namedf = pd.read_csv(filename, index_col=0)
    dfList.append(namedf)

# DataFrame.append is deprecated (removed in pandas 2.0); concat the list instead.
results = pd.concat(dfList)
results.to_csv('C:\\your_path\\Combinefile.csv')


chunksize = 10 ** 6
for chunk in pd.read_csv('C:\\your_path\\Combinefile.csv', chunksize=chunksize):
    process(chunk)  # process() stands for whatever per-chunk work you need

Maybe you could load everything into memory and process it directly, but that would probably take a lot longer.

One way to do this is to chunk the DataFrame with pd.read_csv(file, chunksize=chunksize); whenever the last chunk you read from a file is shorter than the chunksize, save the extra bit and prepend it to the first chunk of the next file.

To make that work, read a correspondingly smaller first chunk from the next file, so that the combined chunk equals the full chunk size.

import os
import pandas as pd

def chunk_from_files(dir, master_chunksize):
    '''
    Provided a directory, loops through its files and chunks out dataframes.
    :param dir: Directory of csv files.
    :param master_chunksize: Size of chunk to output.
    :return: Yields dataframes of master_chunksize rows (the last may be shorter).
    '''
    files = os.listdir(dir)

    extra_chunk = None  # Short tail chunk left over from the previous file.
    for file in files:
        csv_file = os.path.join(dir, file)

        with pd.read_csv(csv_file, iterator=True) as reader:
            while True:
                # Read fewer rows when leftover rows have to be prepended,
                # so the combined chunk adds up to master_chunksize.
                n = master_chunksize - (extra_chunk.shape[0] if extra_chunk is not None else 0)
                try:
                    chunk = reader.get_chunk(n)
                except StopIteration:
                    break  # This file is exhausted; move on to the next one.
                if extra_chunk is not None:
                    # Prepend the previous file's leftover rows to this chunk.
                    chunk = pd.concat([extra_chunk, chunk], ignore_index=True)
                    extra_chunk = None
                if chunk.shape[0] < master_chunksize:
                    # Short chunk: save it as the extra bit and go to the next file.
                    extra_chunk = chunk
                    break
                yield chunk

    if extra_chunk is not None:
        yield extra_chunk  # Whatever is left after the last file.
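A usage sketch, assuming a hypothetical ./data directory of CSVs that all share the same columns:

for chunk in chunk_from_files('./data', master_chunksize=10000):
    print(chunk.shape)  # every chunk has master_chunksize rows, except possibly the last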
