
How can I use multithreading on this for loop to decrease the execution time?

I have a folder that contains thousands of subfolders, which in turn contain thousands of files.

import os
import pandas as pd

cb = []

# dir is the top-level folder (defined elsewhere)
for root, dirs, files in os.walk(dir):
    for name in files:
        filepath = os.path.join(root, name)
        df = pd.read_csv(filepath, index_col=False)

        # Check for rows first: values[0] on an empty frame raises IndexError
        if df.shape[0] > 0:
            df['TimeStamp'] = pd.to_datetime(df.TimeStamp, format='%Y-%m-%d %H:%M:%S')
            date = df['TimeStamp'].dt.date.values[0]
            time = df['TimeStamp'].dt.time.values[0]
            cb.append({'Time': time, 'Date': date})

I need to open every file, do some data processing on it, and append the result to an initially empty DataFrame.

Doing this sequentially takes days to run. Is there a way to use multiprocessing or threading to reduce the time without skipping any files in the process?

You can put the per-file work into a separate function and then use a multiprocessing pool to push the processing to separate processes. This helps with CPU-bound work such as parsing the CSV and converting the timestamps, but the file reads themselves are not made any faster: each read takes as long as it did in your original serial processing. The trick to multiprocessing is to keep the amount of data flowing through the pool itself to a minimum. Since you only pass in a file name and return a couple of date and time objects in this example, you're good on that point.

import multiprocessing as mp
import pandas as pd
import os

def worker(filepath):
    # Runs in a worker process: takes one path, returns a small dict or None.
    df = pd.read_csv(filepath, index_col=False)

    # Check for an empty file first: values[0] on an empty frame raises IndexError.
    if df.shape[0] == 0:
        return None

    df['TimeStamp'] = pd.to_datetime(df.TimeStamp, format='%Y-%m-%d %H:%M:%S')
    date = df['TimeStamp'].dt.date.values[0]
    time = df['TimeStamp'].dt.time.values[0]
    return {'Time': time, 'Date': date}

if __name__ == "__main__":
    # dir is the top-level folder from the question (defined elsewhere).
    csv_files = [os.path.join(root, name)
                 for root, dirs, files in os.walk(dir)
                 for name in files]
    with mp.Pool() as pool:
        # chunksize=1 hands each worker one file at a time.
        cb = [result for result in pool.map(worker, csv_files, chunksize=1)
              if result]
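
On the "not skipping any files" part of the question: with pool.map, one unreadable file raises in the parent and the whole run is lost. A minimal sketch of one way around that is below. It is not part of the answer above; safe_worker, process_all and top_dir are hypothetical names chosen for illustration. Each call reports back its own path together with either a result or an error, so every file ends up in exactly one of three lists.

```python
import multiprocessing as mp
import os

import pandas as pd


def safe_worker(filepath):
    """Return (filepath, result, error) so every input file is accounted for."""
    try:
        df = pd.read_csv(filepath, index_col=False)
        if df.empty:
            return filepath, None, None
        ts = pd.to_datetime(df["TimeStamp"], format="%Y-%m-%d %H:%M:%S")
        first = ts.iloc[0]
        return filepath, {"Time": first.time(), "Date": first.date()}, None
    except Exception as exc:
        # Report the failure to the parent instead of aborting the whole pool.
        return filepath, None, repr(exc)


def process_all(csv_files, processes=None):
    """Run safe_worker over csv_files in a pool and sort the outcomes."""
    cb, empty, failed = [], [], []
    with mp.Pool(processes) as pool:
        # imap_unordered yields results as they finish; order is not preserved.
        for path, result, error in pool.imap_unordered(safe_worker, csv_files):
            if error is not None:
                failed.append((path, error))
            elif result is None:
                empty.append(path)
            else:
                cb.append(result)
    return cb, empty, failed


if __name__ == "__main__":
    top_dir = "."  # placeholder: replace with the real top-level folder
    csv_files = [os.path.join(root, name)
                 for root, dirs, files in os.walk(top_dir)
                 for name in files]
    cb, empty, failed = process_all(csv_files)
    print(len(csv_files), "files:", len(cb), "ok,",
          len(empty), "empty,", len(failed), "failed")
```

The three counts should add up to the number of files found, and the failed list says which paths to look at or rerun.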

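Since the pool does not speed up the reads themselves, another option is to read less. The code shown only uses the first row's TimeStamp, so, assuming nothing else in the file is needed for this step (the real processing may well need more), pandas can be asked to parse just that one cell. The sketch below uses the read_csv parameters usecols and nrows; first_timestamp is a hypothetical name. Unlike the original, it does not validate the timestamp format of the remaining rows.

```python
import pandas as pd


def first_timestamp(filepath):
    """Read only the first data row of the TimeStamp column."""
    # usecols limits parsing to one column, nrows=1 stops after one data row.
    df = pd.read_csv(filepath, usecols=["TimeStamp"], nrows=1)
    if df.empty:
        return None
    first = pd.to_datetime(df["TimeStamp"].iloc[0], format="%Y-%m-%d %H:%M:%S")
    return {"Time": first.time(), "Date": first.date()}
```

A function like this could be passed to the pool in place of worker, since it takes a path and returns the same kind of small dict or None.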
    

