
How to use multi threading on this for loop to decrease the execution time?

I have a folder that contains 1000's of folders, under each of which there are 1000's of files.

import os
import pandas as pd

cb = []

# dir is the path to the top-level folder containing the nested folders of CSV files
for root, dirs, files in os.walk(dir):
    for name in files:
        filepath = root + os.sep + name
        df = pd.read_csv(filepath, index_col=False)
        df['TimeStamp'] = pd.to_datetime(df.TimeStamp, format='%Y-%m-%d %H:%M:%S')
        date = df['TimeStamp'].dt.date.values[0]
        time = df['TimeStamp'].dt.time.values[0]

        if (df.shape[0] > 0):
            cb.append({'Time': time, 'Date': date})

I need to open all the files, do some data processing on them, and append the data to an empty dataframe.

Doing it sequentially takes days to run. Is there a way I can use multiprocessing/threading to reduce the time without skipping any files in the process?

You can put the per-file work into a separate function and then use a multiprocessing pool to push the processing to separate processes. This helps with CPU-bound calculations, but the file reads will take just as long as in your original serial processing. The trick with multiprocessing is to keep the amount of data flowing through the pool itself to a minimum. Since you only pass a file name and return a couple of date/time objects in this example, you're good on that point.

import multiprocessing as mp
import pandas as pd
import os

def worker(filepath):
    # Read one CSV and return its first date/time, or None for empty files
    # so they can be filtered out afterwards.
    df = pd.read_csv(filepath, index_col=False)
    if df.shape[0] == 0:
        return None
    df['TimeStamp'] = pd.to_datetime(df.TimeStamp, format='%Y-%m-%d %H:%M:%S')
    date = df['TimeStamp'].dt.date.values[0]
    time = df['TimeStamp'].dt.time.values[0]
    return {'Time': time, 'Date': date}

if __name__ == "__main__":
    # dir is the path to the top-level folder, as in the question
    csv_files = [root + os.sep + name
                 for root, dirs, files in os.walk(dir)
                 for name in files]
    # Each task passes only a file path in and a small dict out, so very
    # little data has to be pickled between processes.
    with mp.Pool() as pool:
        cb = [result for result in pool.map(worker, csv_files, chunksize=1)
              if result]
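
If the file reads turn out to dominate (I/O-bound rather than CPU-bound), a thread pool can be a lighter-weight alternative, since threads share memory and avoid pickling. Below is a minimal sketch using concurrent.futures.ThreadPoolExecutor with the same worker function and csv_files list as above; the max_workers value is only an assumption to tune for your storage. It also shows building the dataframe you mentioned from the collected rows with pd.DataFrame(cb).

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

if __name__ == "__main__":
    # Reuse the worker() and csv_files defined above.
    # max_workers=8 is an assumption; tune it for your disk or network storage.
    with ThreadPoolExecutor(max_workers=8) as executor:
        cb = [result for result in executor.map(worker, csv_files) if result]

    # Build the final dataframe from the collected rows, as described in the question.
    out = pd.DataFrame(cb)

Note that pandas' CSV parsing still does CPU work under the GIL, so the process pool above may well stay faster for larger files; profiling both on a sample of your data is the safest way to decide.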

    
