
Using multiprocessing with Pandas to read, modify and write thousands of csv files

So I have about 5000 csv files under one directory, containing stocks' minute data. Each file is named by its symbol, e.g. the data for stock AAPL is in AAPL.csv.

I am trying to do some cleanup and editing on each of them. Specifically, I want to convert one column that contains unix epoch timestamps into readable date and time columns, and also rename one column.

I tried to use multiprocessing to speed up the process, but my first attempt nearly killed my MacBook.

I run it inside VS Code's Jupyter notebook, if that matters.

I wonder what I did wrong and how to improve, and more generally how to handle similar tasks in python and pandas.

Thank you!

Here is my code.

# Define the operation that will be used in multiprocessing
def clean_up(file,fail_list):
    print('Working on {}'.format(file))
    stock = pd.read_csv('./Data/minutes_data/' + file)

    try:
        #Convert datetime columns into readable date and time column
        stock['Date'] = stock.apply(lambda row: epoch_converter.get_date_from_mill_epoch(row['datetime']), axis=1)
        stock['Time'] = stock.apply(lambda row: epoch_converter.get_time_from_mill_epoch(row['datetime']), axis=1)

        #Rename 'Unnamed: 0' column into 'Minute'
        stock.rename(columns={'Unnamed: 0':'Minute'}, inplace=True)

        #Write it back to new file
        stock.to_csv('./Data/working_data/' + file)
    except:
        print('{} not successful'.format(file))
        fail_list = fail_list.append(file)
        fail_list.to_csv('./failed_list.csv')



#Get file list to working on.
file_list = os.listdir('./Data/minutes_data/')

#prepare failed_list
fail_list = pd.DataFrame([])
#Loop through each file
processes = []
for file in file_list:
    p = multiprocessing.Process(target=clean_up, args=(file,fail_list,))
    processes.append(p)
    p.start()

for process in processes:
    process.join()

Update: CSV_FILE_SAMPLE

,open,high,low,close,volume,datetime
0,21.9,21.9,21.9,21.9,200,1596722940000
0,20.0,20.0,19.9937,19.9937,200,1595266500000
1,20.0,20.0,19.9937,19.9937,500,1595266800000
2,20.0,20.0,19.9937,19.9937,1094,1595267040000
3,20.0,20.0,20.0,20.0,200,1595268240000

Final Update:

Combining the answers from @furas and @jsmart, the script managed to reduce the processing time of 5000 csv files from hours to under 1 minute (on a 6-core i9 MacBook Pro). I'm happy. You guys are awesome. Thanks!

The final script is here:

import pandas as pd
import numpy as np
import os
import multiprocessing
import logging

logging.basicConfig(filename='./log.log',level=logging.DEBUG)

file_list = os.listdir('./Data/minutes_data/')

def cleanup(file):
    print('Working on ' + file)
    stock = pd.read_csv('./Data/minutes_data/' + file)
    
    try:
        #Convert datetime columns into readable date and time column
        stock['Date'] = pd.to_datetime(stock['datetime'],unit='ms',utc=True).dt.tz_convert('America/New_York').dt.date
        stock['Time'] = pd.to_datetime(stock['datetime'],unit='ms',utc=True).dt.tz_convert('America/New_York').dt.time

        #Rename 'Unnamed: 0' column into 'Minute'
        stock.rename(columns={'Unnamed: 0':'Minute'}, inplace=True)

        #Write it back to new file
        stock.to_csv('./Data/working_data/' + file)
    except:
        print(file + ' Not successful')
        logging.warning(file + ' Not complete.')



pool = multiprocessing.Pool()
pool.map(cleanup, file_list)

Using Process in a loop, you create 5000 processes at the same time.

You could use Pool to control how many processes work at the same time - it will automatically reuse a freed process for the next file.

clean_up can also use return to send the name of a failed file back to the main process, which can then save the list once. Writing to the same file from many processes can corrupt its data. Besides, processes don't share variables, so every process would have its own empty DataFrame and would later save only its own failed file, removing the previous content.

def clean_up(file):
    try:
        # ... code ...
        return None  # if OK
    except:
        return file  # if failed
    
    
# --- main ---

# get file list to working on.
file_list = sorted(os.listdir('./Data/minutes_data/'))

with multiprocessing.Pool(10) as p:
    failed_files = p.map(clean_up, file_list)

# remove None from names
failed_files = filter(None, failed_files)

# save all
df = pd.DataFrame(failed_files)
df.to_csv('./failed_list.csv')

There is also multiprocessing.pool.ThreadPool, which uses threads instead of processes.
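
A minimal sketch of the same pattern with ThreadPool, reusing the clean_up function above (assumed to return None on success and the file name on failure); because threads share memory, this mainly helps when the work is I/O-bound rather than CPU-bound:

from multiprocessing.pool import ThreadPool

# Same map-based pattern as Pool, but with threads instead of processes.
# pandas parsing is mostly CPU-bound, so processes are usually faster here,
# but threads avoid the cost of starting new interpreters.
with ThreadPool(10) as tp:
    failed_files = tp.map(clean_up, file_list)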

The concurrent.futures module also has ThreadPoolExecutor and ProcessPoolExecutor.
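
For comparison, a hedged sketch with ProcessPoolExecutor, again assuming the same clean_up function as above:

from concurrent.futures import ProcessPoolExecutor

# executor.map works like Pool.map; max_workers limits how many
# files are processed at the same time.
with ProcessPoolExecutor(max_workers=10) as executor:
    failed_files = list(executor.map(clean_up, file_list))

# keep only the names of files that failed
failed_files = [name for name in failed_files if name is not None]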

You can also try to do it with external modules - but I don't remember which would be useful here.

The original post asked "...how to handle similar tasks in python and pandas."

  • Replacing .apply(..., axis=1) can increase throughput by 100x or better.
  • Here is an example with 10_000 rows of data:
%%timeit
df['date'] = df.apply(lambda x: pd.to_datetime(x['timestamp'], unit='ms'), axis=1)
792 ms ± 26.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Re-write as:

%%timeit
df['date'] = pd.to_datetime(df['timestamp'], unit='ms')
4.88 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Sample data:

print(df['timestamp'].head())
0    1586863008214
1    1286654914895
2    1436424291218
3    1423512988135
4    1413205308057
Name: timestamp, dtype: int64
