
Python - multiprocessing multiple large size files using pandas

I have a y.csv file. The file size is 10 MB and it contains data from Jan 2020 to May 2020.

I also have a separate file for each month, e.g. data-2020-01.csv. It contains detailed data, and each monthly file is around 1 GB in size.

I'm splitting y.csv by month and then processing the data by loading the relevant month file. This process takes too long when I go for a large number of months, e.g. 24 months.

I would like to process the data faster. I have access to an AWS m6i.8xlarge instance, which has 32 vCPUs and 128 GB of memory.

I'm new to multiprocessing, so can someone guide me here?

This is my current code.

import pandas as pd

periods = [(2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]

y = pd.read_csv("y.csv", index_col=0, parse_dates=True).fillna(0)  # Filesize: ~10 MB


def process(_month_df, _index):
    idx = _month_df.index[_month_df.index.get_loc(_index, method='nearest')]
    for _, value in _month_df.loc[idx:].itertuples():

        up_delta = 200
        down_delta = 200

        up_value = value + up_delta
        down_value = value - down_delta

        if value > up_value:
            y.loc[_index, "result"] = 1
            return

        if value < down_value:
            y.loc[_index, "result"] = 0
            return


for x in periods:
    filename = "data-" + str(x[0]) + "-" + str(x[1]).zfill(2)  # data-2020-01
    filtered_y = y[(y.index.month == x[1]) & (y.index.year == x[0])]  # Only get the current month records
    month_df = pd.read_csv(f'{filename}.csv', index_col=0, parse_dates=True)  # Filesize: ~1 GB (data-2020-01.csv)

    for index, row in filtered_y.iterrows():
        process(month_df, index)

As noted in multiple pandas/threading questions, reading CSV files is I/O bound, so you can get some benefit from using a ThreadPoolExecutor.
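
As a rough sketch of that idea (not code from the question or the answer): concurrent.futures.ThreadPoolExecutor can overlap the monthly read_csv calls while each thread waits on disk. The read_month helper and the filename pattern below are assumptions carried over from the question.

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def read_month(period):
    # Hypothetical helper: build the monthly filename and read it
    filename = f"data-{period[0]}-{str(period[1]).zfill(2)}.csv"
    return period, pd.read_csv(filename, index_col=0, parse_dates=True)

periods = [(2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() keeps the input order; the reads themselves overlap
    month_frames = dict(executor.map(read_month, periods))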

At the same time, if you are going to perform aggregating operations, consider performing the read_csv inside your worker as well, and use a ProcessPoolExecutor instead.

If you are going to pass a lot of data between your processes, you will also need a proper memory-sharing method.

However, I see the use of iterrows and itertuples. In general those two instructions make my eyes bleed. Are you sure you cannot process the data in a vectorised way?

I am not sure what this particular section is supposed to do, and having M rows will make it very slow.

def process(_month_df, _index):
    idx = _month_df.index[_month_df.index.get_loc(_index, method='nearest')]
    for _, value in _month_df.loc[idx:].itertuples():

        up_delta = 200
        down_delta = 200

        up_value = value + up_delta
        down_value = value - down_delta

        if value > up_value:
            y.loc[_index, "result"] = 1
            return

        if value < down_value:
            y.loc[_index, "result"] = 0
            return

Below is vectorised code to find whether it goes up or down, and in which row:

import numpy as np
import pandas as pd

df = pd.DataFrame({'vals': np.random.random(10) * 1000 + 5000}).astype('int64')
print(df.vals.values)

up_value = 6000
down_value = 3000
# Widen the band by 200 per row, mirroring the per-row delta of the original loop
valsup = df.vals.values + 200 * np.arange(df.shape[0]) + 200
valsdown = df.vals.values - 200 * np.arange(df.shape[0]) - 200

#! argmax returns 0 if all comparisons are False, so argwhere is used instead
# idx_up = np.argmax(valsup > up_value)
# idx_dwn = np.argmax(valsdown < down_value)

idx_up = np.argwhere(valsup > up_value)
idx_dwn = np.argwhere(valsdown < down_value)
idx_up = idx_up[0][0] if len(idx_up) else -1
idx_dwn = idx_dwn[0][0] if len(idx_dwn) else -1

if idx_up < 0 and idx_dwn < 0:
    print("Not up nor down")
elif idx_up >= 0 and (idx_up < idx_dwn or idx_dwn < 0):
    print(f"Result is positive, in position {idx_up}")
else:
    print(f"Result is negative, in position {idx_dwn}")

For the sake of completeness, benchmarking itertuples() and the argwhere approach for 1000 elements:

  • .itertuples(): 757 µs
  • arange + argwhere: 60 µs
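
The exact numbers will vary by machine. A minimal timeit sketch that could reproduce this kind of comparison follows; the statements are simplified stand-ins for the snippets above, not the exact benchmark that produced those figures.

import timeit

setup = """
import numpy as np
import pandas as pd
df = pd.DataFrame({'vals': np.random.random(1000) * 1000 + 5000}).astype('int64')
"""

loop_stmt = """
for row in df.itertuples():
    if row.vals > 6000 or row.vals < 3000:
        break
"""

vec_stmt = """
valsup = df.vals.values + 200 * np.arange(df.shape[0]) + 200
np.argwhere(valsup > 6000)
"""

# Average per-call time in seconds for each approach
print(timeit.timeit(loop_stmt, setup=setup, number=1000) / 1000)
print(timeit.timeit(vec_stmt, setup=setup, number=1000) / 1000)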

A multithreading pool would be ideal for sharing the y dataframe among threads (obviating the need for shared memory) but is not so good at running the more CPU-intensive processing in parallel. A multiprocessing pool is great for doing CPU-intensive processing but not so great at sharing data across processes without coming up with a shared-memory representation of your y dataframe.
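
For reference, a bare-bones sketch of what such a shared-memory representation might look like for a single numeric column (this uses multiprocessing.shared_memory, available from Python 3.8; the array below is a stand-in for something like y["result"].to_numpy()):

import numpy as np
from multiprocessing import shared_memory

# Stand-in for one numeric column of the y dataframe
values = np.zeros(1000, dtype="float64")

shm = shared_memory.SharedMemory(create=True, size=values.nbytes)
shared_view = np.ndarray(values.shape, dtype=values.dtype, buffer=shm.buf)
shared_view[:] = values  # copy the data into the shared block once

# A worker process would reattach by name, without copying the data again:
#   existing = shared_memory.SharedMemory(name=shm.name)
#   arr = np.ndarray((1000,), dtype="float64", buffer=existing.buf)

shm.close()
shm.unlink()  # release the block once every process is done with it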

Here I have rearranged your code so that a multithreading pool creates filtered_y for each period (which is a CPU-intensive operation, but pandas does release the Global Interpreter Lock for certain operations -- hopefully this one). Then we pass only one month's worth of data to a multiprocessing pool, rather than the entire y dataframe, to process that month with the worker function process_month. But since each pool process does not have access to the y dataframe, it just returns the indices that need to be updated along with the replacement values.

import pandas as pd
from multiprocessing import cpu_count
from multiprocessing.pool import Pool, ThreadPool

def process_month(period, filtered_y):
    """
    returns a list of tuples consisting of (index, value) pairs
    """
    filename = "data-" + str(period[0]) + "-" + str(period[1]).zfill(2)  # data-2020-01
    month_df = pd.read_csv(f'{filename}.csv', index_col=0, parse_dates=True)  # Filesize: ~1 GB (data-2020-01.csv)
    results = []
    for index, row in filtered_y.iterrows():   
        idx = month_df.index[month_df.index.get_loc(index, method='nearest')]
        for _, value in month_df.loc[idx:].itertuples():
    
            up_delta = 200
            down_delta = 200
    
            up_value = value + up_delta
            down_value = value - down_delta
    
            if value > up_value:
                results.append((index, 1))
                break
    
            if value < down_value:
                results.append((index, 0))
                break
    return results

def process(period):
    filtered_y = y[(y.index.month == period[1]) & (y.index.year == period[0])]  # Only get the current month records
    for index, value in multiprocessing_pool.apply(process_month, (period, filtered_y)):
        y.loc[index, "result"] = value

def main():
    global y, multiprocessing_pool

    periods = [(2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]
    y = pd.read_csv("y.csv", index_col=0, parse_dates=True).fillna(0)  # Filesize: ~10 MB

    MAX_THREAD_POOL_SIZE = 100
    thread_pool_size = min(MAX_THREAD_POOL_SIZE, len(periods))
    multiprocessing_pool_size = min(thread_pool_size, cpu_count())
    with Pool(multiprocessing_pool_size) as multiprocessing_pool, \
            ThreadPool(thread_pool_size) as thread_pool:
        thread_pool.map(process, periods)
        
    # Presumably y gets written out again as a CSV file here?

# Required for Windows:
if __name__ == '__main__':
    main()

Version Using Just a Single Multiprocessing Pool

import pandas as pd
from multiprocessing import cpu_count
from multiprocessing.pool import Pool

def process_month(period):
    """
    returns a list of tuples consisting of (index, value) pairs
    """
    y = pd.read_csv("y.csv", index_col=0, parse_dates=True).fillna(0)  # Filesize: ~10 MB
    filtered_y = y[(y.index.month == period[1]) & (y.index.year == period[0])]  # Only get the current month records
    filename = "data-" + str(period[0]) + "-" + str(period[1]).zfill(2)  # data-2020-01
    month_df = pd.read_csv(f'{filename}.csv', index_col=0, parse_dates=True)  # Filesize: ~1 GB (data-2020-01.csv)
    results = []
    for index, row in filtered_y.iterrows():   
        idx = month_df.index[month_df.index.get_loc(index, method='nearest')]
        for _, value in month_df.loc[idx:].itertuples():
    
            up_delta = 200
            down_delta = 200
    
            up_value = value + up_delta
            down_value = value - down_delta
    
            if value > up_value:
                results.append((index, 1))
                break
    
            if value < down_value:
                results.append((index, 0))
                break
    return results

def main():
    periods = [(2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]

    multiprocessing_pool_size = min(len(periods), cpu_count())
    with Pool(multiprocessing_pool_size) as multiprocessing_pool:
        results_list = multiprocessing_pool.map(process_month, periods)
    y = pd.read_csv("y.csv", index_col=0, parse_dates=True).fillna(0)  # Filesize: ~10 MB
    for results in results_list:
        for index, value in results:
            y.loc[index, "result"] = value
    # Write out new csv file:
    ...

# Required for Windows:
if __name__ == '__main__':
    main()

And now for a variation of this that uses a bit more memory but allows the main process to overlap its processing with the multiprocessing pool. This could be beneficial if the number of indices needing to be updated is quite large:

...
def main():
    periods = [(2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]

    multiprocessing_pool_size = min(len(periods), cpu_count() - 1) # save a core for the main process
    y = pd.read_csv("y.csv", index_col=0, parse_dates=True).fillna(0)  # Filesize: ~10 MB
    with Pool(multiprocessing_pool_size) as multiprocessing_pool:
        # Process values as soon as they are returned:
        for results in multiprocessing_pool.imap_unordered(process_month, periods):
            for index, value in results:
                y.loc[index, "result"] = value
    # Write out new csv file:
    ...

This last version could be superior since it first reads the y.csv file before submitting tasks to the pool; depending on the platform and how it caches I/O operations, the worker function may not have to do any physical I/O to read its copies of the file. But that is one more 10 MB file that has been read into memory.
