Parallelizing a for loop in python

I have a dictionary where each key (date) contains a table (multiple lists of the form [day1, val11, val21], [day2, val12, val22], [day3, val13, val23], ...). I want to transform it into a DataFrame; this is done with the following code:

import pandas as pd

df4 = pd.DataFrame(columns=sorted(set_days))

for date in dic.keys():
    days = [day  for day, val1, val2 in dic[date]]
    val1 = [val1 for day, val1, val2 in dic[date]]
    df4.loc[date, days] = val1

This code works fine, but it takes more than two hours to run. After some research, I realized I could parallelize it with the multiprocessing library; the following code is the intended parallel version:

import multiprocessing

def func(date):
    global df4, dic
    days = [day  for day, val1, val2  in dic[date]]
    val1 = [val1 for day, val1, val2  in dic[date]]
    df4.loc[date, days] = val1

multiprocessing.Pool(processes=8).map(func, dic.keys())

The problem with this code is that, after the multiprocessing.Pool(processes=8).map(...) call executes, the df4 DataFrame is empty.

Any help would be much appreciated.

Example

Suppose the dictionary contains two dates:

dic['20030812'][:4]
Out: [[1, 24.25, 0.0], [20, 23.54, 23.54], [30, 23.13, 24.36], [50, 22.85, 23.57]]

dic['20030813'][:4]
Out: [[1, 24.23, 0.0], [19, 23.4, 22.82], [30, 22.97, 24.19], [49, 22.74, 23.25]]

then the DataFrame should be of the form:

df4.loc[:, 1:50]
             1    2    3    4    5   ...     46   47   48     49     50
20030812  24.25  NaN  NaN  NaN  NaN  ...    NaN  NaN  NaN    NaN  22.85
20030813  24.23  NaN  NaN  NaN  NaN  ...    NaN  NaN  NaN  22.74    NaN

Also,

dic.keys()
Out[36]: dict_keys(['20030812', '20030813'])

df4.head().to_dict()
Out: 
{1: {'20030812': 24.25, '20030813': 24.23},
 2: {'20030812': nan, '20030813': nan},
 3: {'20030812': nan, '20030813': nan},
 4: {'20030812': nan, '20030813': nan},
 5: {'20030812': nan, '20030813': nan},
 6: {'20030812': nan, '20030813': nan},
 7: {'20030812': nan, '20030813': nan},
 8: {'20030812': nan, '20030813': nan},
 9: {'20030812': nan, '20030813': nan},
 10: {'20030812': nan, '20030813': nan},
 11: {'20030812': nan, '20030813': nan},
 12: {'20030812': nan, '20030813': nan},
 13: {'20030812': nan, '20030813': nan},
 14: {'20030812': nan, '20030813': nan},
 15: {'20030812': nan, '20030813': nan},
 16: {'20030812': nan, '20030813': nan},
 17: {'20030812': nan, '20030813': nan},
 18: {'20030812': nan, '20030813': nan},
 19: {'20030812': nan, '20030813': 23.4},
 20: {'20030812': 23.54, '20030813': nan},
 21: {'20030812': nan, '20030813': nan},
 22: {'20030812': nan, '20030813': nan},
 23: {'20030812': nan, '20030813': nan},
 24: {'20030812': nan, '20030813': nan},
 25: {'20030812': nan, '20030813': nan},
 26: {'20030812': nan, '20030813': nan},
 27: {'20030812': nan, '20030813': nan},
 28: {'20030812': nan, '20030813': nan},
 29: {'20030812': nan, '20030813': nan},
 30: {'20030812': 23.13, '20030813': 22.97},
 31: {'20030812': nan, '20030813': nan},
 32: {'20030812': nan, '20030813': nan},
 ...

To answer your original question (roughly: "why is the df4 DataFrame empty?"): the reason this doesn't work is that when the Pool workers are launched, each worker inherits a personal copy-on-write view of the parent's data (either directly, if multiprocessing is running on a UNIX-like system with fork, or via a kludgy approach that simulates it when running on Windows).

Thus, when each worker does:

 df4.loc[date, days] = val1

it's mutating the worker's personal copy of df4; the parent process's copy remains untouched.
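This is easy to demonstrate with a minimal, self-contained sketch (hypothetical names, not from your code): each worker appends to a module-level list, yet the parent's copy is still empty after the map completes.

import multiprocessing

results = []  # module-level state, copied into each worker process

def worker(x):
    results.append(x)  # mutates the worker's private copy only

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(worker, range(10))
    print(results)  # prints [] -- the parent's list was never touched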

In general, there are three ways to handle this:

  1. Change your worker function to return something that can be used in the parent process. For example, instead of trying to perform the in-place mutation df4.loc[date, days] = val1 in the worker, return what's necessary to do it in the parent, e.g. return date, days, val1, then change the parent to (a fuller sketch follows this list):

     for date, days, val in multiprocessing.Pool(processes=8).map(func, dic.keys()):
         df4.loc[date, days] = val

    Downside to this approach is that it requires each of the return values to be pickled (Python's version of serialization), piped from child to parent, and unpickled; if the worker task doesn't do very much work, and especially if the return values are large (which seems to be the case here), it can easily spend more time on serialization and IPC than it gains from parallelism.

  2. Use shared objects/memory (demonstrated in this answer to "Multiprocessing writing to pandas dataframe"). In practice, this usually doesn't gain you much, since anything that isn't based on the more "raw" ctypes sharing via multiprocessing.sharedctypes is still ultimately going to need to pipe data from one process to another; sharedctypes-based approaches can get a meaningful speed boost though, since once mapped, shared raw C arrays are nearly as fast to access as local memory.

  3. If the work being parallelized is I/O bound, or uses third-party C extensions for CPU-bound work (e.g. numpy), you may be able to get the required speed boost from threads, despite GIL interference, since threads do share the same memory. Your case doesn't appear to be either I/O bound or meaningfully dependent on third-party C extensions that might release the GIL, so it probably won't help here, but in general, the simple way to switch from process-based parallelism to thread-based parallelism (when you're already using multiprocessing) is to change the import from:

     import multiprocessing 

    to

     import multiprocessing.dummy as multiprocessing 

    which imports the thread-backed version of multiprocessing under the expected name, so code seamlessly switches from using processes to threads.
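Putting option 1 together, a minimal runnable sketch might look like the following (assuming dic and set_days are defined at module level as in your question; the __main__ guard is required for multiprocessing on platforms without fork):

import multiprocessing
import pandas as pd

def func(date):
    # Workers only read dic and return results instead of mutating df4
    days = [day  for day, val1, val2 in dic[date]]
    val1 = [val1 for day, val1, val2 in dic[date]]
    return date, days, val1

if __name__ == '__main__':
    df4 = pd.DataFrame(columns=sorted(set_days))
    with multiprocessing.Pool(processes=8) as pool:
        # All mutation happens in the parent, so the writes are kept
        for date, days, val in pool.map(func, dic.keys()):
            df4.loc[date, days] = val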

As RafaelC hinted, it was an XY problem. I've been able to reduce the execution time to 20 seconds without multiprocessing.

I created a list, lista, that replaces the dictionary, and, rather than adding a row to the df4 DataFrame for each date, I fill lista completely and then transform it into a DataFrame in one step.

import numpy as np
import pandas as pd

# Returns the largest day number across all dates
# (each date has a different number of days)
def longest_series(dic):
    largest_series = 0
    for date in dic.keys():
        # day number of the last row for this date
        current_series = dic[date][-1][0]
        if largest_series < current_series:
            largest_series = current_series
    return largest_series


ls = longest_series(dic)
l_total_days = list(range(1, ls + 1))
s_total_days = set(l_total_days)

# Build lista, which is similar to dic except that every date has the
# same number of days (1 to ls) and the dates themselves are dropped.

# It takes 15 seconds
lista = list()
for date in dic.keys():
    present_days = list()
    present_values = list()
    for day, val_252, _ in dic[date]:
        present_days.append(day)
        present_values.append(val_252)

    missing_days = list(s_total_days.difference(set(present_days))) # extra days added to date
    missing_values = [None] * len(missing_days)                     # extra values added to date
    all_days_index = list(np.argsort(present_days + missing_days))  # preserve the pairing of days and values
    all_day_values = present_values + missing_values
    lista.append(list(np.array(all_day_values)[all_days_index]))


# It takes 4 seconds
df = pd.DataFrame(lista, index=dic.keys(), columns=l_total_days)
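As a side note, an arguably simpler construction (a hedged sketch, not benchmarked against the loop above) is to build one pandas Series per date and let pandas align the days itself; reindex then pads every row out to the full 1..ls range with NaN:

# One Series per date, indexed by day; DataFrame aligns the union of days
df_alt = pd.DataFrame({date: pd.Series({day: v for day, v, _ in rows})
                       for date, rows in dic.items()}).T
df_alt = df_alt.reindex(columns=l_total_days)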
