Creating plots with multiprocessing and time.strftime() doesn't work properly

I am trying to create plots with a script that runs in parallel using multiprocessing. I created two example scripts for this question, because the actual main script with the computing part would be too long. In script0.py you can see the multiprocessing part, where I start script1.py four times in parallel. In this example it just creates some random scatter plots.

script0.py:

import multiprocessing as mp
import os

def execute(process):
    os.system(f"python {process}")



if __name__ == "__main__":

    proc_num = 4
    process= []

    for _ in range(proc_num):
        process.append("script1.py")

    process_pool = mp.Pool(processes= proc_num)
    process_pool.map(execute, process)

script1.py:

# just a random scatterplot, but works for my example
import time
import numpy as np
import matplotlib.pyplot as plt
import os

dir_name = "stackoverflow_question"
plot_name = time.strftime("Plot %Hh%Mm%Ss")      # note the time.strftime() function

if not os.path.exists(f"{dir_name}"):
    os.mkdir(f"{dir_name}")

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)

area = (30 * np.random.rand(N))**2

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
# plt.show()
plt.savefig(f"{dir_name}/{plot_name}", dpi=300)

The important thing is that I am naming the plot by its creation time:

plot_name = time.strftime("Plot %Hh%Mm%Ss")

This creates a string like "Plot 16h39m22s". So far so good. Now to my actual problem: when the processes are started in parallel, the plot names are sometimes identical, because the timestamps created by time.strftime() are the same. As a result, one instance of script1.py can overwrite the plot already created by another.

In my working script, where I have this exact problem, I generate a lot of data, so I need to name my plots and CSVs according to the date and time they were generated.

I already thought of passing a variable down to script1.py when it gets called, but I don't know how to do that, since I only just learned about the multiprocessing library. This variable would have to vary as well, otherwise I would run into the same problem.

Does anybody have a better idea of how I could do this? Thank you so much in advance.

I propose these approaches:

  • Approach 1: (simple and recommended) if you can change the name, I recommend using Unix time (e.g. time.time() or time.time_ns()) instead of the date, or adding decimals to the seconds. This makes a collision almost impossible.
  • Approach 2: Add the process ID to the filename (e.g. <filename_timestamp_processid>). This way, even if two processes write at the same time, the process ID distinguishes the files. If you want to remove the ID from the names at the end of execution, read the filenames and do a merge; if there are collisions, adjust the filenames in an appropriate way.
  • Approach 3: like approach 2, but instead of changing the name you create a folder named after the process ID, in which you put the outputs of that process. At the end of execution you merge the folders and correct any collisions.
  • Approach 4: (not recommended, difficult to manage and affects performance) shared memory. You use a variable in shared memory with the last timestamp and check that the
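
Approaches 1 to 3 fit in a few lines of standard-library code. The following is a minimal sketch, not a definitive implementation: the helper names unique_plot_name and process_output_dir are hypothetical, and time.time_ns() assumes Python 3.7 or later.

```python
import os
import time


def unique_plot_name(prefix="Plot"):
    """Approaches 1 and 2: readable time + nanoseconds since the epoch + process ID."""
    stamp = time.strftime("%Hh%Mm%Ss")
    # Two processes started in the same second still differ in their PID,
    # and almost always in the nanosecond counter as well.
    return f"{prefix} {stamp}_{time.time_ns()}_pid{os.getpid()}"


def process_output_dir(base_dir):
    """Approach 3: one sub-folder per process, named after the process ID."""
    path = os.path.join(base_dir, str(os.getpid()))
    # exist_ok=True avoids an error if the folder (or base_dir) already exists,
    # for example when another process created base_dir a moment earlier.
    os.makedirs(path, exist_ok=True)
    return path
```

In script1.py this would replace the plain time.strftime() call, for example plot_name = unique_plot_name(), optionally combined with dir_name = process_output_dir("stackoverflow_question").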

Welcome to the site. A couple of ideas...

First, you are not following the guidelines in the multiprocessing module on how to use Pool. You should use it as a context manager, with (...) ...

There are many examples out there. See the warning in red in the docs:

https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool

Also, using os.system calls is a little odd/unsafe. Why don't you put your plotting routine into a standard function, in the same module or a different one, and import it? That would allow you to pass additional info (like a good label) to the function. I would expect something like this, where source is a data file or external source...

def make_plot(source, output_file_name, plot_label):
    # read the data source
    # make the plot
    # save it to the output path...
    pass  # placeholder so that the stub is valid Python
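
If the plotting code has to stay a standalone script, extra information can still be handed over on the command line. This is only a sketch under that assumption: build_command and the two-argument execute are hypothetical helpers, and subprocess.run with an argument list stands in for the os.system call in script0.py.

```python
import subprocess
import sys


def build_command(script, label):
    """Return the argument list that runs `script` with `label` as its first argument."""
    # sys.executable is the interpreter running this program, so the child
    # process uses the same Python installation.
    return [sys.executable, script, label]


def execute(script, label):
    """Run the script; raises CalledProcessError if it exits with a non-zero status."""
    # Passing a list (no shell) avoids quoting problems with spaces in the label.
    subprocess.run(build_command(script, label), check=True)
```

Inside script1.py the label would then be available as sys.argv[1], and the pool could be driven with starmap over pairs such as ("script1.py", "worker0"), ("script1.py", "worker1"), and so on.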

As far as the label is concerned, there will of course be overlap if you start these processes within the same second. You can either append the process number to the label, or some other piece of info such as something from the data source, or use the same timestamp but put the output in unique folders, as suggested in the other answer.

I would think something like this...

Code:

from multiprocessing import Pool
import time

def f(data, output_folder, label):
    # here data is just an integer, in yours, it would be the source of the graph data...
    val = data * data
    # the below is just example...  you could just use your folder making/saving routine...
    return f'now we can save {label} in folder {output_folder} with value: {val}'

if __name__ == '__main__':
    with Pool(5) as p:
        folders = ['data1', 'data2', 'data3']
        labels = [time.strftime("Plot %Hh%Mm%Ss")]*3
        x_s = [1, 2, 3]
        output = p.starmap(f, zip(x_s, folders, labels))
        for result in output:
            print(result)

Output:

now we can save Plot 08h55m17s in folder data1 with value: 1
now we can save Plot 08h55m17s in folder data2 with value: 4
now we can save Plot 08h55m17s in folder data3 with value: 9
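
All three tasks share one label here, which is fine only because each one writes to its own folder. If the outputs have to land in a single folder, the label itself needs a per-task part. A minimal sketch, where unique_labels is a hypothetical helper and the index suffix is just one possible choice:

```python
import time


def unique_labels(count, fmt="Plot %Hh%Mm%Ss"):
    """Return `count` labels that share one timestamp but carry distinct task indices."""
    stamp = time.strftime(fmt)  # taken once, in the parent process
    return [f"{stamp}_{index}" for index in range(count)]
```

In the code above, labels = unique_labels(3) would replace the line that repeats the same timestamp three times.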
