
Is matplotlib savefig threadsafe?

I have an in-house distributed computing library that we use all the time for parallel computing jobs. After the processes are partitioned, they run their data loading and computation steps and then finish with a "save" step. Usually this involves writing data to database tables.

But for a specific task, I need the output of each process to be a .png file with some data plots. There are 95 processes in total, so 95 .pngs.

Inside of my "save" step (executed on each process), I have some very simple code that makes a boxplot with matplotlib's boxplot function and some code that uses savefig to write it to a .png file that has a unique name based on the specific data used in that process.

However, I occasionally see output where it appears that two or more sets of data were written into the same output file, despite the unique names.

Does matplotlib use temporary file saves when making boxplots or saving figures? If so, does it always use the same temp file names (thus leading to over-write conflicts)? I have run my process using strace and cannot see anything that obviously looks like temp file writing from matplotlib.

How can I ensure that this will be threadsafe? I definitely want to conduct the file saving in parallel, as I am looking to expand the number of output .pngs considerably, so the option of first storing all the data and then just serially executing the plot/save portion is very undesirable.

It's impossible for me to reproduce the full parallel infrastructure we are using, but below is the function that gets called to create the plot handle, and then the function that gets called to save the plot. You should assume for the sake of the question that the thread safety has nothing to do with our distributed library. We know it's not coming from our code, which has been used for years for our multiprocessing jobs without threading issues like this (especially not for something we don't directly control, like any temp files from matplotlib).

import pandas
import numpy as np
import matplotlib.pyplot as plt

def plot_category_data(betas, category_name):
    """
    Function to organize beta data by date into vectors and pass to box plot
    code for producing a single chart of multi-period box plots.
    """
    beta_vector_list = []
    yms = np.sort(betas.yearmonth.unique())
    for ym in yms:
        beta_vector_list.append(betas[betas.yearmonth==ym].Beta.values.flatten().tolist())
    ###

    plot_output = plt.boxplot(beta_vector_list)
    axs = plt.gcf().gca()
    axs.set_xticklabels(betas.FactorDate.unique(), rotation=40, horizontalalignment='right')
    axs.set_xlabel("Date")
    axs.set_ylabel("Beta")
    axs.set_title("%s Beta to BMI Global"%(category_name))
    axs.set_ylim((-1.0, 3.0))

    return plot_output
### End plot_category_data

def save(self):
    """
    Make calls to store the plot to the desired output file.
    """
    out_file = self.output_path + "%s.png"%(self.category_name)
    fig = plt.gcf()
    fig.set_figheight(6.5)
    fig.set_figwidth(10)
    fig.savefig(out_file, bbox_inches='tight', dpi=150)
    print("Finished and stored output file %s" % (out_file))
    return None
### End save

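Both of your functions call plt.gcf(), which returns pyplot's current figure. That figure is global state within a process, so if a process ever plots more than once without closing the figure, the new boxes are drawn onto the same axes as the old ones. I would create a new figure with plt.figure() every time you plot and reference that figure explicitly, which sidesteps the issue entirely.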
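A minimal sketch of that suggestion, assuming the per-date beta vectors and tick labels have already been extracted from the DataFrame. The parameter names beta_vector_list and date_labels, the standalone save_figure function, and the Agg backend are assumptions for illustration, not part of the original code. It uses plt.subplots(), which creates a new figure like plt.figure() and also returns its axes, so nothing depends on plt.gcf().

```python
import os

import matplotlib
matplotlib.use("Agg")  # assumption: headless worker processes, no GUI backend needed
import matplotlib.pyplot as plt


def plot_category_data(beta_vector_list, date_labels, category_name):
    """Draw the box plots on a figure created by this call and return it."""
    # A new figure and axes for every call, instead of pyplot's current figure.
    fig, ax = plt.subplots(figsize=(10, 6.5))
    ax.boxplot(beta_vector_list)
    ax.set_xticklabels(date_labels, rotation=40, horizontalalignment="right")
    ax.set_xlabel("Date")
    ax.set_ylabel("Beta")
    ax.set_title("%s Beta to BMI Global" % category_name)
    ax.set_ylim((-1.0, 3.0))
    return fig


def save_figure(fig, output_path, category_name):
    """Save the given figure explicitly, then close it to release pyplot state."""
    out_file = os.path.join(output_path, "%s.png" % category_name)
    fig.savefig(out_file, bbox_inches="tight", dpi=150)
    # Closing removes the figure from pyplot's registry, so a later task
    # running in the same process cannot draw onto it by accident.
    plt.close(fig)
    return out_file
```

Passing the figure object from the plotting step to the save step keeps each output file tied to exactly one set of data, even when a single process handles several categories in sequence.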
