multiprocessing / psycopg2 TypeError: can't pickle _thread.RLock objects

I followed the code below to implement a parallel select query on a Postgres database:

https://tech.geoblink.com/2017/07/06/parallelizing-queries-in-postgresql-with-python/

My basic problem is that I have ~6k queries that need to be executed, and I am trying to optimise their execution. Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but that query used > 4GB of RAM on the machine it ran on, so I split it into 6k individual queries, which, when run synchronously, keep memory usage steady. However, that takes a lot longer in wall-clock time, which is less of an issue for my use case. Even so, I am trying to reduce the time as much as possible.

This is what my code looks like:

import logging
import multiprocessing
from itertools import chain

import psycopg2
from sqlalchemy import create_engine

LOGGER = logging.getLogger(__name__)

class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.engine = self.init_connection()
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS)

    def init_connection(self):
        LOGGER.info('Creating Postgres engine')
        return create_engine(self.db_url)

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            self.pool.close()
            self.pool.join()

        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))

        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        con = psycopg2.connect(self.db_url)
        cur = con.cursor()
        cur.execute(query)
        records = cur.fetchall()
        con.close()

        return list(records)

However, whenever this runs, I get the following error:

TypeError: can't pickle _thread.RLock objects

I've read lots of similar questions regarding multiprocessing and pickleable objects, but I can't for the life of me figure out what I am doing wrong.

The pool is generally one per process (which I believe is best practice), but it is shared per instance of the connector class so that a new pool isn't created for each use of the run_parallel_queries method.

The top answer to a similar question:

Accessing a MySQL connection pool from Python multiprocessing

shows an almost identical implementation to my own, except using MySQL instead of Postgres.

Am I doing something wrong?

Thanks!

EDIT:

I've found this answer:

Python Postgres psycopg2 ThreadedConnectionPool exhausted

which is incredibly detailed and suggests I had misunderstood what multiprocessing.Pool gives me versus a connection pool such as ThreadedConnectionPool. However, the first link doesn't mention needing any connection pools. That solution seems good, but it is A LOT of code for what I think is a fairly simple problem?

EDIT 2:

So the above link solves another problem, which I would likely have run into anyway, so I'm glad I found it, but it doesn't solve the initial issue of not being able to use imap_unordered due to the pickling error. Very frustrating.

Lastly, I think it's probably worth noting that this runs on Heroku, on a worker dyno, using Redis rq for scheduling, background tasks, etc., with a hosted instance of Postgres as the database.

To put it simply, Postgres connections and SQLAlchemy connection pools are thread-safe; however, they are not fork-safe.

If you want to use multiprocessing, you should initialize the engine in each child process after the fork.

You should use multithreading instead if you want to share engines.
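A minimal sketch of that thread-based alternative (the helper name and worker count are illustrative, and the string-query style assumes SQLAlchemy 1.x; under 2.0 you would wrap the query in `text()`):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def run_threaded_queries(engine, queries, max_workers=8):
    """Threads may share one SQLAlchemy engine; its internal
    connection pool is thread-safe (though not fork-safe)."""
    def run_one(query):
        # each thread checks a connection out of the shared pool
        with engine.connect() as conn:
            return conn.execute(query).fetchall()

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order; flatten the per-query row lists
        return list(chain.from_iterable(executor.map(run_one, queries)))
```

Because the engine is created once in the parent and only ever used by threads, nothing needs to be pickled and the RLock never crosses a process boundary.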

Refer to Thread and process safety in the psycopg2 documentation:

libpq connections shouldn't be used by a forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork.

If you are using multiprocessing.Pool, there is a keyword argument initializer which can be used to run code once in each child process. Try this:

class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS, initializer=self.init_connection(self.db_url))

    @classmethod
    def init_connection(cls, db_url):
        def _init_connection():
            LOGGER.info('Creating Postgres engine')
            cls.engine = create_engine(db_url)
        return _init_connection

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            pass
            #self.pool.close()
            #self.pool.join()

        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))

        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        with self.engine.connect() as conn:
            with conn.begin():
                result = conn.execute(query)
                return result.fetchall()

    def __getstate__(self):
        # this is a hack, if you want to remove this method, you should
        # remove self.pool and just pass pool explicitly
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict

Now, to address the XY problem.

Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but that query used > 4GB of RAM on the machine it ran on, so I split it into 6k individual queries, which, when run synchronously, keep memory usage steady.

What you may want to do instead is one of these options:

  1. write a subquery that generates all 6000 IDs and use the subquery in your original bulk query
  2. as above, but write the subquery as a CTE
  3. if your ID list comes from an external source (i.e. not from the database), then you can create a temporary table containing the 6000 IDs and then run your original bulk query against the temporary table
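Option 3 might look like the following sketch with psycopg2's `execute_values`; the table name `big_table`, its `id` column, and the temp-table name are placeholders, not from the original question:

```python
# SQL for the temp-table approach; "big_table" / "id" are placeholder names.
CREATE_IDS_SQL = "CREATE TEMP TABLE wanted_ids (id bigint PRIMARY KEY) ON COMMIT DROP"
INSERT_IDS_SQL = "INSERT INTO wanted_ids (id) VALUES %s"
BULK_QUERY_SQL = (
    "SELECT t.* FROM big_table t "
    "JOIN wanted_ids w ON w.id = t.id"
)

def query_via_temp_table(db_url, ids):
    """Stage the IDs in a temp table, then run one bulk join against it,
    so the query never needs a 6000-element IN (...) list."""
    import psycopg2                      # imported lazily; requires psycopg2
    from psycopg2.extras import execute_values

    con = psycopg2.connect(db_url)
    try:
        with con, con.cursor() as cur:   # commits (and drops the temp table) on exit
            cur.execute(CREATE_IDS_SQL)
            execute_values(cur, INSERT_IDS_SQL, [(i,) for i in ids])
            cur.execute(BULK_QUERY_SQL)
            return cur.fetchall()
    finally:
        con.close()
```

The planner can then join against the staged IDs instead of parsing a huge literal list.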

However, if you insist on running 6000 IDs through Python, then the fastest approach is likely neither to do all 6000 IDs in one go (which will run out of memory) nor to run 6000 individual queries. Instead, you may want to chunk the queries. Send 500 IDs at once, for example. You will have to experiment with the chunk size to determine the largest number of IDs you can send at one time while staying comfortably within your memory budget.
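The chunking idea can be sketched as follows; the batch size of 500 and the `big_table` name are illustrative, and `id = ANY(%s)` lets psycopg2 pass each batch as a single Postgres array parameter:

```python
def chunked(seq, size):
    """Yield successive slices of `seq` with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def fetch_in_chunks(db_url, ids, chunk_size=500):
    """Run the bulk query in batches; "big_table" is a placeholder name."""
    import psycopg2                      # imported lazily; requires psycopg2

    results = []
    con = psycopg2.connect(db_url)
    try:
        with con.cursor() as cur:
            for batch in chunked(list(ids), chunk_size):
                # psycopg2 adapts the Python list to a Postgres array
                cur.execute("SELECT * FROM big_table WHERE id = ANY(%s)", (batch,))
                results.extend(cur.fetchall())
    finally:
        con.close()
    return results
```

Tuning `chunk_size` up or down trades per-query overhead against peak memory, which is exactly the experiment suggested above.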
