
PySpark dataframe.foreach() with HappyBase connection pool returns 'TypeError: can't pickle thread.lock objects'

I have a PySpark job that updates some objects in HBase (Spark v1.6.0; happybase v0.9).

It sort-of works if I open/close an HBase connection for each row:

def process_row(row):
    conn = happybase.Connection(host=[hbase_master])
    # update HBase record with data from row
    conn.close()

my_dataframe.foreach(process_row)

After a few thousand upserts, we start to see errors like this:

 TTransportException: Could not connect to [hbase_master]:9090 

Obviously, it's inefficient to open/close a connection for each upsert. This function is really just a placeholder for a proper solution.

I then tried to create a version of the process_row function that uses a connection pool:

pool = happybase.ConnectionPool(size=20, host=[hbase_master])

def process_row(row):
    with pool.connection() as conn:
        # update HBase record with data from row
        pass

For some reason, the connection pool version of this function returns an error (see complete error message):

 TypeError: can't pickle thread.lock objects 

Can you see what I'm doing wrong?

Update

I saw this post and suspect I'm experiencing the same issue: Spark attempts to serialize the pool object and distribute it to each of the executors, but this connection pool object cannot be shared across multiple executors.

It sounds like I need to split the dataset into partitions, and use one connection per partition (see design patterns for using foreachRDD). I tried this, based on an example in the documentation:

def persist_to_hbase(dataframe_partition):
    hbase_connection = happybase.Connection(host=[hbase_master])
    for row in dataframe_partition:
        # persist data
        pass
    hbase_connection.close()

my_dataframe.foreachPartition(lambda dataframe_partition: persist_to_hbase(dataframe_partition))

Unfortunately, it still returns a "can't pickle thread.lock objects" error.

down the line happybase connections are just tcp connections so they cannot be shared between processes. a connection pool is primarily useful for multi-threaded applications and also proves useful for single-threaded applications that can use the pool as a global "connection factory" with connection reuse, which may simplify code because no "connection" objects need to be passed around. it also makes error recovery a bit easier.

in any case a pool (which is just a group of connections) cannot be shared between processes. trying to serialise it does not make sense for that reason. (pools use locks which causes serialisation to fail but that is just a symptom.)
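
the symptom itself is easy to reproduce outside spark: pickling anything that holds a thread lock fails the same way. a rough sketch (the exact error wording depends on the python/pickle version):

import pickle
import threading

# stand-in for a connection pool's internals: it holds a lock
class FakePool(object):
    def __init__(self):
        self._lock = threading.Lock()

try:
    pickle.dumps(FakePool())
except TypeError as exc:
    # prints "can't pickle thread.lock objects" on python 2,
    # "cannot pickle '_thread.lock' object" on python 3 --
    # the same failure as above
    print(exc)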

perhaps you can use a helper that conditionally creates a pool (or connection) and stores it as a module-local variable, instead of instantiating it upon import, eg

_pool = None

def get_pool():
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=[hbase_master])
    return _pool

def process(...):
    with get_pool().connection() as connection:
        connection.table(...).put(...)

this instantiates the pool/connection upon first use instead of at import time.
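
putting the pieces together, one way to combine the lazy helper with the question's foreachPartition approach, so each executor process builds its own pool on first use. this is an untested sketch; the host, table and column names are placeholders:

# hypothetical module, e.g. hbase_writer.py, available on the executors
import happybase

_pool = None

def get_pool():
    # create the pool lazily inside whichever process calls this,
    # so the driver never has to pickle it
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=[hbase_master])  # placeholder host, as in the question
    return _pool

def persist_partition(rows):
    with get_pool().connection() as connection:
        table = connection.table('my_table')  # placeholder table name
        for row in rows:
            # placeholder upsert, as in the question
            table.put(row.rowkey, {'cf:col': row.value})

# on the driver:
# my_dataframe.foreachPartition(persist_partition)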

