
PicklingError: Could not serialize object (happens only for large datasets)

Context: I am using pyspark.pandas in a Databricks Jupyter notebook.

What I have tested: I do not get any error if:

  • I run my code on 300 rows of data.

  • I simply replicate the dataset 2 times (600 rows by pd.concat).

I get an error if:

  • I simply replicate the dataset 10 times (3000 rows by pd.concat)

This makes me think the error is not code-specific; rather, Databricks might have some intricacy or limitation.

Can someone explain what might be happening? It's a very big repository, so I haven't included the full code.

Exact error: PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object

Full trace:

/dbfs/FileStore/shared_uploads/pipeline.py in apply_criteria(self)
    408 
    409             time1 = time.perf_counter()
--> 410             self.scores_df[ [f'{field}__{criteria}' for field in fields_for_criteria[criteria] ] ]= self.rem.apply(lambda x: self.apply_criteria_across_all_fields(x,criteria),axis=1,result_type="expand")
    411             time2 = time.perf_counter()
    412             print(time2 - time1)

/databricks/spark/python/pyspark/pandas/usage_logging/__init__.py in wrapper(*args, **kwargs)
    192             start = time.perf_counter()
    193             try:
--> 194                 res = func(*args, **kwargs)
    195                 logger.log_success(
    196                     class_name, function_name, time.perf_counter() - start, signature

/databricks/spark/python/pyspark/pandas/frame.py in apply(self, func, axis, args, **kwds)
   2555                 self_applied, apply_func, return_schema, retain_index=True
   2556             )
-> 2557             sdf = self_applied._internal.to_internal_spark_frame.mapInPandas(
   2558                 lambda iterator: map(output_func, iterator), schema=return_schema
   2559             )

/databricks/spark/python/pyspark/sql/pandas/map_ops.py in mapInPandas(self, func, schema)
     79         udf = pandas_udf(
     80             func, returnType=schema, functionType=PythonEvalType.SQL_MAP_PANDAS_ITER_UDF)
---> 81         udf_column = udf(*[self[col] for col in self.columns])
     82         jdf = self._jdf.mapInPandas(udf_column._jc.expr())
     83         return DataFrame(jdf, self.sql_ctx)

/databricks/spark/python/pyspark/sql/udf.py in wrapper(*args)
    197         @functools.wraps(self.func, assigned=assignments)
    198         def wrapper(*args):
--> 199             return self(*args)
    200 
    201         wrapper.__name__ = self._name

/databricks/spark/python/pyspark/sql/udf.py in __call__(self, *cols)
    175 
    176     def __call__(self, *cols):
--> 177         judf = self._judf
    178         sc = SparkContext._active_spark_context
    179         return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

/databricks/spark/python/pyspark/sql/udf.py in _judf(self)
    159         # and should have a minimal performance impact.
    160         if self._judf_placeholder is None:
--> 161             self._judf_placeholder = self._create_judf()
    162         return self._judf_placeholder
    163 

/databricks/spark/python/pyspark/sql/udf.py in _create_judf(self)
    168         sc = spark.sparkContext
    169 
--> 170         wrapped_func = _wrap_function(sc, self.func, self.returnType)
    171         jdt = spark._jsparkSession.parseDataType(self.returnType.json())
    172         judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(

/databricks/spark/python/pyspark/sql/udf.py in _wrap_function(sc, func, returnType)
     32 def _wrap_function(sc, func, returnType):
     33     command = (func, returnType)
---> 34     pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
     35     return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
     36                                   sc.pythonVer, broadcast_vars, sc._javaAccumulator)

/databricks/spark/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
   2848     # the serialized command will be compressed by broadcast
   2849     ser = CloudPickleSerializer()
-> 2850     pickled_command = ser.dumps(command)
   2851     if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc):  # Default 1M
   2852         # The broadcast will have same life cycle as created PythonRDD

/databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
    481                 msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
    482             print_exec(sys.stderr)
--> 483             raise pickle.PicklingError(msg)
    484 
    485 

PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object

Pickling would only arise if something is being serialized for Spark. Since this is pandas on Spark, I think the issue is that you're pickling a pandas-on-Spark DataFrame. You can't use that inside a Spark task, same as it has always been with Spark. If it's just a small lookup, make it a plain pandas DataFrame instead (with .toPandas()) and it should work.
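Below is a minimal sketch of that suggestion, with hypothetical names (the question's pipeline class and DataFrames are not shown). The lambda passed to apply in the traceback captures self, and anything reachable from it that holds Spark internals (a SparkSession, a pandas-on-Spark frame, a logger with a lock) cannot be pickled, which is what surfaces as "cannot pickle '_thread.RLock' object". The size threshold is also consistent with pandas-on-Spark's compute.shortcut_limit option (1000 rows by default): at or below it, apply can be computed locally in plain pandas and nothing needs to be serialized for Spark, which would explain why 300 and 600 rows pass while 3000 fail. Note that for a pandas-on-Spark frame the conversion method is .to_pandas(); .toPandas() is the pyspark.sql.DataFrame spelling.

import pandas as pd
import pyspark.pandas as ps

# Hypothetical stand-ins for the question's objects.
rem = ps.DataFrame({"value": range(3000)})                   # pandas-on-Spark frame being .apply()'d
lookup_psdf = ps.DataFrame({"k": [0, 1], "w": [1.0, 2.0]})   # small lookup table, also pandas-on-Spark

# Problematic pattern: the closure captures a pandas-on-Spark object (or an object
# like `self` that holds one, plus a SparkSession/logger). Once the data is large
# enough that Spark has to cloudpickle the closure, serialization fails with
# "cannot pickle '_thread.RLock' object".
# scores = rem["value"].apply(lambda x: x * lookup_psdf["w"][0])

# Fix along the lines of the answer: materialize the small lookup as plain pandas
# first, so the closure only captures ordinary, picklable Python objects.
lookup_pdf: pd.DataFrame = lookup_psdf.to_pandas()
weight = float(lookup_pdf["w"].iloc[0])
scores = rem["value"].apply(lambda x: x * weight)
print(scores.head())

Applied to the question's code, the same idea means not letting the lambda in apply_criteria capture the whole self: pass in only the plain pandas objects or primitive values that the row function actually needs.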
