PicklingError: Could not serialize object (happens only for large datasets)
Context: I am using pyspark.pandas in a Databricks Jupyter notebook.
What I have tested: I do not get any error if:
I run my code on 300 rows of data.
I simply replicate the dataset 2 times (600 rows, via pd.concat).
I get an error if:
I run my code on the full, much larger dataset.
This makes me think the error is not specific to my code; rather, Databricks (or Spark) may have some intricacy or limitation here.
Can someone explain what might be happening? It's a very big repository, so I haven't included the full code.
Exact error: PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
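The nested TypeError is ordinary CPython behavior, not anything Databricks-specific; a minimal sketch that reproduces it outside Spark:

import pickle
import threading

# Lock objects wrap OS-level state and can never be pickled; this raises
# TypeError: cannot pickle '_thread.RLock' object
pickle.dumps(threading.RLock())

Anything that transitively holds such a lock (a SparkSession, a logger, a client handle) fails the same way when cloudpickle tries to serialize it.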
Full trace:
/dbfs/FileStore/shared_uploads/pipeline.py in apply_criteria(self)
408
409 time1 = time.perf_counter()
--> 410 self.scores_df[ [f'{field}__{criteria}' for field in fields_for_criteria[criteria] ] ]= self.rem.apply(lambda x: self.apply_criteria_across_all_fields(x,criteria),axis=1,result_type="expand")
411 time2 = time.perf_counter()
412 print(time2 - time1)
/databricks/spark/python/pyspark/pandas/usage_logging/__init__.py in wrapper(*args, **kwargs)
192 start = time.perf_counter()
193 try:
--> 194 res = func(*args, **kwargs)
195 logger.log_success(
196 class_name, function_name, time.perf_counter() - start, signature
/databricks/spark/python/pyspark/pandas/frame.py in apply(self, func, axis, args, **kwds)
2555 self_applied, apply_func, return_schema, retain_index=True
2556 )
-> 2557 sdf = self_applied._internal.to_internal_spark_frame.mapInPandas(
2558 lambda iterator: map(output_func, iterator), schema=return_schema
2559 )
/databricks/spark/python/pyspark/sql/pandas/map_ops.py in mapInPandas(self, func, schema)
79 udf = pandas_udf(
80 func, returnType=schema, functionType=PythonEvalType.SQL_MAP_PANDAS_ITER_UDF)
---> 81 udf_column = udf(*[self[col] for col in self.columns])
82 jdf = self._jdf.mapInPandas(udf_column._jc.expr())
83 return DataFrame(jdf, self.sql_ctx)
/databricks/spark/python/pyspark/sql/udf.py in wrapper(*args)
197 @functools.wraps(self.func, assigned=assignments)
198 def wrapper(*args):
--> 199 return self(*args)
200
201 wrapper.__name__ = self._name
/databricks/spark/python/pyspark/sql/udf.py in __call__(self, *cols)
175
176 def __call__(self, *cols):
--> 177 judf = self._judf
178 sc = SparkContext._active_spark_context
179 return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
/databricks/spark/python/pyspark/sql/udf.py in _judf(self)
159 # and should have a minimal performance impact.
160 if self._judf_placeholder is None:
--> 161 self._judf_placeholder = self._create_judf()
162 return self._judf_placeholder
163
/databricks/spark/python/pyspark/sql/udf.py in _create_judf(self)
168 sc = spark.sparkContext
169
--> 170 wrapped_func = _wrap_function(sc, self.func, self.returnType)
171 jdt = spark._jsparkSession.parseDataType(self.returnType.json())
172 judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
/databricks/spark/python/pyspark/sql/udf.py in _wrap_function(sc, func, returnType)
32 def _wrap_function(sc, func, returnType):
33 command = (func, returnType)
---> 34 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
35 return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
36 sc.pythonVer, broadcast_vars, sc._javaAccumulator)
/databricks/spark/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
2848 # the serialized command will be compressed by broadcast
2849 ser = CloudPickleSerializer()
-> 2850 pickled_command = ser.dumps(command)
2851 if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc): # Default 1M
2852 # The broadcast will have same life cycle as created PythonRDD
/databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
481 msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
482 print_exec(sys.stderr)
--> 483 raise pickle.PicklingError(msg)
484
485
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
Pickling arises whenever something has to be serialized for Spark. Since this is pandas on Spark, the likely issue is that you are pickling a pandas-on-Spark DataFrame: the lambda passed to apply closes over self, and self holds pandas-on-Spark frames. You can't use one of those inside a Spark task, same as it's always been with Spark; its backing SparkSession contains a _thread.RLock, which is exactly what the TypeError complains about. If the captured frame is just a small lookup, make it a plain pandas DataFrame first with .to_pandas() and it should work. This would also explain why only the large dataset fails: DataFrame.apply in pyspark.pandas short-circuits small inputs (up to the compute.shortcut_limit option, 1000 rows by default) and runs them as plain pandas on the driver, so the 300- and 600-row tests likely never reach the serialization path at all.
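A minimal sketch of the failing pattern and the suggested fix. The names (lookup_psdf, rem, score_row) are hypothetical stand-ins for the question's attributes, and it assumes an active SparkSession, as in a Databricks notebook:

import pyspark.pandas as ps

# A small lookup table and the frame being scored (stand-ins for the
# question's self.scores_df / self.rem).
lookup_psdf = ps.DataFrame({"field": ["a", "b"], "weight": [1.0, 2.0]})
rem = ps.DataFrame({"field": ["a", "a", "b"], "value": [1, 2, 3]})

def score_row(row, lookup):
    # Toy scoring logic, for illustration only.
    return float(lookup.loc[lookup["field"] == row["field"], "weight"].sum())

# FAILS on large frames: the lambda closes over lookup_psdf, a pandas-on-Spark
# frame whose backing SparkSession holds a _thread.RLock, so cloudpickle
# cannot serialize the function for the executors.
# rem.apply(lambda row: score_row(row, lookup_psdf), axis=1)

# WORKS: materialize the small lookup as plain pandas first; plain pandas
# objects pickle fine and travel to the executors with the function.
lookup_pdf = lookup_psdf.to_pandas()
scores = rem.apply(lambda row: score_row(row, lookup_pdf), axis=1)
print(scores.head())

If the lookup is too big to collect to the driver, the cleaner route is a proper Spark join rather than capturing the frame inside the applied function.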