
PicklingError: Could not serialize object (happens only for large datasets)

Context: I am using pyspark.pandas in a Databricks Jupyter notebook.

What I have tested: I do not get any error if:

  • I run my code on 300 rows of data.

  • I simply replicate the dataset 2 times (600 rows by pd.concat).

I get an error if:

  • I simply replicate the dataset 10 times (3000 rows by pd.concat)

This makes me think the error is not code-specific; rather, Databricks might have some intricacy or limitation.

Can someone explain what might be happening? It's a very big repository, so I haven't included the full code.

Exact error: PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object

Full trace:

/dbfs/FileStore/shared_uploads/pipeline.py in apply_criteria(self)
    408 
    409             time1 = time.perf_counter()
--> 410             self.scores_df[ [f'{field}__{criteria}' for field in fields_for_criteria[criteria] ] ]= self.rem.apply(lambda x: self.apply_criteria_across_all_fields(x,criteria),axis=1,result_type="expand")
    411             time2 = time.perf_counter()
    412             print(time2 - time1)

/databricks/spark/python/pyspark/pandas/usage_logging/__init__.py in wrapper(*args, **kwargs)
    192             start = time.perf_counter()
    193             try:
--> 194                 res = func(*args, **kwargs)
    195                 logger.log_success(
    196                     class_name, function_name, time.perf_counter() - start, signature

/databricks/spark/python/pyspark/pandas/frame.py in apply(self, func, axis, args, **kwds)
   2555                 self_applied, apply_func, return_schema, retain_index=True
   2556             )
-> 2557             sdf = self_applied._internal.to_internal_spark_frame.mapInPandas(
   2558                 lambda iterator: map(output_func, iterator), schema=return_schema
   2559             )

/databricks/spark/python/pyspark/sql/pandas/map_ops.py in mapInPandas(self, func, schema)
     79         udf = pandas_udf(
     80             func, returnType=schema, functionType=PythonEvalType.SQL_MAP_PANDAS_ITER_UDF)
---> 81         udf_column = udf(*[self[col] for col in self.columns])
     82         jdf = self._jdf.mapInPandas(udf_column._jc.expr())
     83         return DataFrame(jdf, self.sql_ctx)

/databricks/spark/python/pyspark/sql/udf.py in wrapper(*args)
    197         @functools.wraps(self.func, assigned=assignments)
    198         def wrapper(*args):
--> 199             return self(*args)
    200 
    201         wrapper.__name__ = self._name

/databricks/spark/python/pyspark/sql/udf.py in __call__(self, *cols)
    175 
    176     def __call__(self, *cols):
--> 177         judf = self._judf
    178         sc = SparkContext._active_spark_context
    179         return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

/databricks/spark/python/pyspark/sql/udf.py in _judf(self)
    159         # and should have a minimal performance impact.
    160         if self._judf_placeholder is None:
--> 161             self._judf_placeholder = self._create_judf()
    162         return self._judf_placeholder
    163 

/databricks/spark/python/pyspark/sql/udf.py in _create_judf(self)
    168         sc = spark.sparkContext
    169 
--> 170         wrapped_func = _wrap_function(sc, self.func, self.returnType)
    171         jdt = spark._jsparkSession.parseDataType(self.returnType.json())
    172         judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(

/databricks/spark/python/pyspark/sql/udf.py in _wrap_function(sc, func, returnType)
     32 def _wrap_function(sc, func, returnType):
     33     command = (func, returnType)
---> 34     pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
     35     return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
     36                                   sc.pythonVer, broadcast_vars, sc._javaAccumulator)

/databricks/spark/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
   2848     # the serialized command will be compressed by broadcast
   2849     ser = CloudPickleSerializer()
-> 2850     pickled_command = ser.dumps(command)
   2851     if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc):  # Default 1M
   2852         # The broadcast will have same life cycle as created PythonRDD

/databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
    481                 msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
    482             print_exec(sys.stderr)
--> 483             raise pickle.PicklingError(msg)
    484 
    485 

PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object

Pickling would only arise if something is being serialized for Spark. Since this is pandas on Spark, I think the issue is that you're pickling a pandas-on-Spark DataFrame. You can't use that inside a Spark task, same as it has always been with Spark. If it's just a small lookup, make it a plain pandas DataFrame instead (with .toPandas()) and it should work.
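Below is a minimal sketch of that suggestion, with hypothetical names (the question's pipeline class and DataFrames are not shown). The lambda passed to apply in the traceback captures self, and anything reachable from it that holds Spark internals (a SparkSession, a pandas-on-Spark frame, a logger with a lock) cannot be pickled, which is what surfaces as "cannot pickle '_thread.RLock' object". The size threshold is also consistent with pandas-on-Spark's compute.shortcut_limit option (1000 rows by default): at or below it, apply can be computed locally in plain pandas and nothing needs to be serialized for Spark, which would explain why 300 and 600 rows pass while 3000 fail. Note that for a pandas-on-Spark frame the conversion method is .to_pandas(); .toPandas() is the pyspark.sql.DataFrame spelling.

import pandas as pd
import pyspark.pandas as ps

# Hypothetical stand-ins for the question's objects.
rem = ps.DataFrame({"value": range(3000)})                   # pandas-on-Spark frame being .apply()'d
lookup_psdf = ps.DataFrame({"k": [0, 1], "w": [1.0, 2.0]})   # small lookup table, also pandas-on-Spark

# Problematic pattern: the closure captures a pandas-on-Spark object (or an object
# like `self` that holds one, plus a SparkSession/logger). Once the data is large
# enough that Spark has to cloudpickle the closure, serialization fails with
# "cannot pickle '_thread.RLock' object".
# scores = rem["value"].apply(lambda x: x * lookup_psdf["w"][0])

# Fix along the lines of the answer: materialize the small lookup as plain pandas
# first, so the closure only captures ordinary, picklable Python objects.
lookup_pdf: pd.DataFrame = lookup_psdf.to_pandas()
weight = float(lookup_pdf["w"].iloc[0])
scores = rem["value"].apply(lambda x: x * weight)
print(scores.head())

Applied to the question's code, the same idea means not letting the lambda in apply_criteria capture the whole self: pass in only the plain pandas objects or primitive values that the row function actually needs.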
