
How to write (save) PySpark dataframe containing vector column?

I'm trying to save a PySpark dataframe after transforming it with an ML Pipeline, but a strange error is triggered every time I save it. Here are the columns of this dataframe: [screenshot of the dataframe schema]

And the following error occurs when I try to write the dataframe to parquet file format: [screenshot of the error]

I tried different available winutils builds for Hadoop from this GitHub repository, but without much luck. Please help me in this regard. How can I save this dataframe so that I can read it in any other Jupyter notebook file? Feel free to ask any questions.

Note: I also tried to save simple CSV files which do not contain vector data, but the same error persists.

Thanks

For Spark version 3.0.0 and up, you can use pyspark.ml.functions.vector_to_array to convert the vector columns to array columns and then write.

from pyspark.ml.functions import vector_to_array

df = df.withColumn("dest_fact", vector_to_array("dest_fact"))\
    .withColumn("features", vector_to_array("features"))

df.write.format("parquet").mode("overwrite").save("/file/output/path")
