
How to write (save) PySpark dataframe containing vector column?

I'm trying to save a PySpark dataframe after transforming it with an ML Pipeline, but a strange error is triggered every time I save it. Here are the columns of this dataframe: [screenshot of the dataframe schema]

And the following error occurs when I try to write the dataframe to parquet file format: [screenshot of the error]

I tried different available winutils builds for Hadoop from this GitHub repository, but without much luck. Please help me in this regard. How can I save this dataframe so that I can read it in any other Jupyter notebook file? Feel free to ask any questions.

Note: I also tried to save simple CSV files which do not contain vector data, but the same error persists.

Thanks

For Spark version 3.0.0 and up, you can use pyspark.ml.functions.vector_to_array to convert the vector columns to array columns and then write.

from pyspark.ml.functions import vector_to_array

df = df.withColumn("dest_fact", vector_to_array("dest_fact"))\
    .withColumn("features", vector_to_array("features"))

df.write.format("parquet").mode("overwrite").save("/file/output/path")
