How to write (save) PySpark dataframe containing vector column?
I'm trying to save a PySpark dataframe after transforming it with an ML Pipeline, but a strange error is triggered every time I save it. Here are the columns of this dataframe:

And the following error occurs when I try to write the dataframe in Parquet file format:
I tried different available winutils builds for Hadoop from this GitHub repository, but without much luck. Please help me with this. How can I save this dataframe so that I can read it in any other Jupyter notebook file? Feel free to ask any questions.

Note: I also tried to save a simple CSV file that does not contain vector data, but the same error persists.

Thanks
For Spark version 3.0.0 and up you can use pyspark.ml.functions.vector_to_array to convert the vector columns to array columns and then write:
from pyspark.ml.functions import vector_to_array

# Convert the vector columns to array<double> so they can be written to Parquet
df = df.withColumn("dest_fact", vector_to_array("dest_fact")) \
       .withColumn("features", vector_to_array("features"))

df.write.format("parquet").mode("overwrite").save("/file/output/path")