
PySpark: How to write a Spark dataframe having a column with type SparseVector into a CSV file?

I have a Spark dataframe which has one column with type spark.mllib.linalg.SparseVector:

1) How can I write it into a CSV file?

2) How can I print all the vectors?
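For context, here is a minimal sketch of the kind of dataframe the question describes; the column name features and the sample values are assumptions for illustration, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: the column name "features" and the values are illustrative assumptions
df = spark.createDataFrame(
    [("a", Vectors.sparse(4, [0, 2], [1.0, 5.0])),
     ("b", Vectors.sparse(4, [1], [3.0]))],
    ["id", "features"])

df.printSchema()  # the features column shows up with the vector UDT type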

To write the dataframe to a CSV file you can use the standard df.write.csv(output_path).

However, if you just use the above, you are likely to get the error java.lang.UnsupportedOperationException: CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type for the column with the SparseVector type.

There are two ways to print the SparseVector and avoid that error: the sparse format or the dense format.

If you want to print in the dense format, you can define a udf like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql.functions import col

dense_format_udf = udf(lambda x: ','.join([str(elem) for elem in x]), StringType())

df = df.withColumn('column_name', dense_format_udf(col('column_name')))

df.write.option("delimiter", "\t").csv(output_path)

The column outputs to something like this in the dense format: 1.0,0.0,5.0,0.0
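An equivalent way to build the same dense string, assuming the column really holds SparseVector values, is to go through toArray() first; this is a variant sketch, not part of the original answer:

# Variant sketch: dense output via toArray(); assumes the column holds SparseVector values
dense_format_udf = udf(lambda v: ','.join(str(x) for x in v.toArray()), StringType())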

If you want to print in the sparse format, you can utilize the OOB __str__ function of the SparseVector class, or be creative and define your own output format. Here I am going to use the OOB function.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql.functions import col

sparse_format_udf = udf(lambda x: str(x), StringType())

df = df.withColumn('column_name', sparse_format_udf(col('column_name')))

df.write.option("delimiter", "\t").csv(output_path)

The column prints to something like this in the sparse format: (4,[0,2],[1.0,5.0])
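If the CSV is read back later, the sparse string form can be turned into a vector again with Vectors.parse from pyspark.mllib.linalg; a small sketch, assuming the string was produced by the str(x) udf above:

from pyspark.mllib.linalg import Vectors

# Sketch: round-trip the sparse-format string back into a SparseVector
v = Vectors.parse("(4,[0,2],[1.0,5.0])")
print(v)  # (4,[0,2],[1.0,5.0])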

Note: I have tried this approach before: df = df.withColumn("column_name", col("column_name").cast("string")), but the column just prints to something like [0,5,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@6988050,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@ec4ae6ab], which is not desirable.

  1. https://github.com/databricks/spark-csv
  2. df2 = df1.map(lambda row: row.yourVectorCol)

     OR df1.map(lambda row: row[1])

     where you either have a named column or just refer to the column by its position in the row (a runnable sketch follows below).

     Then, to print it, you can df2.collect()

Without more information, this may be helpful to you, or not helpful enough. Please elaborate a bit.
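Note that in current PySpark a DataFrame no longer exposes map directly, so the same idea goes through the underlying RDD; a minimal sketch, assuming the vector column is named features:

# Sketch: pull the vector column out via the RDD API (assumes a column named "features")
vectors = df1.rdd.map(lambda row: row.features)
for v in vectors.collect():
    print(v)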
