
How to extract floats from vector columns in PySpark?

My Spark DataFrame holds its data in vector columns. printSchema() shows that each column is of type vector.

I tried to get the values out of the [ and ] brackets using the code below (for one column, col1):

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# UDF that returns the first element of a vector as a Python float
firstelement = udf(lambda v: float(v[0]), FloatType())
df.select(firstelement('col1')).show()

However, how can I apply it to all columns of df?

1. Extract first element of a single vector column:

To get the first element of a vector column, you can use the answer from this SO discussion: Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)

Here's a reproducible example:

>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType
>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
                                {"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
|  col1|  col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+

>>> firstelement=f.udf(lambda v:float(v[0]),FloatType())
>>> df.withColumn("col1", firstelement("col1")).show()
+----+------+
|col1|  col2|
+----+------+
| 0.2|[0.25]|
|0.45|[0.85]|
+----+------+
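If your columns are genuine ML vectors (VectorUDT, e.g. the output of VectorAssembler) and you are on Spark 3.0+, you can also avoid the UDF with pyspark.ml.functions.vector_to_array. A minimal sketch, assuming such a vector column (note that the df above actually holds plain arrays, which is why the UDF works on it directly):

>>> from pyspark.ml.functions import vector_to_array
>>> # convert the vector to an array column, then take its first element
>>> df.withColumn("col1", vector_to_array("col1")[0]).show()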

2. Extract first element of multiple vector columns:

To generalize the above solution to multiple columns, loop over the columns (here, with a list comprehension). Here's an example:

>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType

>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
                                {"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
|  col1|  col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+

>>> firstelement=f.udf(lambda v:float(v[0]),FloatType())
>>> df = df.select([firstelement(c).alias(c) for c in df.columns])
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 0.2|0.25|
|0.45|0.85|
+----+----+
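As a side note: since this reproducible df was built from Python lists, its columns are actually array&lt;double&gt; rather than ML vectors, so you could also skip the UDF and index the arrays directly with the built-in getItem:

>>> # index element 0 of every array column; same output as the UDF version
>>> df.select([f.col(c).getItem(0).alias(c) for c in df.columns]).show()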

As I understand your problem, you are not required to use a UDF to turn the vector into a plain float type. You can use the predefined PySpark function concat_ws for it.

>>> from pyspark.sql.functions import *
>>> df.show()
+------+
|   num|
+------+
| [211]|
|[3412]|
| [121]|
| [121]|
|  [34]|
|[1441]|
+------+

>>> df.printSchema()
root
 |-- num: array (nullable = true)
 |    |-- element: string (containsNull = true)

>>> df.withColumn("num", concat_ws("", col("num"))).show()
+----+
| num|
+----+
| 211|
|3412|
| 121|
| 121|
|  34|
|1441|
+----+
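One caveat: concat_ws returns a string column, while the question asks for floats. A minimal sketch of casting the result afterwards (assuming single-element arrays of numeric strings, as above):

>>> # concat_ws flattens the single-element array; cast turns the string into a float
>>> df.withColumn("num", concat_ws("", col("num")).cast("float")).show()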
