How can I concat several float columns into one ArrayType(FloatType()) in a Spark DataFrame?

I have a Spark DataFrame with many float columns after reading in a CSV file.

I want to combine all the float columns into one ArrayType(FloatType()) column.

Any ideas on how to do this with PySpark (or Scala)?

If you know all the float column names, you can try this (Scala):

import org.apache.spark.sql.functions.array
val names = Seq("float_col1", "float_col2", "float_col3", /* ... */ "float_col10")
df.withColumn("combined", array(names.map(df(_)): _*))

Here is another version in Scala:

data.printSchema

root
 |-- Int_Col1: integer (nullable = false)
 |-- Str_Col1: string (nullable = true)
 |-- Float_Col1: float (nullable = false)
 |-- Float_Col2: float (nullable = false)
 |-- Str_Col2: string (nullable = true)
 |-- Float_Col3: float (nullable = false)

data.show()

+--------+--------+----------+----------+--------+----------+
|Int_Col1|Str_Col1|Float_Col1|Float_Col2|Str_Col2|Float_Col3|
+--------+--------+----------+----------+--------+----------+
|       1|     ABC|     10.99|     20.99|       a|      9.99|
|       2|     XYZ|  999.1343|    9858.1|       b|    488.99|
+--------+--------+----------+----------+--------+----------+

Add a new, initially empty array<float> field that will collect all the float values.

val df = data.withColumn("Float_Arr_Col",array().cast("array<float>"))

Then filter the columns by the required data type and append each float column to the array using foldLeft:

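import org.apache.spark.sql.functions.{array, col, concat}
df.dtypes
  .collect { case (dn, dt) if dt.startsWith("FloatType") => dn }
  // concat keeps duplicates; array_union would drop values repeated within a row
  .foldLeft(df)((accDF, c) =>
    accDF.withColumn("Float_Arr_Col", concat(col("Float_Arr_Col"), array(col(c)))))
  .show(false)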

Output:

+--------+--------+----------+----------+--------+----------+--------------------------+
|Int_Col1|Str_Col1|Float_Col1|Float_Col2|Str_Col2|Float_Col3|Float_Arr_Col             |
+--------+--------+----------+----------+--------+----------+--------------------------+
|1       |ABC     |10.99     |20.99     |a       |9.99      |[10.99, 20.99, 9.99]      |
|2       |XYZ     |999.1343  |9858.1    |b       |488.99    |[999.1343, 9858.1, 488.99]|
+--------+--------+----------+----------+--------+----------+--------------------------+

Hope this helps!

Found the solution. Very straightforward, but hard to find.

from pyspark.sql.functions import array, col

float_cols = ['_c1', '_c2', '_c3', '_c4', '_c5', '_c6', '_c7', '_c8', '_c9', '_c10']

df.withColumn('combined', array([col(c) for c in float_cols]))
