
Convert an Array column to Array of Structs in PySpark dataframe

I have a DataFrame containing 3 columns:

| str1      | array_of_str1        | array_of_str2  |
+-----------+----------------------+----------------+
| John      | [Size, Color]        | [M, Black]     |
| Tom       | [Size, Color]        | [L, White]     |
| Matteo    | [Size, Color]        | [M, Red]       |
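For reproducibility, here is a minimal sketch that builds this sample DataFrame (the SparkSession setup is assumed, not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reproduce the sample data shown in the table above.
df = spark.createDataFrame(
    [
        ("John", ["Size", "Color"], ["M", "Black"]),
        ("Tom", ["Size", "Color"], ["L", "White"]),
        ("Matteo", ["Size", "Color"], ["M", "Red"]),
    ],
    ["str1", "array_of_str1", "array_of_str2"],
)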

I want to add an array column that combines the 3 columns into an array of structs:

| str1      | array_of_str1        | array_of_str2  | concat_result                                 |
+-----------+----------------------+----------------+-----------------------------------------------+
| John      | [Size, Color]        | [M, Black]     | [[John, Size, M], [John, Color, Black]]       |
| Tom       | [Size, Color]        | [L, White]     | [[Tom, Size, L], [Tom, Color, White]]         |
| Matteo    | [Size, Color]        | [M, Red]       | [[Matteo, Size, M], [Matteo, Color, Red]]     |

If the number of elements in the arrays is fixed, it is quite straightforward using the array and struct functions. Here is a bit of code in Scala.

// Build one struct per index i (0 and 1), then wrap the structs in an array.
val result = df
    .withColumn("concat_result", array((0 to 1).map(i => struct(
                     col("str1"),
                     col("array_of_str1").getItem(i),
                     col("array_of_str2").getItem(i)
    )) : _*))

And in Python, since you were asking about PySpark:

import pyspark.sql.functions as F

# Build one struct per index i, then collect the structs into an array column.
df = df.withColumn("concat_result", F.array(*[ F.struct(
                  F.col("str1"),
                  F.col("array_of_str1").getItem(i),
                  F.col("array_of_str2").getItem(i))
              for i in range(2)]))

df.printSchema()

And you get the following schema:

root
 |-- str1: string (nullable = true)
 |-- array_of_str1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- array_of_str2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- concat_result: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- str1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: string (nullable = true)
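Note the auto-generated field names col2 and col3 in the struct. If you prefer explicit names, a variant of the same snippet using alias() works; the names key and value below are illustrative choices for this sketch, not part of the original answer:

import pyspark.sql.functions as F

# Same fixed-size construction, with alias() on each field so the struct
# members get explicit names instead of the auto-generated col2/col3.
# "key" and "value" are illustrative names.
df = df.withColumn("concat_result", F.array(*[ F.struct(
                  F.col("str1"),
                  F.col("array_of_str1").getItem(i).alias("key"),
                  F.col("array_of_str2").getItem(i).alias("value"))
              for i in range(2)]))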

Spark >= 2.4.x

For a dynamic number of values you can use higher-order functions:

import pyspark.sql.functions as f

# Zip the two arrays element-wise, then map each zipped element x to a
# struct of (str1, x.array_of_str1, x.array_of_str2).
expr = "TRANSFORM(arrays_zip(array_of_str1, array_of_str2), x -> struct(str1, concat(x.array_of_str1), concat(x.array_of_str2)))"
df = df.withColumn('concat_result', f.expr(expr))

df.show(truncate=False)

Schema and output:

root
 |-- array_of_str1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- array_of_str2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str1: string (nullable = true)
 |-- concat_result: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- str1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: string (nullable = true)

+-------------+-------------+------+-----------------------------------------+
|array_of_str1|array_of_str2|str1  |concat_result                            |
+-------------+-------------+------+-----------------------------------------+
|[Size, Color]|[M, Black]   |John  |[[John, Size, M], [John, Color, Black]]  |
|[Size, Color]|[L, White]   |Tom   |[[Tom, Size, L], [Tom, Color, White]]    |
|[Size, Color]|[M, Red]     |Matteo|[[Matteo, Size, M], [Matteo, Color, Red]]|
+-------------+-------------+------+-----------------------------------------+
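As a side note, on Spark >= 3.1 the same higher-order transform can be written with the DataFrame API instead of a SQL string. This is a sketch under that version assumption:

import pyspark.sql.functions as F

# Spark >= 3.1 is assumed: F.transform accepts a Python lambda.
# arrays_zip names the zipped struct's fields after its input columns,
# so each element x exposes x["array_of_str1"] and x["array_of_str2"].
df = df.withColumn(
    "concat_result",
    F.transform(
        F.arrays_zip("array_of_str1", "array_of_str2"),
        lambda x: F.struct(
            F.col("str1"),
            x["array_of_str1"],
            x["array_of_str2"])))

df.show(truncate=False)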
