How do I concatenate two ArrayType(StringType()) columns element-wise in PySpark?

Question

I have two ArrayType(StringType()) columns in a Spark DataFrame, and I want to concatenate them element-wise:

Input:

+-------------+-------------+
|col1         |col2         |
+-------------+-------------+
|['a','b']    |['c','d']    |
|['a','b','c']|['e','f','g']|
+-------------+-------------+

Output:

+-------------+-------------+----------------+
|col1         |col2         |col3            |
+-------------+-------------+----------------+
|['a','b']    |['c','d']    |['ac', 'bd']    |
|['a','b','c']|['e','f','g']|['ae','bf','cg']|
+-------------+-------------+----------------+

Thanks.

Answer 1

For Spark 2.4+, you can use the transform function like this:

from pyspark.sql.functions import expr

col3_expr = "transform(col1, (x, i) -> concat(x, col2[i]))"
df.withColumn("col3", expr(col3_expr)).show()

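The transform function takes the first array column, col1, iterates over its elements, and applies the lambda function (x, i) -> concat(x, col2[i]), where x is the current element and i is its index, which is used to fetch the corresponding element from col2.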

This gives:

+------+------+--------+
|  col1|  col2|    col3|
+------+------+--------+
|[a, b]|[c, d]|[ac, bd]|
+------+------+--------+

Or, more simply, use the higher-order zip_with function:

df.withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))"))
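The two expressions agree when both arrays in a row have the same length, as in the question. A plain-Python model of the per-row semantics may help show where they differ. This is only a sketch, not Spark code: the helper names (sql_concat, transform_concat, zip_with_concat) are made up for illustration, and it assumes a configuration in which an out-of-range index such as col2[i] yields null rather than an error (ANSI mode off).

```python
from itertools import zip_longest
from typing import List, Optional

Str = Optional[str]


def sql_concat(x: Str, y: Str) -> Str:
    """Model of SQL concat on strings: null if either input is null."""
    if x is None or y is None:
        return None
    return x + y


def transform_concat(col1: List[Str], col2: List[Str]) -> List[Str]:
    """Model of transform(col1, (x, i) -> concat(x, col2[i]))."""
    out = []
    for i, x in enumerate(col1):
        # Assumption: an out-of-range col2[i] evaluates to null (ANSI mode off).
        y = col2[i] if i < len(col2) else None
        out.append(sql_concat(x, y))
    return out


def zip_with_concat(col1: List[Str], col2: List[Str]) -> List[Str]:
    """Model of zip_with(col1, col2, (x, y) -> concat(x, y))."""
    # zip_with pads the shorter array with nulls before applying the lambda.
    return [sql_concat(x, y) for x, y in zip_longest(col1, col2)]


# Equal lengths: both variants give the same result.
print(transform_concat(["a", "b"], ["c", "d"]))  # ['ac', 'bd']
print(zip_with_concat(["a", "b"], ["c", "d"]))   # ['ac', 'bd']

# Unequal lengths: the variants differ.
print(transform_concat(["a"], ["c", "d"]))  # ['ac']
print(zip_with_concat(["a"], ["c", "d"]))   # ['ac', None]
```

In this model the transform variant always returns as many elements as col1, while the zip_with variant returns as many as the longer array, with null wherever one side is missing.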

Answer 2

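Here is an alternative answer that also works for the updated (non-original) question. It uses array and array_except to demonstrate those functions. The accepted answer is more elegant.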

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Arbitrary max number of elements to apply array over, need not broadcast such a small amount of data afaik.
max_entries = 5 

# Generate numeric sample data: array length varies between rows, but both arrays in a row have the same length.
dfA = spark.createDataFrame(
    [([x, x + 1, 4, x + 100], 4, [x + 100, x + 200, 999, x + 500]) for x in range(3)],
    ['array1', 'value1', 'array2'],
).withColumn("s", size(col("array1")))
dfB = spark.createDataFrame(
    [([x, x + 1], 4, [x + 100, x + 200]) for x in range(5)],
    ['array1', 'value1', 'array2'],
).withColumn("s", size(col("array1")))
df = dfA.union(dfB)

# concat the array elements which are variable in size up to a max amount.
df2 = df.select(( [concat(col("array1")[i], col("array2")[i]) for i in range(max_entries)]))
df3 = df2.withColumn("res", array(df2.schema.names))

# Get results but strip out null entries from array.
df3.select(array_except(df3.res, array(lit(None)))).show(truncate=False)

I could not find a way to pass the value of column s into range, which is why max_entries is hard-coded.

This returns:

+------------------------------+
|array_except(res, array(NULL))|
+------------------------------+
|[0100, 1200, 4999, 100500]    |
|[1101, 2201, 4999, 101501]    |
|[2102, 3202, 4999, 102502]    |
|[0100, 1200]                  |
|[1101, 2201]                  |
|[2102, 3202]                  |
|[3103, 4203]                  |
|[4104, 5204]                  |
+------------------------------+
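One caveat worth checking before relying on this: array_except is a set-style operation, so besides removing nulls it also removes duplicate values from res. The plain-Python model below illustrates that for a single row. The function names are hypothetical, and the element order (first occurrence kept) and the null result for out-of-range indexes are assumptions of the sketch. If duplicates must be preserved, the higher-order filter function, e.g. filter(res, x -> x is not null), is an alternative on Spark 2.4+.

```python
from typing import List, Optional


def padded_concat(a1: List[int], a2: List[int], max_entries: int) -> List[Optional[str]]:
    """Model of [concat(array1[i], array2[i]) for i in range(max_entries)]."""
    out = []
    for i in range(max_entries):
        # Assumption: indexes past the end of an array evaluate to null.
        x = a1[i] if i < len(a1) else None
        y = a2[i] if i < len(a2) else None
        # concat renders numbers as strings and yields null if any input is null.
        out.append(None if x is None or y is None else f"{x}{y}")
    return out


def array_except_null(res: List[Optional[str]]) -> List[str]:
    """Model of array_except(res, array(NULL)): drops nulls AND duplicates."""
    seen = set()
    out = []
    for v in res:
        if v is not None and v not in seen:
            seen.add(v)
            out.append(v)
    return out


row = padded_concat([0, 1], [100, 200], max_entries=5)
print(row)                     # ['0100', '1200', None, None, None]
print(array_except_null(row))  # ['0100', '1200']

# Duplicate results collapse into one entry.
dup = padded_concat([7, 7], [1, 1], max_entries=5)
print(array_except_null(dup))  # ['71']
```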

Answer 3

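It won't really scale, but you could take the 0th and 1st entries of each array and define col3 as a[0] + b[0] followed by a[1] + b[1]. Pull all 4 entries out as separate values, then combine them for the output.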
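A minimal sketch of that idea, assuming every array has a known, fixed number of elements n: build the SQL expression string in plain Python and hand it to expr. The helper name elementwise_concat_expr is hypothetical and not part of PySpark.

```python
def elementwise_concat_expr(left: str, right: str, n: int) -> str:
    """Build a Spark SQL expression that concatenates the first n elements pairwise."""
    parts = [f"concat({left}[{i}], {right}[{i}])" for i in range(n)]
    return "array(" + ", ".join(parts) + ")"


col3_expr = elementwise_concat_expr("col1", "col2", 2)
print(col3_expr)
# array(concat(col1[0], col2[0]), concat(col1[1], col2[1]))
```

The resulting string could then be applied with df.withColumn("col3", expr(col3_expr)), with expr imported from pyspark.sql.functions. As the answer notes, this does not scale: n is fixed up front, so rows with more than n elements lose the extra ones.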

Answer 4

This is a generic answer. Just look at res for the result. It assumes the 2 arrays are the same size, so both have n elements.

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Gen in this case numeric data, etc. 3 rows with 2 arrays of varying length, but both the same length as in your example
df = spark.createDataFrame([   ( list([x,x+1,4, x+100]), 4, list([x+100,x+200,999,x+500])   ) for x in range(3)], ['array1', 'value1', 'array2'] )    
num_array_elements = len(df.select("array1").first()[0])

# concat
df2 = df.select(([ concat(col("array1")[i], col("array2")[i]) for i in range(num_array_elements)]))
df2.withColumn("res", array(df2.schema.names)).show(truncate=False)

This returns:

4 answers

Answer 1: score 4, accepted, 2020-01-10 19:20:47
Answer 2: score 1, 2020-01-14 13:43:13
Answer 3: score 0, 2020-01-10 19:00:43
Answer 4: score 0, 2020-01-11 15:10:33
