How to concat two array / list columns of different spark dataframes?

I need a concatenated DataFrame, combining array columns from two different Spark DataFrames. Looking for PySpark code.

df1.show()
+---------+
|    value|
+---------+
|[1, 2, 3]|
+---------+

df2.show()
+------+
| value|
+------+
|[4, 5]|
+------+


I need a dataframe as below:
+---------------+
|          value|
+---------------+
|[1, 2, 3, 4, 5]|
+---------------+

Some educational aspects here as well; you can strip out the .show() calls. Some data generation first.

Spark 2.4 assumed. Positional dependency is OK, although some dispute whether it is preserved with RDDs and the like using just zipWithIndex; I have no evidence to doubt that it is. No performance considerations in terms of explicit partitioning, but no UDFs used. Assuming the same number of rows in both DFs. DataSet is not a pyspark object, so an rdd conversion is needed.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat

spark = SparkSession.builder.getOrCreate()  # already ambient in a shell / notebook

# Generate some data: 7 rows each, one array column per row.
df1 = spark.createDataFrame([[[x, x + 1, x + 2]] for x in range(7)], ['value'])
df2 = spark.createDataFrame([[[x + 10, x + 20]] for x in range(7)], ['value'])

# Attach a positional index to each row; zipWithIndex needs an rdd conversion.
dfA = df1.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])
dfB = df2.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])

# Pair rows positionally via an inner join on the index.
df_inner_join = dfA.join(dfB, dfA.index == dfB.index)
new_names = ['value1', 'index1', 'value2', 'index2']
df_renamed = df_inner_join.toDF(*new_names)  # Issues with column renames otherwise!

# In Spark 2.4+, concat on array columns concatenates the arrays into one array.
df_result = df_renamed.select(col("index1"), concat(col("value1"), col("value2")))
new_names_final = ['index', 'value']
df_result_final = df_result.toDF(*new_names_final)
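
The displays below come from plain show() calls; note that the inner join does not preserve row order, so sorting on the index is a handy extra check (a small usage sketch, using only the DataFrames defined above):

df1.show()
df2.show()
df_result_final.show()                    # row order after the join is arbitrary
df_result_final.orderBy('index').show()   # optional: sort to eyeball the pairing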

Data In (generated)

+---------+
|    value|
+---------+
|[0, 1, 2]|
|[1, 2, 3]|
|[2, 3, 4]|
|[3, 4, 5]|
|[4, 5, 6]|
|[5, 6, 7]|
|[6, 7, 8]|
+---------+

+--------+
|   value|
+--------+
|[10, 20]|
|[11, 21]|
|[12, 22]|
|[13, 23]|
|[14, 24]|
|[15, 25]|
|[16, 26]|
+--------+

Data Out

+-----+-----------------+
|index|            value|
+-----+-----------------+
|    0|[0, 1, 2, 10, 20]|
|    6|[6, 7, 8, 16, 26]|
|    5|[5, 6, 7, 15, 25]|
|    1|[1, 2, 3, 11, 21]|
|    3|[3, 4, 5, 13, 23]|
|    2|[2, 3, 4, 12, 22]|
|    4|[4, 5, 6, 14, 24]|
+-----+-----------------+
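
As an aside: for the single-row example at the top of the question, the positional index machinery is overkill. A sketch of a join-free shortcut, valid only when each DataFrame holds exactly one row so that crossJoin produces exactly one pair; the limit(1) calls are just to demo it on the generated data:

from pyspark.sql.functions import concat

# Sketch, assuming one row per side: crossJoin pairs the lone rows,
# then concat appends the two array columns into a single array.
df_single = (df1.limit(1)
                .crossJoin(df2.limit(1).withColumnRenamed('value', 'value2'))
                .select(concat('value', 'value2').alias('value')))
df_single.show(truncate=False)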
