How to concat two array / list columns of different Spark DataFrames?
I need a single DataFrame whose array column is the concatenation of the array columns from two different Spark DataFrames. I am looking for PySpark code.
df1.show()
+---------+
|    value|
+---------+
|[1, 2, 3]|
+---------+
df2.show()
+------+
| value|
+------+
|[4, 5]|
+------+
I need a DataFrame as below:
+---------------+
|          value|
+---------------+
|[1, 2, 3, 4, 5]|
+---------------+
There are some educational aspects here as well. The code generates some sample data first, and you can strip out any .show() calls.
Spark 2.4 is assumed. The approach depends on row position. Some dispute whether ordering is preserved when going through RDDs with just zipWithIndex, but I have no evidence to doubt it. No performance considerations were made in terms of explicit partitioning, but no UDFs are used. Both DataFrames are assumed to have the same number of rows. Dataset is not a PySpark object, so a conversion to RDD is needed.
from pyspark.sql.functions import col, concat
# Assumes an existing SparkSession named `spark` (as in the pyspark shell).
# Sample data: each row holds one array column named 'value'.
df1 = spark.createDataFrame([([x, x + 1, x + 2],) for x in range(7)], ['value'])
df2 = spark.createDataFrame([([x + 10, x + 20],) for x in range(7)], ['value'])
# Attach a positional index to every row via the RDD API.
dfA = df1.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])
dfB = df2.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])
# Pair up rows that share the same position.
df_inner_join = dfA.join(dfB, dfA.index == dfB.index)
# Both sides have columns called 'value' and 'index'; rename to avoid ambiguity.
new_names = ['value1', 'index1', 'value2', 'index2']
df_renamed = df_inner_join.toDF(*new_names)
# concat() accepts array columns from Spark 2.4 onwards.
df_result = df_renamed.select(col("index1"), concat(col("value1"), col("value2")))
new_names_final = ['index', 'value']
df_result_final = df_result.toDF(*new_names_final)
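As a rough mental model only, the same three steps (index, join, concat) can be written in plain Python without Spark. This is a sketch of the logic, not of how Spark executes it; the names rows1, rows2, lookup2 and joined are made up for illustration.

```python
# Plain-Python model of the pipeline; no Spark required.
rows1 = [[x, x + 1, x + 2] for x in range(7)]
rows2 = [[x + 10, x + 20] for x in range(7)]

# zipWithIndex: pair each value with its position, as (value, index).
indexed1 = [(value, index) for index, value in enumerate(rows1)]
indexed2 = [(value, index) for index, value in enumerate(rows2)]

# Inner join on index: only positions present on both sides survive.
lookup2 = {index: value for value, index in indexed2}
joined = [(index, v1, lookup2[index]) for v1, index in indexed1 if index in lookup2]

# concat: append the second array to the first.
result = [(index, v1 + v2) for index, v1, v2 in joined]
```

One consequence of the inner join is visible here: if the two inputs had different lengths, the unmatched trailing rows would be dropped silently, which is why the same row count is assumed.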
Data In (generated)
+---------+
| value|
+---------+
|[0, 1, 2]|
|[1, 2, 3]|
|[2, 3, 4]|
|[3, 4, 5]|
|[4, 5, 6]|
|[5, 6, 7]|
|[6, 7, 8]|
+---------+
+--------+
| value|
+--------+
|[10, 20]|
|[11, 21]|
|[12, 22]|
|[13, 23]|
|[14, 24]|
|[15, 25]|
|[16, 26]|
+--------+
Data Out
+-----+-----------------+
|index|            value|
+-----+-----------------+
|    0|[0, 1, 2, 10, 20]|
|    6|[6, 7, 8, 16, 26]|
|    5|[5, 6, 7, 15, 25]|
|    1|[1, 2, 3, 11, 21]|
|    3|[3, 4, 5, 13, 23]|
|    2|[2, 3, 4, 12, 22]|
|    4|[4, 5, 6, 14, 24]|
+-----+-----------------+
+-----+-----------------+