How to concat two array / list columns of different spark dataframes?

I need a concatenated DataFrame, combining array columns from two different Spark DataFrames. Looking for PySpark code.

df1.show()
+---------+
|    value|
+---------+
|[1, 2, 3]|
+---------+

df2.show()
+------+
| value|
+------+
|[4, 5]|
+------+


I need a dataframe as below:
+---------------+
|          value|
+---------------+
|[1, 2, 3, 4, 5]|
+---------------+

Some educational aspects here as well; you can strip out the .show() calls. Some data generation first.

Spark 2.4 assumed. Positional dependency is OK, although some dispute whether it is preserved with RDDs and the like using just zipWithIndex; I have no evidence to doubt that it is. No performance considerations in terms of explicit partitioning, but no UDFs used. Assuming the same number of rows in both DFs. DataSet is not a pyspark object, so an rdd conversion is needed.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat

spark = SparkSession.builder.getOrCreate()  # already ambient in a shell / notebook

# Generate some data: 7 rows each, one array column per row.
df1 = spark.createDataFrame([[[x, x + 1, x + 2]] for x in range(7)], ['value'])
df2 = spark.createDataFrame([[[x + 10, x + 20]] for x in range(7)], ['value'])

# Attach a positional index to each row; zipWithIndex needs an rdd conversion.
dfA = df1.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])
dfB = df2.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])

# Pair rows positionally via an inner join on the index.
df_inner_join = dfA.join(dfB, dfA.index == dfB.index)
new_names = ['value1', 'index1', 'value2', 'index2']
df_renamed = df_inner_join.toDF(*new_names)  # Issues with column renames otherwise!

# In Spark 2.4+, concat on array columns concatenates the arrays into one array.
df_result = df_renamed.select(col("index1"), concat(col("value1"), col("value2")))
new_names_final = ['index', 'value']
df_result_final = df_result.toDF(*new_names_final)
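
The displays below come from plain show() calls; note that the inner join does not preserve row order, so sorting on the index is a handy extra check (a small usage sketch, using only the DataFrames defined above):

df1.show()
df2.show()
df_result_final.show()                    # row order after the join is arbitrary
df_result_final.orderBy('index').show()   # optional: sort to eyeball the pairing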

Data In (generated)

+---------+
|    value|
+---------+
|[0, 1, 2]|
|[1, 2, 3]|
|[2, 3, 4]|
|[3, 4, 5]|
|[4, 5, 6]|
|[5, 6, 7]|
|[6, 7, 8]|
+---------+

+--------+
|   value|
+--------+
|[10, 20]|
|[11, 21]|
|[12, 22]|
|[13, 23]|
|[14, 24]|
|[15, 25]|
|[16, 26]|
+--------+

Data Out

+-----+-----------------+
|index|            value|
+-----+-----------------+
|    0|[0, 1, 2, 10, 20]|
|    6|[6, 7, 8, 16, 26]|
|    5|[5, 6, 7, 15, 25]|
|    1|[1, 2, 3, 11, 21]|
|    3|[3, 4, 5, 13, 23]|
|    2|[2, 3, 4, 12, 22]|
|    4|[4, 5, 6, 14, 24]|
+-----+-----------------+
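
As an aside: for the single-row example at the top of the question, the positional index machinery is overkill. A sketch of a join-free shortcut, valid only when each DataFrame holds exactly one row so that crossJoin produces exactly one pair; the limit(1) calls are just to demo it on the generated data:

from pyspark.sql.functions import concat

# Sketch, assuming one row per side: crossJoin pairs the lone rows,
# then concat appends the two array columns into a single array.
df_single = (df1.limit(1)
                .crossJoin(df2.limit(1).withColumnRenamed('value', 'value2'))
                .select(concat('value', 'value2').alias('value')))
df_single.show(truncate=False)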
