
Pyspark/SQL join a column having list values to another dataframe column

I want to join two tables in the way asked here: Pandas merge a list in a dataframe column with another dataframe

# Input Data Frame 
ID   LIST_VALUES
 1     [a,b,c]
 2     [a,n,t]
 3     [x]
 4     [h,h]


VALUE     MAPPING
 a         alpha
 b         bravo
 c         charlie
 n         november
 h         hotel
 t         tango
 x         xray

I want the following output. How can I do this in PySpark or SQL?

# EXPECTED OUTPUT DATAFRAME

ID   LIST_VALUES    new_col
 1     [a,b,c]       alpha,bravo,charlie
 2     [a,n,t]       alpha,november,tango
 3     [x]           xray
 4     [h,h]         hotel
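
To pin down the semantics implied by the expected output (mapped names kept in list order, duplicates such as `[h,h]` collapsed to a single `hotel`), here is a plain-Python reference sketch. It is not a Spark solution, only a statement of the intended result. The names `lists`, `mapping`, and `map_list` are hypothetical, and skipping values that have no mapping is an assumption the question does not state.

```python
# Input data from the question, as plain Python structures
lists = {1: ["a", "b", "c"], 2: ["a", "n", "t"], 3: ["x"], 4: ["h", "h"]}
mapping = {
    "a": "alpha", "b": "bravo", "c": "charlie", "n": "november",
    "h": "hotel", "t": "tango", "x": "xray",
}

def map_list(values, mapping):
    """Map each value to its name, keep first-seen order, drop duplicates."""
    # Assumption: values without a mapping are skipped
    names = [mapping[v] for v in values if v in mapping]
    # dict.fromkeys removes duplicates while preserving insertion order
    return ",".join(dict.fromkeys(names))

new_col = {id_: map_list(values, mapping) for id_, values in lists.items()}
```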

I created the following data and output based on the link provided.

A program using the PySpark DataFrame API would look like this:

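# imports
# `spark` below is assumed to be an existing SparkSession (e.g. in a notebook or the pyspark shell)
from pyspark.sql import functions as F
from pyspark.sql.window import Window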

# replicating the data

cols = ['ID','LIST_VALUES']
row_1 = [1,['a','b','c']]
row_2 = [2,['a','n','t']]
row_3 = [3,['x']]
row_4 = [4, ['h','h']]
rows = [row_1, row_2,row_3,row_4]

df_1 = spark.createDataFrame(rows, cols)

cols = ['VALUE','MAPPING']
row_1 = ['a','alpha']
row_2 = ['b', 'bravo']
row_3 = ['c', 'charlie']
row_4 = ['n', 'november']
row_5 = ['h', 'hotel']
row_6 = ['t', 'tango']
row_7 = ['x', 'xray']

rows = [row_1, row_2,row_3,row_4, row_5, row_6, row_7]

df_a = spark.createDataFrame(rows, cols)

# we need to explode the LIST_VALUES Column first
df_1 = df_1.withColumn("EXP_LIST_VALUES",F.explode(F.col('LIST_VALUES')))
df_2 = df_1.select('ID','EXP_LIST_VALUES')

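# then we can do a left join of df_a (the mapping) with df_2
# note: df_a is on the left, so a mapping value that appears in no list would yield a row with a null ID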

df_new = df_a.join(df_2,df_a.VALUE == df_2.EXP_LIST_VALUES,'left')

# apply window functions to collect the values and mappings per ID
# note: collect_set drops duplicates and does not guarantee element order

w = Window.partitionBy(F.col('ID'))

df_output = df_new.select(
    F.col('ID'),
    F.collect_set(F.col('VALUE')).over(w).alias('LIST_VALUES'),
    F.array_join(F.collect_set(F.col('MAPPING')).over(w), ',').alias('new_col'),
).dropDuplicates()


# display() is a Databricks notebook helper; elsewhere use df_output.show()
display(df_output)

The output looks like the following dataframe:

# +---+-----------+--------------------+
# | ID|LIST_VALUES|             new_col|
# +---+-----------+--------------------+
# |  1|[c, b, a]  | bravo,charlie,alpha|
# |  2|[t, n, a]  |november,tango,alpha|
# |  3|      [x]  |                xray|
# |  4|      [h]  |               hotel|
# +---+-----------+--------------------+

