PySpark Mapping Elements in Array within a Dataframe to another Dataframe
I have two dataframes. The first dataframe has an array as the value of its column2, and I want to join it with the second dataframe so that the numeric values are mapped to their string values. The order of the elements must be preserved, because they correspond by index to the array elements in column3.
df_one:
column1| column2| column3
----------------------------------
"thing1"|[1,2,3..]|[0.1,0.2,0.3..]
"thing2"|[1,2,3..]|[0.1,0.2,0.3..]
"thing3"|[1,2,3..]|[0.1,0.2,0.3..]
...
df_two:
columnA|columnB
---------------
1|"item1"
2|"item2"
3|"item3"
...
Is there a way to join these dataframes and select the columns like this:
column1 | newColumn| column3
----------------------------------------------------
"thing1"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing2"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing3"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
...
As mentioned in the comments, exploding column2 and then joining on columnA is a good approach. However, I am not sure that the order will always be preserved when you group the data back together.
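To make the intended semantics concrete, the mapping can be expressed as a plain-Python reference (the helper name `map_elements` is hypothetical, introduced here only for illustration): build a dict from df_two's rows, then replace each numeric ID in column2 by index, keeping the order intact.

```python
def map_elements(column2, mapping):
    """Replace each numeric ID with its string value, preserving order."""
    return [mapping[i] for i in column2]

# dict built from df_two: columnA -> columnB
mapping = {1: "item1", 2: "item2", 3: "item3"}
print(map_elements([1, 2, 3], mapping))  # -> ['item1', 'item2', 'item3']
```

Any Spark solution should reproduce exactly this per-row result; the challenge is doing it without a Python UDF and without losing the order during aggregation.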
To be sure, and to avoid a costly Python UDF, you can use posexplode
to keep track of the position of each element, and then use an ordered window function to rebuild the list:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

df_one = spark.createDataFrame(
    [("thing1", [1, 2, 3], "X"), ("thing2", [1, 2, 3], "Y"), ("thing3", [1, 2, 3], "Z")],
    ["column1", "column2", "column3"])
df_two = spark.createDataFrame(
    [(1, "item1"), (2, "item2"), (3, "item3")],
    ["columnA", "columnB"])

# one window per row of df_one, ordered by the original array position
w = Window.partitionBy("column1").orderBy("pos")

df_one\
    .select("*", f.posexplode("column2").alias("pos", "columnA"))\
    .join(df_two, ["columnA"])\
    .withColumn("newColumn", f.collect_list("columnB").over(w))\
    .where(f.col("pos") + 1 == f.size(f.col("column2")))\
    .select("column1", "newColumn", "column3")\
    .show(truncate=False)
+-------+---------------------+-------+
|column1|newColumn |column3|
+-------+---------------------+-------+
|thing1 |[item1, item2, item3]|X |
|thing2 |[item1, item2, item3]|Y |
|thing3 |[item1, item2, item3]|Z |
+-------+---------------------+-------+
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or link to the original. For any questions, contact: yoyou2525@163.com.