
PySpark Mapping Elements in Array within a Dataframe to another Dataframe

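I have two dataframes. The first dataframe has an array as the value of its column2, and I want to join it with the second dataframe so that the numeric values are mapped to their string values. The order of the elements should stay the same, because they correspond by index to the array elements in column3.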

df_one

 column1|  column2|        column3
----------------------------------
"thing1"|[1,2,3..]|[0.1,0.2,0.3..]
"thing2"|[1,2,3..]|[0.1,0.2,0.3..]
"thing3"|[1,2,3..]|[0.1,0.2,0.3..]
...

df_two

columnA|columnB
---------------
      1|"item1"
      2|"item2"
      3|"item3"
...

Is there a way to join these dataframes and select the columns like this:

column1 |                  newColumn|        column3
----------------------------------------------------
"thing1"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing2"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing3"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
...

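As mentioned in the comments, using explode on column2 and then joining column2 with columnA is a good way to go. However, I am not sure the order would always be preserved when you group the data back together.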
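The ordering concern can be illustrated without Spark. The following is only a plain-Python model of the idea, not Spark code: `map_in_order` is a hypothetical helper, and `reversed` merely simulates a join that returns records in a different order. The point is that the list can be rebuilt reliably only if each element carries its original index.

```python
# Plain-Python model (no Spark required); the data mirrors the example tables.
df_one = [("thing1", [3, 1, 2], [0.3, 0.1, 0.2])]
df_two = {1: "item1", 2: "item2", 3: "item3"}


def map_in_order(rows, lookup):
    """Map each number in column2 to its string, keeping the original order."""
    result = []
    for column1, column2, column3 in rows:
        # "explode with position": one (pos, value) record per array element
        exploded = list(enumerate(column2))
        # "join": attach the string value; reversed() simulates records
        # coming back in an arbitrary order
        joined = [(pos, lookup[value]) for pos, value in reversed(exploded)]
        # rebuild the list by sorting on the recorded position
        new_column = [name for _, name in sorted(joined)]
        result.append((column1, new_column, column3))
    return result


print(map_in_order(df_one, df_two))
```

Without the recorded position, the rebuilt list in this model would come out reversed and no longer line up with column3.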

To be sure, and to avoid a costly Python UDF, you could use posexplode to keep track of the position of each element, and then an ordered window function to build the list back:

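from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

df_one = spark.createDataFrame([("thing1", [1, 2, 3], "X"), ("thing2", [1, 2, 3], "Y"), ("thing3", [1, 2, 3], "Z")],
                               ["column1", "column2", "column3"])
df_two = spark.createDataFrame([(1, "item1"), (2, "item2"), (3, "item3")],
                               ["columnA", "columnB"])

# One window per column1 value, ordered by the element's original position
w = Window.partitionBy("column1").orderBy("pos")

(df_one
    # one row per array element, with its index in "pos"
    .select("*", f.posexplode("column2").alias("pos", "columnA"))
    # map each number to its string value
    .join(df_two, ['columnA'])
    # running list of strings in position order
    .withColumn("newColumn", f.collect_list("columnB").over(w))
    # keep only the last position, which holds the complete list
    .where(f.col("pos") + 1 == f.size(f.col("column2")))
    .select("column1", "newColumn", "column3")
    .show(truncate=False))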
+-------+---------------------+-------+
|column1|newColumn            |column3|
+-------+---------------------+-------+
|thing1 |[item1, item2, item3]|X      |
|thing2 |[item1, item2, item3]|Y      |
|thing3 |[item1, item2, item3]|Z      |
+-------+---------------------+-------+

