PySpark Mapping Elements in Array within a Dataframe to another Dataframe
I have two dataframes. The first dataframe has an array as the value of its column2, and I want to join it with the second dataframe so that the numeric values are mapped to their string values. The order of the elements must be preserved, because they correspond by index to the array elements in column3.
df_one:
column1| column2| column3
----------------------------------
"thing1"|[1,2,3..]|[0.1,0.2,0.3..]
"thing2"|[1,2,3..]|[0.1,0.2,0.3..]
"thing3"|[1,2,3..]|[0.1,0.2,0.3..]
...
df_two:
columnA|columnB
---------------
1|"item1"
2|"item2"
3|"item3"
...
Is there a way to join these dataframes and select the columns like this:
column1 | newColumn| column3
----------------------------------------------------
"thing1"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing2"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing3"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
...
As mentioned in the comments, exploding column2 and then joining on columnA is a good approach. However, I am not sure that the order will always be preserved when you group the data back together.
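To make the intended semantics concrete, the mapping can be expressed as a plain-Python reference (the helper name `map_elements` is hypothetical, introduced here only for illustration): build a dict from df_two's rows, then replace each numeric ID in column2 by index, keeping the order intact.

```python
def map_elements(column2, mapping):
    """Replace each numeric ID with its string value, preserving order."""
    return [mapping[i] for i in column2]

# dict built from df_two: columnA -> columnB
mapping = {1: "item1", 2: "item2", 3: "item3"}
print(map_elements([1, 2, 3], mapping))  # -> ['item1', 'item2', 'item3']
```

Any Spark solution should reproduce exactly this per-row result; the challenge is doing it without a Python UDF and without losing the order during aggregation.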
To be sure, and to avoid a costly Python UDF, you can use posexplode
to keep track of the position of each element, and then use an ordered window function to rebuild the list:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

df_one = spark.createDataFrame(
    [("thing1", [1, 2, 3], "X"), ("thing2", [1, 2, 3], "Y"), ("thing3", [1, 2, 3], "Z")],
    ["column1", "column2", "column3"])
df_two = spark.createDataFrame(
    [(1, "item1"), (2, "item2"), (3, "item3")],
    ["columnA", "columnB"])

# one window per row of df_one, ordered by the original array position
w = Window.partitionBy("column1").orderBy("pos")

df_one\
    .select("*", f.posexplode("column2").alias("pos", "columnA"))\
    .join(df_two, ["columnA"])\
    .withColumn("newColumn", f.collect_list("columnB").over(w))\
    .where(f.col("pos") + 1 == f.size(f.col("column2")))\
    .select("column1", "newColumn", "column3")\
    .show(truncate=False)
+-------+---------------------+-------+
|column1|newColumn |column3|
+-------+---------------------+-------+
|thing1 |[item1, item2, item3]|X |
|thing2 |[item1, item2, item3]|Y |
|thing3 |[item1, item2, item3]|Z |
+-------+---------------------+-------+
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or link to the original. For any questions, contact: yoyou2525@163.com.