How to add rows to a DataFrame from another DataFrame, based on a column, in PySpark
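I have two DataFrames, and I want to move rows from the second one into the first. However, I only want to add a row when its value in the cid column is not already present in the first DataFrame.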
df1
x y z cid
4 8 1 1
7 5 6 2
7 3 5 3
df2
x y z cid
8 4 5 1
1 2 9 2
8 6 4 3
4 5 4 4
result:
x y z cid
4 8 1 1
7 5 6 2
7 3 5 3
4 5 4 4
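You can try the code below.
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set
# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()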
# Create DataFrame df1
df1 = spark.createDataFrame([(4,8,1,1), (7,5,6,2), (7,3,5,3)], ["x", "y", "z", "cid"])
# Create DataFrame df2
df2 = spark.createDataFrame([(8,4,5,1), (1,2,9,2), (8,6,4,3), (4,5,4,4)], ["x", "y", "z", "cid"])
# Collect the distinct cid values of df1 into a Python list on the driver
col1 = df1.select(collect_set("cid")).collect()[0][0]
# Keep only the rows of df2 whose cid does not appear in df1
df3 = df2.filter(~df2["cid"].isin(col1))
# Union df1 and df3.
df4 = df1.union(df3)
df4.show()
# Output
+---+---+---+---+
| x| y| z|cid|
+---+---+---+---+
| 4| 8| 1| 1|
| 7| 5| 6| 2|
| 7| 3| 5| 3|
| 4| 5| 4| 4|
+---+---+---+---+
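To make the logic of the answer easier to check without a Spark cluster, here is a minimal plain-Python sketch of the same filter-then-union idea. It models each DataFrame as a list of (x, y, z, cid) tuples; the helper name append_missing_by_cid and its key_index parameter are made up for illustration and are not part of PySpark.

```python
# Plain-Python model of the filter-then-union logic (no Spark required).
# Each row is an (x, y, z, cid) tuple, matching the example tables.
df1_rows = [(4, 8, 1, 1), (7, 5, 6, 2), (7, 3, 5, 3)]
df2_rows = [(8, 4, 5, 1), (1, 2, 9, 2), (8, 6, 4, 3), (4, 5, 4, 4)]


def append_missing_by_cid(first, second, key_index=3):
    """Return `first` plus the rows of `second` whose key is absent from `first`.

    Hypothetical helper for illustration; key_index=3 is the cid position.
    """
    # Equivalent of collect_set("cid") on df1
    seen = {row[key_index] for row in first}
    # Equivalent of df2.filter(~df2["cid"].isin(col1))
    extra = [row for row in second if row[key_index] not in seen]
    # Equivalent of df1.union(df3)
    return first + extra


result_rows = append_missing_by_cid(df1_rows, df2_rows)
for row in result_rows:
    print(row)
```

In PySpark itself, collecting the cid values to the driver may become expensive when df1 is large. One alternative, offered as a suggestion rather than as part of the original answer, is a left anti join such as df2.join(df1.select("cid"), on="cid", how="left_anti"). Because joining on a column name places that column first in the result, the columns would likely need to be reordered with select(df1.columns), or combined with unionByName, before the union.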
I hope this helps.