Iterate rows in dataframe df_a and update dataframe df_b based on df_a values in PySpark
I have a dataframe df_b which has to be updated based on the values in dataframe df_a.
df_a
+-----+-----+------------+---------+
| id_1| id_2| header_oper| head_seq|
+-----+-----+------------+---------+
| boy| 3| insert| 1|
| bat| 4| delete| 3|
| cat| 2| insert| 1|
| bat| 4| update| 2|
| bat| 5| beforeimg| 1|
+-----+-----+------------+---------+
df_b (before)
+-----+-----+
| id_1| id_2|
+-----+-----+
| boy| 4|
| bat| 5|
| cat| 1|
+-----+-----+
The method I came up with:
Expected df_b (after):
+-----+-----+
| id_1| id_2|
+-----+-----+
| boy| 4|
| boy| 3|
| cat| 2|
| cat| 1|
+-----+-----+
I need help with how to iterate over df_a and perform operations on df_b based on the values in df_a.
from pyspark.sql import functions as F

ds = spark.createDataFrame([('boy', 4), ('bat', 5), ('cat', 1)], ['id_1', 'id_2'])
df_op = spark.createDataFrame([('boy', 3, 'insert', 1), ('bat', 4, 'delete', 3), ('cat', 2, 'insert', 1), ('bat', 4, 'update', 2), ('bat', 5, 'beforeimg', 1)], ['id_1', 'id_2', 'eff_op', 'seq'])
# Keep only the latest operation (highest seq) for each id_1
effective_op = df_op.groupBy('id_1').agg(F.max('seq').alias('seq')).join(df_op, ['id_1', 'seq'])
# Append the rows whose latest operation is an insert
ds_insert = ds.union(effective_op.filter("eff_op = 'insert'").select('id_1', 'id_2'))
# Drop every id_1 whose latest operation is a delete
deleted_ids = effective_op.filter("eff_op = 'delete'").select('id_1')
ds_delete = ds_insert.join(deleted_ids, ['id_1'], 'left_anti')
display(ds_delete)  # Databricks notebook helper; use ds_delete.show() elsewhere
OK, I figured this out. Because every update comes with a beforeimg row, the order of the operations doesn't matter.
I just had to add all the inserts and updates, then remove the deletes and beforeimgs.
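To see why the row order in df_a stops mattering, here is a minimal sketch in plain Python that models each row as an (id_1, id_2) tuple and the dataframes as sets. The helper name apply_ops is made up for illustration, and the set model only approximates Spark (union keeps duplicates in Spark, while subtract returns distinct rows), so treat it as a check of the idea rather than the actual implementation.

```python
# Operations from df_a as (id_1, id_2, header_oper); head_seq is ignored on purpose.
df_a_rows = [
    ("boy", 3, "insert"),
    ("bat", 4, "delete"),
    ("cat", 2, "insert"),
    ("bat", 4, "update"),
    ("bat", 5, "beforeimg"),
]
# Current contents of df_b as a set of (id_1, id_2) tuples.
df_b_rows = {("boy", 4), ("bat", 5), ("cat", 1)}


def apply_ops(current, ops):
    """Hypothetical helper: add inserts/updates first, then remove deletes/beforeimgs."""
    to_add = {(a, b) for a, b, op in ops if op in ("insert", "update")}
    to_remove = {(a, b) for a, b, op in ops if op in ("delete", "beforeimg")}
    return (current | to_add) - to_remove


result = apply_ops(df_b_rows, df_a_rows)
# Feeding the operations in reverse order gives the same rows.
result_reversed = apply_ops(df_b_rows, list(reversed(df_a_rows)))
print(sorted(result))
```

Note that the adds still have to happen before the removes: in the sample data the same row (bat, 4) is both updated and deleted, so removing first and adding afterwards would leave it in the result.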
Partitioning the operations and dropping the header columns:
# Split df_a by operation type and keep only the key columns
ins = df_a.where(df_a['header_oper'] == 'insert').select('id_1', 'id_2')
upd = df_a.where(df_a['header_oper'] == 'update').select('id_1', 'id_2')
dele = df_a.where(df_a['header_oper'] == 'delete').select('id_1', 'id_2')
bimg = df_a.where(df_a['header_oper'] == 'beforeimg').select('id_1', 'id_2')
Appending the inserts and updates to df_b:
df_b=df_b.union(ins)
df_b=df_b.union(upd)
Removing the deletes and beforeimgs from df_b:
df_b=df_b.subtract(dele)
df_b=df_b.subtract(bimg)