
Iterate rows in dataframe df_a and update dataframe df_b based on df_a values in Pyspark

I have a dataframe df_b which has to be updated based on the values in dataframe df_a.

df_a

+-----+-----+------------+---------+
| id_1| id_2| header_oper| head_seq|
+-----+-----+------------+---------+
|  boy|    3|      insert|        1|
|  bat|    4|      delete|        3|
|  cat|    2|      insert|        1|
|  bat|    4|      update|        2|
|  bat|    5|   beforeimg|        1|
+-----+-----+------------+---------+

df_b (before)

+-----+-----+
| id_1| id_2|
+-----+-----+
|  boy|    4|
|  bat|    5|
|  cat|    1|
+-----+-----+

The method I came up with:

  1. Sort df_a on 'head_seq'.
  2. Iterate over df_a.
  3. If 'header_oper'.isin('insert', 'update'), append that row to df_b.
  4. If 'header_oper'.isin('delete', 'beforeimg'), subtract that row from df_b.

Expected df_b (after):

+-----+-----+
| id_1| id_2|
+-----+-----+
|  boy|    4|
|  boy|    3|
|  cat|    2|
|  cat|    1|
+-----+-----+



I need help on how to iterate over df_a and perform the operations on df_b based on the df_a values.
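For reference, a literal, driver-side reading of steps 1-4 might look like the sketch below (an illustration only, assuming df_a and df_b are small enough to collect; per-row appends and subtracts would not scale to large dataframes):

# Hypothetical sketch: replay df_a's operations against df_b on the driver
rows_b = [(r['id_1'], r['id_2']) for r in df_b.collect()]
for r in df_a.orderBy('head_seq').collect():            # steps 1-2: sort on head_seq, then iterate
    pair = (r['id_1'], r['id_2'])
    if r['header_oper'] in ('insert', 'update'):        # step 3: append the row
        rows_b.append(pair)
    elif r['header_oper'] in ('delete', 'beforeimg'):   # step 4: subtract the row
        rows_b = [p for p in rows_b if p != pair]
df_b = spark.createDataFrame(rows_b, ['id_1', 'id_2'])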

# Derive the effective (latest) operation per key, then union inserts and filter out deletes
from pyspark.sql.functions import max as max_, asc

ds = spark.createDataFrame([('boy', 4), ('bat', 5), ('cat', 1)], ['id_1', 'id_2'])
df_op = spark.createDataFrame(
    [('boy', 3, 'insert', 1), ('bat', 4, 'delete', 3), ('cat', 2, 'insert', 1),
     ('bat', 4, 'update', 2), ('bat', 5, 'beforeimg', 1)],
    ['id_1', 'id_2', 'eff_op', 'seq'])

# Keep only the highest-seq operation per key
effective_op = df_op.groupBy('id_1').agg(max_('seq').alias('seq')).join(df_op, ['id_1', 'seq'])

# Append rows whose effective operation is an insert
ds_insert = ds.union(effective_op.filter("eff_op in ('insert')").select('id_1', 'id_2').orderBy(asc('id_1')))

# Drop rows whose key carries an effective delete (left join + null filter)
ds_delete = (ds_insert.join(effective_op.filter("eff_op in ('delete')"), ['id_1'], 'left')
             .filter("eff_op is null").select(ds_insert.id_1, ds_insert.id_2))

display(ds_delete)  # display() is Databricks-specific
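With the sample data above, effective_op keeps each key's highest-seq row, so bat resolves to its delete while boy and cat resolve to their inserts; ds_delete should then match the expected df_b shown above (row order aside). Outside Databricks, a plain show() does the same check:

ds_delete.orderBy('id_1').show()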

OK, I figured this out. As there is a beforeimg for every update, the order of the operations didn't matter.

I just had to add all the Inserts and Updates and then subtract the Deletes and BeforeImgs.


Partitioning the operations and deselecting the header columns

ins = df_a.where(df_a['header_oper'] == 'insert')
ins = ins.select('id_1', 'id_2')

upd = df_a.where(df_a['header_oper'] == 'update')
upd = upd.select('id_1', 'id_2')

dele = df_a.where(df_a['header_oper'] == 'delete')
dele = dele.select('id_1', 'id_2')

bimg = df_a.where(df_a['header_oper'] == 'beforeimg')
bimg = bimg.select('id_1', 'id_2')

Appending the Inserts and Updates to df_b

df_b = df_b.union(ins)
df_b = df_b.union(upd)

Removing the Deletes and BeforeImgs from df_b

df_b = df_b.subtract(dele)
df_b = df_b.subtract(bimg)
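Assuming df_a and df_b were created from the sample tables above, this should leave df_b with the expected rows (row order is not guaranteed), which can be checked with:

df_b.show()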
