
Iterate rows in dataframe df_a and update dataframe df_b based on df_a values in Pyspark

I have a dataframe df_b which has to be updated based on the values in dataframe df_a.

df_a

+-----+-----+------------+---------+
| id_1| id_2| header_oper| head_seq|
+-----+-----+------------+---------+
|  boy|    3|      insert|        1|
|  bat|    4|      delete|        3|
|  cat|    2|      insert|        1|
|  bat|    4|      update|        2|
|  bat|    5|   beforeimg|        1|
+-----+-----+------------+---------+

df_b (before)

+-----+-----+
| id_1| id_2|
+-----+-----+
|  boy|    4|
|  bat|    5|
|  cat|    1|
+-----+-----+

The method I came up with:

  1. Sort df_a on 'head_seq'.
  2. Iterate over df_a.
  3. If 'header_oper'.isin('insert', 'update'), append that row to df_b.
  4. If 'header_oper'.isin('delete', 'beforeimg'), subtract that row from df_b.

Expected df_b (after):

+-----+-----+
| id_1| id_2|
+-----+-----+
|  boy|    4|
|  boy|    3|
|  cat|    2|
|  cat|    1|
+-----+-----+



I need help on how to iterate over df_a and perform the operations on df_b based on the df_a values.
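For reference, a literal, driver-side reading of steps 1-4 might look like the sketch below (an illustration only, assuming df_a and df_b are small enough to collect; per-row appends and subtracts would not scale to large dataframes):

# Hypothetical sketch: replay df_a's operations against df_b on the driver
rows_b = [(r['id_1'], r['id_2']) for r in df_b.collect()]
for r in df_a.orderBy('head_seq').collect():            # steps 1-2: sort on head_seq, then iterate
    pair = (r['id_1'], r['id_2'])
    if r['header_oper'] in ('insert', 'update'):        # step 3: append the row
        rows_b.append(pair)
    elif r['header_oper'] in ('delete', 'beforeimg'):   # step 4: subtract the row
        rows_b = [p for p in rows_b if p != pair]
df_b = spark.createDataFrame(rows_b, ['id_1', 'id_2'])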

# Derive the effective (latest) operation per key, then union inserts and filter out deletes
from pyspark.sql.functions import max as max_, asc

ds = spark.createDataFrame([('boy', 4), ('bat', 5), ('cat', 1)], ['id_1', 'id_2'])
df_op = spark.createDataFrame(
    [('boy', 3, 'insert', 1), ('bat', 4, 'delete', 3), ('cat', 2, 'insert', 1),
     ('bat', 4, 'update', 2), ('bat', 5, 'beforeimg', 1)],
    ['id_1', 'id_2', 'eff_op', 'seq'])

# Keep only the highest-seq operation per key
effective_op = df_op.groupBy('id_1').agg(max_('seq').alias('seq')).join(df_op, ['id_1', 'seq'])

# Append rows whose effective operation is an insert
ds_insert = ds.union(effective_op.filter("eff_op in ('insert')").select('id_1', 'id_2').orderBy(asc('id_1')))

# Drop rows whose key carries an effective delete (left join + null filter)
ds_delete = (ds_insert.join(effective_op.filter("eff_op in ('delete')"), ['id_1'], 'left')
             .filter("eff_op is null").select(ds_insert.id_1, ds_insert.id_2))

display(ds_delete)  # display() is Databricks-specific
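With the sample data above, effective_op keeps each key's highest-seq row, so bat resolves to its delete while boy and cat resolve to their inserts; ds_delete should then match the expected df_b shown above (row order aside). Outside Databricks, a plain show() does the same check:

ds_delete.orderBy('id_1').show()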

OK, I figured this out. As there is a beforeimg for every update, the order of the operations didn't matter.

I just had to add all the Inserts and Updates and then subtract the Deletes and BeforeImgs.


Partitioning the operations and deselecting the header columns

ins = df_a.where(df_a['header_oper'] == 'insert')
ins = ins.select('id_1', 'id_2')

upd = df_a.where(df_a['header_oper'] == 'update')
upd = upd.select('id_1', 'id_2')

dele = df_a.where(df_a['header_oper'] == 'delete')
dele = dele.select('id_1', 'id_2')

bimg = df_a.where(df_a['header_oper'] == 'beforeimg')
bimg = bimg.select('id_1', 'id_2')

Appending the Inserts and Updates to df_b

df_b = df_b.union(ins)
df_b = df_b.union(upd)

Removing the Deletes and BeforeImgs from df_b

df_b = df_b.subtract(dele)
df_b = df_b.subtract(bimg)
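Assuming df_a and df_b were created from the sample tables above, this should leave df_b with the expected rows (row order is not guaranteed), which can be checked with:

df_b.show()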
