
How to filter a Scala Spark DataFrame if a row matches an ID in another DataFrame and its timestamp is below the other frame's timestamp

I want to filter out entries in a DataFrame of message events based on when they were edited. I have one DataFrame with the message events and another DataFrame that records when (and whether) each message was edited. A row in the message table should be deleted if it has a matching index in the edited table AND its timestamp is below the timestamp of the corresponding edit event.

The Edited DataFrame is:

+----------+-------------------+
| timestamp|index              |
+----------+-------------------+
|1556247980|                 78|
|1558144430|                 87|
|1549964820|                 99|
+----------+-------------------+

The Message DataFrame is:

+-------------------+--------------------+------------------+--------------------+
|index              |  commonResponseText|publishedTimestamp|  commonResponseText|
+-------------------+--------------------+------------------+--------------------+
|                 78|Voluptatem enim a...|        1556247974|Voluptatem enim a...|
|                 87|Ut enim enim sunt...|        1558144420|Ut enim enim sunt...|
|                 99|Et est perferendi...|        1549964815|Et est perferendi...|
|                 78|Voluptatem porro ...|        1556248000|Voluptatem porro ...|
|                 87|Atque quod est au...|        1549965000|Atque quod est au...|
+-------------------+--------------------+------------------+--------------------+

I want the result to be:

+-------------------+--------------------+------------------+--------------------+
|index              |  commonResponseText|publishedTimestamp|  commonResponseText|
+-------------------+--------------------+------------------+--------------------+
|                 78|Voluptatem porro ...|        1556248000|Voluptatem porro ...|
|                 87|Atque quod est au...|        1549965000|Atque quod est au...|
+-------------------+--------------------+------------------+--------------------+
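
For reference, the two inputs above can be reproduced with something like the following Scala sketch (column names taken from the tables, the duplicated text column omitted, and spark assumed to be an existing SparkSession):

import spark.implicits._

// Edit events: one row per edited message index.
val editedDF = Seq(
  (1556247980L, 78L),
  (1558144430L, 87L),
  (1549964820L, 99L)
).toDF("timestamp", "index")

// Message events: several rows can share the same index.
val messageDF = Seq(
  (78L, "Voluptatem enim a...", 1556247974L),
  (87L, "Ut enim enim sunt...", 1558144420L),
  (99L, "Et est perferendi...", 1549964815L),
  (78L, "Voluptatem porro ...", 1556248000L),
  (87L, "Atque quod est au...", 1549965000L)
).toDF("index", "commonResponseText", "publishedTimestamp")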

Thanks for the help!

You can aggregate your message table, join it with the edited table, and then filter:

import pyspark.sql.functions as F
# Test dataframes: tst stands in for the message table, tst1 for the edited table
tst = sqlContext.createDataFrame([('A',2),('B',2),('A',2),('A',3),('B',4),('A',2),('B',2),('c',9)], schema=("id","count"))
tst1 = sqlContext.createDataFrame([('A',4),('B',1)], schema=("id","val"))
# Aggregate and join
tst_g=tst.groupby('id').agg(F.max('count').alias('count'))
tst_j= tst_g.join(tst1,tst_g.id==tst1.id,'left')
# Filter result
tst_f = tst_j.where((F.col('count')>=F.col('val'))|(F.col('val').isNull()))

The result is:

tst_j.show()

+---+-----+----+----+
| id|count|  id| val|
+---+-----+----+----+
|  c|    9|null|null|
|  B|    4|   B|   1|
|  A|    3|   A|   4|
+---+-----+----+----+
 tst_f.show()
+---+-----+----+----+
| id|count|  id| val|
+---+-----+----+----+
|  c|    9|null|null|
|  B|    4|   B|   1|
+---+-----+----+----+

Finally, you can drop the irrelevant columns.
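
Translated to the question's Scala setting, the same aggregate-join-filter-drop sequence might look roughly like the sketch below (it assumes messageDF and editedDF with the column names from the question, and it only keeps the latest publishedTimestamp per index rather than the full rows):

import org.apache.spark.sql.functions.{col, max}

// Keep only the latest message timestamp per index.
val latestPerIndex = messageDF
  .groupBy("index")
  .agg(max("publishedTimestamp").alias("publishedTimestamp"))

// Left-join the edit events onto the aggregated messages.
val joined = latestPerIndex.join(editedDF, Seq("index"), "left")

// Keep indexes that were never edited, or whose latest message is at or after the edit.
val kept = joined.where(
  col("publishedTimestamp") >= col("timestamp") || col("timestamp").isNull
)

// Drop the edit timestamp column, which is no longer needed.
val result = kept.drop("timestamp")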

If you need the full data, then you can join the update table with the message table and do the same. If the update table is small, consider a broadcast join for performance reasons.

# Approach to join with full table
# Test dataframe
tst=sqlContext.createDataFrame([('A',2),('B',2),('A',2),('A',3),('B',4),('A',2),('B',2),('c',9)],schema=("id","count"))
tst1 = sqlContext.createDataFrame([('A',4),('B',1)],schema=("id","val"))
#%%
# join with the full table
tst_j= tst.join(tst1,tst.id==tst1.id,'left')
# Filter result
tst_f = tst_j.where((F.col('count')>=F.col('val'))|(F.col('val').isNull()))

Hint: if you don't want two id columns in your result, you can change the join syntax to tst.join(tst1, on="id", how='left')
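
In Scala, the broadcast hint mentioned above could be written roughly like this (a sketch; it assumes editedDF is the small update table, and joining on the shared column name also avoids the duplicate-column issue from the hint):

import org.apache.spark.sql.functions.broadcast

// Ask Spark to broadcast the small edited table to all executors before joining.
val joinedWithHint = messageDF.join(broadcast(editedDF), Seq("index"), "left")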

Here's what I ended up doing:

val editedDF = Seq(("A",3),("B",3)).toDF("id","timestamp")
val messageDF = Seq(("A",2),("B",2),("A",2),("A",3),("B",4),("A",2),("B",2),("c",9)).toDF("id","timestamp")

Finally I used this join:

    // Filter out the edited messages.
    val editedFilteredDF = messageDF.join(editedDF,
      (editedDF("id") === messageDF("id")) && (editedDF("timestamp") > messageDF("timestamp")),
      joinType = "left_anti")

The result:

 editedFilteredDF.show()
+---+---------+
| id|timestamp|
+---+---------+
|  A|        3|
|  B|        4|
|  c|        9|
+---+---------+
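
For completeness, the same left_anti join written against the original column names from the question would look roughly like this (a sketch, assuming messageDF and editedDF as shown in the question's tables):

    // Drop every message that has a matching index in editedDF
    // and was published before the corresponding edit.
    val filteredMessagesDF = messageDF.join(editedDF,
      (editedDF("index") === messageDF("index")) &&
        (editedDF("timestamp") > messageDF("publishedTimestamp")),
      joinType = "left_anti")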
