Select rows in a DataFrame where the ID must satisfy two conditions based on two different rows in PySpark
I have a DataFrame that is structured like this:
ID  | DATE       | ACTIVE | LEFT | NEW
123 | 2021-01-01 | 1      | 0    | 1
456 | 2021-03-01 | 1      | 0    | 1
456 | 2021-06-01 | 1      | 1    | 0
479 | 2020-06-01 | 1      | 1    | 0
567 | 2021-07-01 | 1      | 1    | 0
I want to implement a query in PySpark that returns every ID that has both a row where NEW == 1 and a row where LEFT == 1, with the two conditions appearing in different rows.
So in this case, I'd like to return the following rows:
ID  | DATE       | ACTIVE | LEFT | NEW
456 | 2021-03-01 | 1      | 0    | 1
456 | 2021-06-01 | 1      | 1    | 0
Thanks in advance!
PS: the original dataset has over 13 million entries.
Here is a solution you can try: first filter to the rows where either flag is set, then group by ID to identify the IDs that appear more than once, and inner join that result back to the filtered DataFrame.
# Keep only rows that satisfy at least one of the two conditions
df_filter = df.filter((df.LEFT == 1) | (df.NEW == 1))

df_filter.join(
    # IDs with more than one qualifying row, i.e. with both
    # a NEW == 1 row and a LEFT == 1 row
    df_filter.groupBy("ID").count().where("count > 1"),
    on=["ID"],
).drop("count").show()
+---+----------+------+----+---+
| ID| DATE|ACTIVE|LEFT|NEW|
+---+----------+------+----+---+
|456|2021-03-01| 1| 0| 1|
|456|2021-06-01| 1| 1| 0|
+---+----------+------+----+---+