
Select rows in a Data Frame where the ID must have two conditions based on two different rows in PySpark

I have a Data Frame that is structured like this:

ID   |   DATE   |   ACTIVE   |  LEFT  |  NEW  |
123  |2021-01-01|      1     |    0   |   1   |
456  |2021-03-01|      1     |    0   |   1   |
456  |2021-06-01|      1     |    1   |   0   |
479  |2020-06-01|      1     |    1   |   0   |
567  |2021-07-01|      1     |    1   |   0   |

I want to implement a query in PySpark that returns all IDs that have both a row where "NEW == 1" and a row where "LEFT == 1", i.e. the two conditions are satisfied on different rows of the same ID.

So in this case, I'd like to return the following rows:

ID   |   DATE   |   ACTIVE   |  LEFT  |  NEW  |
456  |2021-03-01|      1     |    0   |   1   |
456  |2021-06-01|      1     |    1   |   0   |

Thanks in advance!

PS: The original dataset has over 13 million entries.

Here is a solution you can try: filter on the two conditions, then use groupBy to find the IDs that appear more than once, and inner join that back to the filtered dataframe.

df_filter = df.filter((df.LEFT == 1) | (df.NEW == 1))

df_filter.join(
    # IDs that appear more than once after filtering,
    # i.e. that matched the conditions on different rows
    df_filter.groupBy("ID").count().where("count > 1"),
    on=["ID"],
).drop("count").show()

+---+----------+------+----+---+
| ID|      DATE|ACTIVE|LEFT|NEW|
+---+----------+------+----+---+
|456|2021-03-01|     1|   0|  1|
|456|2021-06-01|     1|   1|  0|
+---+----------+------+----+---+
