Select rows in a DataFrame where the ID must satisfy two conditions based on two different rows in PySpark
I have a DataFrame that is structured like this:
ID  | DATE       | ACTIVE | LEFT | NEW
123 | 2021-01-01 | 1      | 0    | 1
456 | 2021-03-01 | 1      | 0    | 1
456 | 2021-06-01 | 1      | 1    | 0
479 | 2020-06-01 | 1      | 1    | 0
567 | 2021-07-01 | 1      | 1    | 0
I want to implement a query in PySpark that returns every ID that has both a row where NEW == 1 and a row where LEFT == 1, with the two conditions appearing in different rows.
So in this case, I'd like to return the following rows:
ID  | DATE       | ACTIVE | LEFT | NEW
456 | 2021-03-01 | 1      | 0    | 1
456 | 2021-06-01 | 1      | 1    | 0
Thanks in advance!
PS: the original dataset has over 13 million entries.
Here is a solution you can try: first filter to the rows where either flag is set, then group by ID to identify the IDs that appear more than once, and inner join that result back to the filtered DataFrame.
# Keep only rows that satisfy at least one of the two conditions
df_filter = df.filter((df.LEFT == 1) | (df.NEW == 1))

df_filter.join(
    # IDs with more than one qualifying row, i.e. with both
    # a NEW == 1 row and a LEFT == 1 row
    df_filter.groupBy("ID").count().where("count > 1"),
    on=["ID"],
).drop("count").show()
+---+----------+------+----+---+
| ID| DATE|ACTIVE|LEFT|NEW|
+---+----------+------+----+---+
|456|2021-03-01| 1| 0| 1|
|456|2021-06-01| 1| 1| 0|
+---+----------+------+----+---+