
How to get this kind of subset from a DataFrame in Pyspark?

For example, I have the following DataFrame:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   2 | 20 | b    |
|   3 | 30 | a    |
+-----+----+------+
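
For reference, a minimal sketch that builds this example DataFrame (assuming a local SparkSession; the name dataDF matches the answer below):

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession and the example rows shown above
spark = SparkSession.builder.master("local[*]").getOrCreate()
dataDF = spark.createDataFrame(
    [(0, 10, "a"), (1, 10, "b"), (2, 20, "b"), (3, 30, "a")],
    ["idx", "id", "type"],
)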

I want to obtain such a subset via the following sequential steps:

  1. get all the ids of type a
    • the filtered ids are 10 and 30
  2. get all the rows whose id is among those ids
    • rows 0 , 1 and 3 are selected

The resulting subset DataFrame is:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   3 | 30 | a    |
+-----+----+------+

How can I implement this in pyspark? Thanks in advance.


Another follow-up question: how to implement the following.

If the step is changed to:

  1. get all the rows whose id is different from those above
    • row 2 is selected, because only this row's id is neither 10 nor 30

The resulting DataFrame should be:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   2 | 20 | b    |
+-----+----+------+

You can use filter and join operations. For point number 1:

# Rows of type "a" provide the ids to keep
filterDF = dataDF.filter(dataDF.type == "a")
# left_semi keeps dataDF rows whose id matches, without duplicate columns
joinedDS = dataDF.join(filterDF, on="id", how="left_semi")
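
With the example data above, this keeps rows 0, 1 and 3, matching the desired subset (a usage sketch; the show() output follows from the expected result stated in the question):

joinedDS.orderBy("idx").show()
# +---+---+----+
# |idx| id|type|
# +---+---+----+
# |  0| 10|   a|
# |  1| 10|   b|
# |  3| 30|   a|
# +---+---+----+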

For point number 2 you can use a left_anti join:

# Anti join: keep dataDF rows whose id has no match in joinedDS
joinedDS1 = dataDF.join(joinedDS, on="id", how="left_anti")
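
Since only the ids matter for the match, anti-joining dataDF against filterDF directly gives the same result; a sketch under the same assumptions (otherDS is a hypothetical name):

# Rows whose id never appears among the type "a" ids
otherDS = dataDF.join(filterDF, on="id", how="left_anti")
otherDS.show()
# +---+---+----+
# |idx| id|type|
# +---+---+----+
# |  2| 20|   b|
# +---+---+----+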
