How to get this kind of subset from a DataFrame in PySpark?
For example, I have the following DataFrame:
+-----+----+------+
| idx | id | type |
+-----+----+------+
| 0 | 10 | a |
| 1 | 10 | b |
| 2 | 20 | b |
| 3 | 30 | a |
+-----+----+------+
I want such a subset via the following sequential steps:

1. Get all ids of type a; these ids are 10 and 30.
2. Select all rows whose id is the same as above; rows 0, 1 and 3 are selected.

The resulting subset DataFrame is:
+-----+----+------+
| idx | id | type |
+-----+----+------+
| 0 | 10 | a |
| 1 | 10 | b |
| 3 | 30 | a |
+-----+----+------+
How can I implement this in pyspark? Thanks in advance.
Another follow-up question: how to implement the following. If the step is changed to:

- Select all rows whose id is different from above; only row 2 is selected, because only this row's id is not 10 or 30.

The resulting DataFrame should be:
+-----+----+------+
| idx | id | type |
+-----+----+------+
| 2 | 20 | b |
+-----+----+------+
You can use filter and join operations. For point 1:
filterDF = dataDF.filter(dataDF.type == "a")
# keep only the distinct ids so the join does not duplicate columns or rows
joinedDS = dataDF.join(filterDF.select("id").distinct(), on="id")
For point number 2 you can use a left_anti join:
joinedDS1 = dataDF.join(joinedDS, on="id", how='left_anti')