For example, I have the following DataFrame
+-----+----+------+
| idx | id | type |
+-----+----+------+
| 0 | 10 | a |
| 1 | 10 | b |
| 2 | 20 | b |
| 3 | 30 | a |
+-----+----+------+
I want to get such a subset via the following sequential steps:
1. Find the id of the rows whose type is a: those ids are 10 and 30.
2. Select the rows whose id is the same as above: rows 0, 1 and 3 are selected.
The resulting subset DataFrame is:
+-----+----+------+
| idx | id | type |
+-----+----+------+
| 0 | 10 | a |
| 1 | 10 | b |
| 3 | 30 | a |
+-----+----+------+
How can I implement this in PySpark? Thanks in advance.
Another follow-up question: how can I implement the following?
If step 2 is changed to: select the rows whose id is different from above, then only row 2 is selected, because only this row's id is not 10 or 30.
The resulting DataFrame should be:
+-----+----+------+
| idx | id | type |
+-----+----+------+
| 2 | 20 | b |
+-----+----+------+
You can use filter and join operations.
1. For the first subset, filter the rows of type a, then join the original DataFrame back on id so that every row sharing one of those ids is kept:
filterDF = dataDF.filter(dataDF.type == "a")
# join against the distinct ids only, so the other columns are not duplicated
joinedDS = dataDF.join(filterDF.select("id").distinct(), on="id")
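For reference, here is a minimal self-contained sketch of that first result, assuming a local SparkSession and building the sample DataFrame from the question by hand (the final select is only there to restore the idx, id, type column order after the join):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question
dataDF = spark.createDataFrame(
    [(0, 10, "a"), (1, 10, "b"), (2, 20, "b"), (3, 30, "a")],
    ["idx", "id", "type"],
)

# step 1: ids of the rows whose type is "a" -> 10 and 30
filterDF = dataDF.filter(dataDF.type == "a")

# step 2: keep every row whose id is one of those ids -> rows 0, 1 and 3
joinedDS = dataDF.join(filterDF.select("id").distinct(), on="id")
joinedDS.select("idx", "id", "type").orderBy("idx").show()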
2. For the second result you can use a left_anti join:
# keep the rows of dataDF whose id does NOT appear in joinedDS (i.e. neither 10 nor 30)
joinedDS1 = dataDF.join(joinedDS, on="id", how="left_anti")
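Equivalently, you can anti-join directly against the distinct ids found in step 1 instead of against joinedDS; this is just a sketch under the same assumptions as above:
# rows whose id is not among the ids of type "a" -> only row 2 (id 20)
joinedDS1 = dataDF.join(filterDF.select("id").distinct(), on="id", how="left_anti")
joinedDS1.show()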