
How to get this kind of subset from a DataFrame in PySpark?

For example, I have the following DataFrame:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   2 | 20 | b    |
|   3 | 30 | a    |
+-----+----+------+

I want to obtain such a subset via the following sequential steps:

  1. get all the ids of type a
    • the filtered ids are 10 and 30
  2. get all the rows whose id matches one of the ids above
    • rows 0, 1 and 3 are selected

The resulting subset DataFrame is:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   3 | 30 | a    |
+-----+----+------+

How can I implement this in PySpark? Thanks in advance.


A follow-up question: how can the following be implemented?

If the second step is changed to:

  2. get all the rows whose id is different from the ids above
    • only row 2 is selected, because its id (20) is not 10 or 30

The resulting DataFrame should be:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   2 | 20 | b    |
+-----+----+------+

You can use the filter and join operations. For the first question:

# ids that have at least one row of type "a"
filterDF = dataDF.filter(dataDF.type == "a").select("id").distinct()
# keep every row whose id appears in filterDF
joinedDS = dataDF.join(filterDF, on="id")
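
For reference, here is a minimal self-contained sketch of the same approach (assuming an active SparkSession; the sample data mirrors the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the example DataFrame from the question
dataDF = spark.createDataFrame(
    [(0, 10, "a"), (1, 10, "b"), (2, 20, "b"), (3, 30, "a")],
    ["idx", "id", "type"],
)

# step 1: ids that have at least one row of type "a" -> 10 and 30
filterDF = dataDF.filter(dataDF.type == "a").select("id").distinct()

# step 2: keep every row whose id appears in filterDF
joinedDS = dataDF.join(filterDF, on="id")
joinedDS.select("idx", "id", "type").orderBy("idx").show()
# rows (0, 10, a), (1, 10, b) and (3, 30, a) -- the expected subset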

For the second question, you can use a left_anti join:

# rows of dataDF whose id does NOT appear in joinedDS (equivalently, in filterDF)
joinedDS1 = dataDF.join(joinedDS, on="id", how="left_anti")
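
As a side note, an alternative sketch (not part of the answer above): when the set of matching ids is small, you can collect it to the driver and use isin. The join-based version avoids moving data to the driver and generally scales better.

# collect the distinct matching ids to the driver (fine when the list is small)
ids = [row.id for row in filterDF.collect()]  # [10, 30]

subset = dataDF.filter(dataDF.id.isin(ids))         # rows 0, 1 and 3
complement = dataDF.filter(~dataDF.id.isin(ids))    # row 2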
