
How to get this kind of subset from a DataFrame in PySpark?

For example, I have the following DataFrame:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   2 | 20 | b    |
|   3 | 30 | a    |
+-----+----+------+

I want to obtain such a subset via the following sequential steps:

  1. get all the ids of type a
    • the filtered ids are 10 and 30
  2. get all the rows whose id matches one of the ids above
    • rows 0, 1 and 3 are selected

The resulting subset DataFrame is:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   3 | 30 | a    |
+-----+----+------+

How can I implement this in PySpark? Thanks in advance.


A follow-up question: how can the following be implemented?

If the second step is changed to:

  2. get all the rows whose id is different from the ids above
    • only row 2 is selected, because its id (20) is not 10 or 30

The resulting DataFrame should be:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   2 | 20 | b    |
+-----+----+------+

You can use the filter and join operations. For the first question:

# ids that have at least one row of type "a"
filterDF = dataDF.filter(dataDF.type == "a").select("id").distinct()
# keep every row whose id appears in filterDF
joinedDS = dataDF.join(filterDF, on="id")
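
For reference, here is a minimal self-contained sketch of the same approach (assuming an active SparkSession; the sample data mirrors the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the example DataFrame from the question
dataDF = spark.createDataFrame(
    [(0, 10, "a"), (1, 10, "b"), (2, 20, "b"), (3, 30, "a")],
    ["idx", "id", "type"],
)

# step 1: ids that have at least one row of type "a" -> 10 and 30
filterDF = dataDF.filter(dataDF.type == "a").select("id").distinct()

# step 2: keep every row whose id appears in filterDF
joinedDS = dataDF.join(filterDF, on="id")
joinedDS.select("idx", "id", "type").orderBy("idx").show()
# rows (0, 10, a), (1, 10, b) and (3, 30, a) -- the expected subset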

For the second question, you can use a left_anti join:

# rows of dataDF whose id does NOT appear in joinedDS (equivalently, in filterDF)
joinedDS1 = dataDF.join(joinedDS, on="id", how="left_anti")
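
As a side note, an alternative sketch (not part of the answer above): when the set of matching ids is small, you can collect it to the driver and use isin. The join-based version avoids moving data to the driver and generally scales better.

# collect the distinct matching ids to the driver (fine when the list is small)
ids = [row.id for row in filterDF.collect()]  # [10, 30]

subset = dataDF.filter(dataDF.id.isin(ids))         # rows 0, 1 and 3
complement = dataDF.filter(~dataDF.id.isin(ids))    # row 2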
