Filter the pyspark dataframe based on values in list

Question

I am fairly new to PySpark. I have a PySpark dataframe that records how many times a particular person has received a message from a brand. It has three columns, id, brand and Count, as shown below.

|  id |  brand  | Count |
|:---:|:-------:|:-----:|
| 143 |  AD-ABC |   3   |
| 314 | AX-DEFG |   8   |
| 381 |  AD-ABC |   6   |
| 425 | AD-XYZP |   7   |
| 432 |  AD-GAF |   8   |
| 102 |  AD-GAF |   1   |
| 331 |  AX-ABC |   10  |
| 191 |  AD-GAF |   9   |
| 224 |  AD-GAF |   6   |

The brand column is a bit complex, and I want to derive a new column brand2 from it by keeping the characters after the -, as shown below.

+-----+---------+-------+--------+
| id  |  brand  | Count | brand2 |
+-----+---------+-------+--------+
| 143 | AD-ABC  |     3 | ABC    |
| 314 | AX-DEFG |     8 | DEFG   |
| 381 | AD-ABC  |     6 | ABC    |
| 425 | AD-XYZP |     7 | XYZP   |
| 432 | AD-GAF  |     8 | GAF    |
| 102 | AD-GAF  |     1 | GAF    |
| 331 | AX-ABC  |    10 | ABC    |
| 191 | AD-GAF  |     9 | GAF    |
| 224 | AD-GAF  |     6 | GAF    |
+-----+---------+-------+--------+

I have a very large list of brands, and I want to keep only the rows of the dataframe whose brand2 value is in that list:

brand_subset = ['ABC', 'DEF', 'XYZP'] #The list is very large !!

The dataframe I want is shown below.

+-----+---------+-------+--------+
| id  |  brand  | Count | brand2 |
+-----+---------+-------+--------+
| 143 | AD-ABC  |     3 | ABC    |
| 381 | AD-ABC  |     6 | ABC    |
| 425 | AD-XYZP |     7 | XYZP   |
| 331 | AX-ABC  |    10 | ABC    |
+-----+---------+-------+--------+

The above is just a sample scenario; in practice both the list and the table are very large.

Any help will be appreciated. (Ideally the solution would be optimized for the size of the data.)

Answer 1

Split the brand column on - and take the second element, then use isin to check whether brand2 is in the list:

import pyspark.sql.functions as F
brand_subset = ['ABC', 'DEF', 'XYZP']

(df.withColumn("brand2",F.split("brand","-")[1]).where(F.col("brand2")
                                          .isin(brand_subset))).show()

or, equivalently, with filter (an alias for where):

(df.withColumn("brand2",F.split("brand","-")[1]).filter(F.col("brand2")
                                            .isin(brand_subset))).show()

+---+-------+-----+------+
| id|  brand|Count|brand2|
+---+-------+-----+------+
|143| AD-ABC|    3|   ABC|
|381| AD-ABC|    6|   ABC|
|425|AD-XYZP|    7|  XYZP|
|331| AX-ABC|   10|   ABC|
+---+-------+-----+------+
