
Filter a PySpark dataframe based on values in a list

I am fairly new to PySpark. I have a PySpark dataframe that records how many times a particular person has received a message from a brand. It has three columns, id, brand and Count, as shown below.

|  id |  brand  | Count |
|:---:|:-------:|:-----:|
| 143 |  AD-ABC |   3   |
| 314 | AX-DEFG |   8   |
| 381 |  AD-ABC |   6   |
| 425 | AD-XYZP |   7   |
| 432 |  AD-GAF |   8   |
| 102 |  AD-GAF |   1   |
| 331 |  AX-ABC |   10  |
| 191 |  AD-GAF |   9   |
| 224 |  AD-GAF |   6   |

The brand column is a bit complex, and I want to derive a new column brand2 from it by keeping only the characters after the -, as shown below:

+-----+---------+-------+--------+
| id  |  brand  | Count | brand2 |
+-----+---------+-------+--------+
| 143 | AD-ABC  |     3 | ABC    |
| 314 | AX-DEFG |     8 | DEFG   |
| 381 | AD-ABC  |     6 | ABC    |
| 425 | AD-XYZP |     7 | XYZP   |
| 432 | AD-GAF  |     8 | GAF    |
| 102 | AD-GAF  |     1 | GAF    |
| 331 | AX-ABC  |    10 | ABC    |
| 191 | AD-GAF  |     9 | GAF    |
| 224 | AD-GAF  |     6 | GAF    |
+-----+---------+-------+--------+

I have a very large list of brands, and I want to keep only the rows of the dataframe whose brand2 value appears in that list:

brand_subset = ['ABC', 'DEF', 'XYZP'] #The list is very large !!

The desired dataframe is:

+-----+---------+-------+--------+
| id  |  brand  | Count | brand2 |
+-----+---------+-------+--------+
| 143 | AD-ABC  |     3 | ABC    |
| 381 | AD-ABC  |     6 | ABC    |
| 425 | AD-XYZP |     7 | XYZP   |
| 331 | AX-ABC  |    10 | ABC    |
+-----+---------+-------+--------+

The above is just a sample scenario; in practice both the list and the table are very large.

Any help will be appreciated. (It would be good if the solution were optimized for the size of the data.)

Split the brand column and get the second element, then use isin to check if brand2 is in the list:

import pyspark.sql.functions as F
brand_subset = ['ABC', 'DEF', 'XYZP']

(df.withColumn("brand2", F.split("brand", "-")[1])
   .where(F.col("brand2").isin(brand_subset))
   .show())

or, equivalently, with filter (where is an alias of filter):

(df.withColumn("brand2", F.split("brand", "-")[1])
   .filter(F.col("brand2").isin(brand_subset))
   .show())

+---+-------+-----+------+
| id|  brand|Count|brand2|
+---+-------+-----+------+
|143| AD-ABC|    3|   ABC|
|381| AD-ABC|    6|   ABC|
|425|AD-XYZP|    7|  XYZP|
|331| AX-ABC|   10|   ABC|
+---+-------+-----+------+
