
Filter the pyspark dataframe based on values in list

I am fairly new to pyspark. I have a pyspark dataframe that records how many times a particular person has received a message from a brand. It has three columns, id, brand and Count, as shown below.

|  id |  brand  | Count |
|:---:|:-------:|:-----:|
| 143 |  AD-ABC |   3   |
| 314 | AX-DEFG |   8   |
| 381 |  AD-ABC |   6   |
| 425 | AD-XYZP |   7   |
| 432 |  AD-GAF |   8   |
| 102 |  AD-GAF |   1   |
| 331 |  AX-ABC |   10  |
| 191 |  AD-GAF |   9   |
| 224 |  AD-GAF |   6   |

The brand column is a bit complex, and I want to derive a new column brand2 from it by keeping only the characters after the hyphen, as shown below:

+-----+---------+-------+--------+
| id  |  brand  | Count | brand2 |
+-----+---------+-------+--------+
| 143 | AD-ABC  |     3 | ABC    |
| 314 | AX-DEFG |     8 | DEFG   |
| 381 | AD-ABC  |     6 | ABC    |
| 425 | AD-XYZP |     7 | XYZP   |
| 432 | AD-GAF  |     8 | GAF    |
| 102 | AD-GAF  |     1 | GAF    |
| 331 | AX-ABC  |    10 | ABC    |
| 191 | AD-GAF  |     9 | GAF    |
| 224 | AD-GAF  |     6 | GAF    |
+-----+---------+-------+--------+

I have a very large list of the brands that I want to keep when filtering the dataframe, as below:

brand_subset = ['ABC', 'DEF', 'XYZP'] #The list is very large !!

The dataframe I want to end up with is as below:

+-----+---------+-------+--------+
| id  |  brand  | Count | brand2 |
+-----+---------+-------+--------+
| 143 | AD-ABC  |     3 | ABC    |
| 381 | AD-ABC  |     6 | ABC    |
| 425 | AD-XYZP |     7 | XYZP   |
| 331 | AX-ABC  |    10 | ABC    |
+-----+---------+-------+--------+

The above is just a sample scenario; in practice both the list and the table are very large.

Any help will be appreciated. (It would be good if the solution were optimized for the size of the data.)

Split the brand column on the hyphen and take the second element, then use isin to check whether brand2 is in the list:

import pyspark.sql.functions as F

brand_subset = ['ABC', 'DEF', 'XYZP']

# Take the part after the hyphen as brand2, then keep rows whose brand2 is in the list.
(df.withColumn("brand2", F.split("brand", "-")[1])
   .where(F.col("brand2").isin(brand_subset))
   .show())

or, using filter (where is an alias of filter):

(df.withColumn("brand2", F.split("brand", "-")[1])
   .filter(F.col("brand2").isin(brand_subset))
   .show())

+---+-------+-----+------+
| id|  brand|Count|brand2|
+---+-------+-----+------+
|143| AD-ABC|    3|   ABC|
|381| AD-ABC|    6|   ABC|
|425|AD-XYZP|    7|  XYZP|
|331| AX-ABC|   10|   ABC|
+---+-------+-----+------+
