Remove those rows from a pyspark dataframe whose entries from a column are not present in a dictionary's list of keys
I am new to pyspark, and I have a pyspark dataframe as follows:
+---+---+---+
|C1 |C2 |c3 |
+---+---+---+
|A  |0  |1  |
|C  |0  |1  |
|A  |1  |0  |
|B  |0  |0  |
+---+---+---+
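I also have a python dictionary as follows:
my_dict = {"A" : "5", "B" : "10"} # Note there is no entry with key 'C' here
I want my dataframe to keep only those rows whose C1 value appears as a key in the dictionary my_dict. The output should look something like this: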
+---+---+---+
|C1 |C2 |c3 |
+---+---+---+
|A  |0  |1  |
|A  |1  |0  |
|B  |0  |0  |
+---+---+---+
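Edit: The entries in column C1 are a bit more complex than shown above. They are strings, but they contain quite a few special characters, like this: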
A : www.A.com || u-a : Mozilla/5.0 (iPhone; CPU iPhone OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12H321 [FBAN/FBIOS;FBAV/163.0.0.54.96;FBBV/96876057;FBDV/iPhone7,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/8.4.1;FBSS/3;FBCR/MEO;FBID/phone;FBLC/pt_PT;FBOP/5;FBRV/98697066] || C : none || accept-encoding : gzip, deflate, br || accept-language : en-US,en;q=0.9=223
Strings like the one above are also used as the keys in the dictionary.
You can try the following:
from pyspark.sql.functions import col

input_data = [['A', 0, 1], ['C', 0, 1], ['A', 1, 0], ['B', 0, 0]]
my_dict = {"A": "5", "B": "10"}
# No schema is given, so Spark names the columns _1, _2, _3
data = spark.createDataFrame(input_data)
input_key_list = list(my_dict.keys())
# Keep only rows whose first column is one of the dictionary keys
data.where(col("_1").isin(input_key_list)).show()
Another approach might be:
from pyspark.sql.types import StringType

input_data = [['A', 0, 1], ['C', 0, 1], ['A', 1, 0], ['B', 0, 0]]
input_data_columns = ["c1", "c2", "c3"]
my_dict = {"A": "5", "B": "10"}
input_key_list = list(my_dict.keys())
# A list of strings with StringType() yields a single column named "value"
keys_data = spark.createDataFrame(input_key_list, StringType())
data = spark.createDataFrame(input_data, schema=input_data_columns)
# The inner join keeps only rows whose c1 matches one of the keys
keys_data.join(data, data.c1 == keys_data.value, "inner").select("c1", "c2", "c3").show(truncate=False)