Python Pandas partial match of list of string in dataframe and return all match partial string
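Hi everyone, I am trying to match partial strings in a column of a DataFrame and return the matched strings (uppercase only). I don't have strong programming knowledge; I have only just started learning.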
import pandas as pd

# List of state abbreviations
state_abbrv = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA",
               "ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
               "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]
# Create the DataFrame
d = {"Index": [1, 2, 3, 4, 5, 6, 7], "Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY", "CWARD NY", "HOWARD BEACH NY"]}
df = pd.DataFrame(data=d)
This is df:
Index Description
1 ABNY
2 MANY
3 NYNY
4 DO
5 nyNY
6 CWARD NY
7 HOWARD BEACH NY
This is my code:
df = df.assign(State = df["Description"].str.findall(state_abbrv))
This is the expected result:
Index Description State
1 ABNY NY
2 MANY MA,NY
3 NYNY NY,NY
4 DO
5 nyNY NY
6 CWARD NY WA,NY
7 HOWARD BEACH NY WA,AR,NY
Thanks.
You can try using join, and then str.findall:
statesjoin='|'.join(state_abbrv)
df=df.assign(State = df["Description"].str.findall(statesjoin))
Output:
df
Index Description State
0 1 ABNY [NY]
1 2 MANY [MA, NY]
2 3 NYNY [NY, NY]
3 4 DO []
4 5 nyNY [NY]
5 6 ABALBB [AL]
6 7 ALCA [AL, CA]
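The State column above holds lists, while the question asks for comma-separated strings. A minimal sketch of one way to get there, assuming that chaining str.join after str.findall is acceptable; the shortened sample data and the use of re.escape are illustrative choices, not part of the original answer:

```python
import re
import pandas as pd

state_abbrv = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA",
               "ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
               "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]

# Shortened sample data (assumption: the first five rows of the question)
df = pd.DataFrame({"Index": [1, 2, 3, 4, 5],
                   "Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY"]})

# Build one alternation pattern: "AL|AK|AZ|..."
pattern = "|".join(map(re.escape, state_abbrv))

# findall returns a list per row; str.join collapses each list into one string
df["State"] = df["Description"].str.findall(pattern).str.join(",")
print(df)
```

Note that the pattern is case-sensitive, so the lowercase "ny" in "nyNY" is ignored, and that findall only returns non-overlapping matches scanned from left to right.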
For the possible case described by @AkshaySehgal, you can try this:
import re
df=df.assign(State = df["Description"].apply(lambda x: ','.join(re.findall('..',x))).str.findall(statesjoin))
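To make the chunking step easier to follow, here is a small sketch of what re.findall('..', x) produces and which chunks survive the filter. The helper name chunk_matches and the shortened abbreviation list are hypothetical and used for illustration only; the helper filters chunks directly instead of joining and re-scanning, which should give the same matches because no abbreviation contains a comma.

```python
import re

# Shortened list for illustration (assumption: a subset of state_abbrv)
state_abbrv = ["AL", "AR", "CA", "MA", "NY", "WA"]

def chunk_matches(text):
    """Split text into consecutive 2-character chunks and keep the known abbreviations."""
    chunks = re.findall("..", text)  # a trailing odd character is dropped
    return [c for c in chunks if c in state_abbrv]

print(re.findall("..", "ABALBB"))        # ['AB', 'AL', 'BB']
print(chunk_matches("ABALBB"))           # ['AL']
print(chunk_matches("HOWARD BEACH NY"))  # ['WA']
```

Because the chunks start at fixed even offsets, an abbreviation that begins at an odd position is missed. That is why "HOWARD BEACH NY" yields only WA here: AR and NY both start at odd offsets.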
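Instead of joining all the state abbreviations into a single pattern string and matching with that (which gives incorrect results when one abbreviation ends with the same character that another begins with), you can use this: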
def get_common(s):
    parts = set(map(''.join, zip(*[iter(s)]*2)))  # Break string into 2-length tokens
    common = ', '.join(list(parts.intersection(set(state_abbrv))))  # Intersection between tokens and abbreviations
    return common

df['State'] = df['Description'].apply(get_common)
Index Description State
1 ABNY NY
2 MANY MA,NY
3 NYNY NY,NY
4 DO
5 nyNY NY
6 ABALBB AL
7 ALCA AL,CA
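The expected result in the question lists overlapping matches (for example WA and AR inside "HOWARD"). Neither a plain alternation nor fixed 2-character tokens will return those. One possible approach, swapped in here as an alternative rather than taken from the answers above, is a regex lookahead, which tests every position without consuming characters. The names overlap_pattern and find_states are hypothetical:

```python
import re

state_abbrv = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA",
               "ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
               "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]

# (?=(AL|AK|...)) matches the empty string at each position where an
# abbreviation starts, and captures that abbreviation in group 1.
overlap_pattern = re.compile("(?=(" + "|".join(state_abbrv) + "))")

def find_states(text):
    """Return all (possibly overlapping) uppercase abbreviations, comma-separated."""
    return ",".join(overlap_pattern.findall(text))

print(find_states("MANY"))             # MA,NY
print(find_states("HOWARD BEACH NY"))  # WA,AR,NY
```

With a DataFrame like the one in the question, this could be applied as df["State"] = df["Description"].apply(find_states). Duplicates and left-to-right order are preserved, and matching stays case-sensitive.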