
Remove specific words from column using Python

The data was originally extracted from a PDF for further analysis. There is an [identity] column where some of the values are malformed, i.e. they contain misspellings or special characters.

I am looking to remove the unwanted characters from the column.

Input Data:

identity

UK25463AC
ID:- UN67342OM
#ID!?
USA5673OP

Expected Output:

identity

UK25463AC
UN67342OM
NAN
USA5673OP

Script I have tried so far:

stop_words = ['#ID!?','ID:-']
pat = '|'.join(r"\b{}\b".format(x) for x in stop_words)
df['identity'] = df['identity'].str.replace(pat, '')

So far I have no clue how to handle this problem.

Based on the expected output, the word boundaries \b...\b have to be removed, and because the stop words contain special regex characters, re.escape is added. Then Series.replace substitutes the matches with an empty string, and values that end up as only an empty string are converted to missing values:

import re
import numpy as np

stop_words = ['#ID!?','ID:-']
pat = '|'.join(re.escape(x) for x in stop_words)  # escape special regex characters
df['identity'] = (df['identity']
                  .replace(pat, '', regex=True)   # remove the stop words
                  .str.strip()                    # drop leftover whitespace
                  .replace('', np.nan))           # empty strings -> missing values
print(df)
     identity
0   UK25463AC
1   UN67342OM
2         NaN
3   USA5673OP
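
As an alternative sketch: if the valid identifiers always follow the same letters-digits-letters shape as the sample data (an assumption, since the question does not state this), Series.str.extract can pull out just the ID and leave everything else as a missing value:

import pandas as pd

# sample data copied from the question
df = pd.DataFrame({'identity': ['UK25463AC', 'ID:- UN67342OM', '#ID!?', 'USA5673OP']})

# assumed pattern: 2-3 uppercase letters, digits, then trailing uppercase letters
df['identity'] = df['identity'].str.extract(r'([A-Z]{2,3}\d+[A-Z]+)', expand=False)
print(df)

This avoids maintaining a stop-word list, but it only works if every valid identity matches the assumed pattern.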

