Replace N digit numbers in a sentence with specific strings for different values of N

I have a bunch of strings in a pandas DataFrame that contain numbers. I could run the code below and replace them all:

df.feature_col = df.feature_col.str.replace(r'\d+', ' NUM ', regex=True)

But what I need to do is replace any 10 digit number with a string like masked_id , any 16 digit numbers with account_number , or any three-digit numbers with yet another string, and so on.

How do I go about doing this?

PS: since my dataset is small, a less-than-optimal approach is good enough for me.

You could do a series of replacements, one for each length of number:

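# \b anchors each pattern so only numbers of exactly that length match;
# regex=True is required because recent pandas versions default to literal matching
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', 'masked_id', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', 'account_number', regex=True)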
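
The per-length replacements above can also be collapsed into a single pass by matching every whole number and choosing the replacement from its length. This is only a sketch: the `LABELS` mapping, the `three_digit` label, the `mask_by_length` helper, and the sample rows are made up for illustration, and it relies on `Series.str.replace` accepting a callable when `regex=True`.

```python
import pandas as pd

# Hypothetical mapping from digit count to replacement label.
LABELS = {3: "three_digit", 10: "masked_id", 16: "account_number"}

def mask_by_length(match):
    """Return the label for this number's length, or the number itself."""
    digits = match.group(0)
    return LABELS.get(len(digits), digits)

# Illustrative data, not taken from the question.
df = pd.DataFrame({"feature_col": ["id 1234567890 ok",
                                   "acct 1234567890123456",
                                   "code 123",
                                   "year 2024"]})

# \b\d+\b matches each standalone run of digits exactly once.
masked = df["feature_col"].str.replace(r"\b\d+\b", mask_by_length, regex=True)
print(masked.tolist())
```

Numbers whose length is not in the mapping (such as the four-digit year here) are left unchanged, and adding a new length only means adding one dictionary entry.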

Another way is to use replace with regex=True and a dictionary of patterns. You can also use somewhat more relaxed match patterns than Tim's, as long as you list them in decreasing order of length:

import pandas as pd

# test data
df = pd.DataFrame({'feature_col':['this has 1234567', 
                                  'this has 1234', 
                                  'this has 123',
                                  'this has none']})

# patterns in decreasing length order
# with no word boundaries, these of course would replace '12345' with 'ID45' :-)
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',   
                                            r'\d{3}': 'ID3'}, 
                                           regex=True)

Output:

     feature_col
0   this has ID7
1   this has ID4
2   this has ID3
3  this has none
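
If partial matches such as '12345' turning into 'ID45' are not wanted, the same dictionary approach should work with word boundaries added to each pattern. A minimal sketch, assuming the numbers are delimited by whitespace or punctuation; the sample rows and the `bounded` name are made up for illustration:

```python
import pandas as pd

# Illustrative data; the middle row holds a 5-digit number on purpose.
df = pd.DataFrame({"feature_col": ["this has 1234567",
                                   "this has 12345",
                                   "this has 123"]})

# \b on both sides restricts each pattern to numbers of exactly that length,
# so the order of the dictionary entries no longer matters.
bounded = df["feature_col"].replace({r"\b\d{7}\b": "ID7",
                                     r"\b\d{4}\b": "ID4",
                                     r"\b\d{3}\b": "ID3"},
                                    regex=True)
print(bounded.tolist())
```

Here the 5-digit number is left as-is because no pattern covers that length.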
                                          
