Check for words from list and remove those words in pandas dataframe column

I have a list as follows:

remove_words = ['abc', 'def', 'pls']

The following is my data frame, which has a column named 'string':

     data['string']

0    abc stack overflow
1    abc123
2    def comedy
3    definitely
4    pls lkjh
5    pls1234

I want to remove every word in the remove_words list from this column, but only where it occurs as a standalone word, not as part of a longer token.

For example, a standalone 'abc' should be replaced with '', but 'abc123' should be left as it is. The expected output is:

     data['string']

0    stack overflow
1    abc123
2    comedy
3    definitely
4    lkjh
5    pls1234

In my actual data, the remove_words list has 2000 words and the data frame has 5 billion records, so I am looking for the most efficient way to do this.

I have tried a few things in Python without much success. Can anybody help me with this? Any ideas would be helpful.

Thanks

Try this:

In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))

In [99]: pat
Out[99]: '\\b(?:abc|def|pls)\\b'

In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)

In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
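
For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same word-boundary approach. Two things in it are my additions rather than part of the answer above: re.escape, on the assumption that some words in a real 2000-word list may contain regex metacharacters, and a whitespace cleanup step, since removing a word otherwise leaves its neighbouring space behind. The explicit regex=True is there because newer pandas versions treat the pattern as a literal string unless it is passed.

```python
import re

import pandas as pd

remove_words = ['abc', 'def', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'def comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# Escape each word so characters such as '.' or '+' are matched literally
# (assumption: the real word list may contain such characters).
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words)))

df['new'] = (
    df['string']
    .str.replace(pat, '', regex=True)      # remove standalone words only
    .str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace left behind
    .str.strip()                           # trim leading/trailing spaces
)

print(df['new'].tolist())
# ['stack overflow', 'abc123', 'comedy', 'definitely', 'lkjh', 'pls1234']
```

If the exact original spacing matters, drop the last two steps and keep only the first str.replace call.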

Totally taking @MaxU's pattern!

We can use pd.DataFrame.replace by setting the regex parameter to True and passing a dictionary of dictionaries that specifies the pattern and what to replace with for each column.

pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])

df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])

               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
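
Given the scale mentioned in the question (2000 words, billions of rows), it may also be worth trying a non-regex alternative: split each string on whitespace and drop the tokens found in a set. This is a different technique from the two answers above, not a restatement of them, and it is only a sketch. The function name drop_listed_words is made up for illustration, and I have not benchmarked it against the regex versions. Note that it is not equivalent to \b matching: a token such as 'abc,' (with attached punctuation) is left alone here, whereas the regex would remove the 'abc' part. It also normalizes whitespace as a side effect.

```python
remove_set = frozenset(['abc', 'def', 'pls'])


def drop_listed_words(text, words=remove_set):
    """Drop whitespace-separated tokens that exactly match a listed word.

    Hypothetical helper for illustration; each token costs one set lookup.
    """
    return ' '.join(tok for tok in text.split() if tok not in words)


rows = ['abc stack overflow', 'abc123', 'def comedy',
        'definitely', 'pls lkjh', 'pls1234']

cleaned = [drop_listed_words(s) for s in rows]
print(cleaned)
# ['stack overflow', 'abc123', 'comedy', 'definitely', 'lkjh', 'pls1234']
```

With a data frame, the same function could be applied as df['string'].map(drop_listed_words). Whether this beats the single compiled regex on real data is something to measure rather than assume.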
