
How to match strings from a list of strings while ignoring regex special characters?

I have this string, stored in a dictionary:

d = {'col1': ['Digital Forms - how to spousal information on DF 2,0']}

I turned it into a dataframe:

df = pd.DataFrame(d)

From this dataframe, I want to match this list of words:

wordlist = ['Digital Forms', 'how', 'spousal', 'DF 2.0']

I used the findall function with some regex to return my list:

words = df['col1'].str.findall(r"\b("+'|'.join(wordlist)+r")\b", flags=re.IGNORECASE)

This was the result:

[Digital Forms, how, spousal, DF 2,0]

I want to exclude DF 2,0 from the result, since it is not in my word list. I know that in regex the dot (.) is a special character that matches any character, so the dot in DF 2.0 also matches the comma in DF 2,0. I tried modifying my script to include something like '\\.' so that the dot would be treated literally, but nothing worked.

Can someone help me modify the following so that the dot is treated as a literal character?

df['col1'].str.findall(r"\b("+'|'.join(wordlist)+r")\b", flags=re.IGNORECASE)

You may form a regex alternation from your word list using re.escape to escape the metacharacters:

import re

wordlist = ['Digital Forms', 'how', 'spousal', 'DF 2.0']
# re.escape turns metacharacters such as "." into literals before joining
regex = r'\b(' + '|'.join([re.escape(x) for x in wordlist]) + r')\b'
words = df['col1'].str.findall(regex, flags=re.IGNORECASE)
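
To see the difference the escaping makes, here is a minimal sketch that uses plain re.findall on the sample string instead of pandas, so it runs without a DataFrame. The helper name build_pattern and its escape parameter are made up for this illustration; the pattern it builds is the same alternation as above.

```python
import re

text = "Digital Forms - how to spousal information on DF 2,0"
wordlist = ["Digital Forms", "how", "spousal", "DF 2.0"]

def build_pattern(words, escape=True):
    # Hypothetical helper: joins the words into one alternation,
    # optionally escaping regex metacharacters in each word first.
    parts = [re.escape(w) if escape else w for w in words]
    return r"\b(" + "|".join(parts) + r")\b"

# Without escaping, the "." in "DF 2.0" matches any character, including ","
unescaped = re.findall(build_pattern(wordlist, escape=False), text, flags=re.IGNORECASE)
print(unescaped)  # ['Digital Forms', 'how', 'spousal', 'DF 2,0']

# With escaping, "." only matches a literal dot, so "DF 2,0" is skipped
escaped = re.findall(build_pattern(wordlist), text, flags=re.IGNORECASE)
print(escaped)  # ['Digital Forms', 'how', 'spousal']

# A literal "DF 2.0" is still found (case-insensitively)
literal_hit = re.findall(build_pattern(wordlist), "notes on df 2.0", flags=re.IGNORECASE)
print(literal_hit)  # ['df 2.0']
```

The same pattern string can be passed to df['col1'].str.findall as in the snippet above.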
