How do I get this regular expression to ignore accented characters?

I have a text file that my script is reading and getting the most frequent words from. However, at one point in the process of doing that, during the clean-up of the source text, it cannot handle accented characters (in this case, they are áéíóöőúüű).

This is what I have at the moment.

str = re.sub(r'\W+', ' ', str)

This simply deletes the accented characters. I have tried adding flags=re.U, but it just messed up the result in a different way. I suspect there is a simple way to solve my problem; I have looked for it without success, so I turn to you. Thanks in advance.

You need to use the right modifier:

str = re.sub(ur'\W+', u' ', s, flags=re.UNICODE)
                                     ^^^^^^^^^^

See the Python 2.x docs:

Make the \w, \W, \b, \B, \d, \D, \s and \S sequences dependent on the Unicode character properties database. Also enables non-ASCII matching for IGNORECASE.
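To see what the flag changes, here is a small sketch written for Python 3 (an assumption on my part, since the question targets Python 2). In Python 3, str patterns are Unicode-aware by default, and re.ASCII switches \W back to ASCII-only matching, which roughly reproduces the behaviour seen in the question. The sample string is made up for illustration.

```python
import re

text = "they are áéíóöőúüű"

# re.ASCII: only [a-zA-Z0-9_] count as word characters, so the
# accented letters are treated as non-word characters and removed.
ascii_result = re.sub(r'\W+', ' ', text, flags=re.ASCII)

# Default for str patterns in Python 3: \W follows the Unicode
# database, so accented letters are word characters and are kept.
unicode_result = re.sub(r'\W+', ' ', text)

print(repr(ascii_result))
print(repr(unicode_result))
```

In Python 2 the defaults are the other way round: matching is ASCII-only unless re.UNICODE is passed.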

Here is an online Python 2.7 demo:

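# -*- coding: utf-8 -*-
# Python 2 needs the encoding declaration above when the source file contains non-ASCII literals.
import re
s = u"characters (in this case, they are áéíóöőúüű)."
res = re.sub(ur'\W+', u' ', s, flags=re.UNICODE).encode("utf8")
print(res) # => characters in this case they are áéíóöőúüű 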
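If you are on Python 3, the same clean-up should work without the u prefixes or the flag, because str patterns are Unicode-aware by default. Below is a minimal sketch of the whole clean-up and counting step under that assumption. The helper name most_frequent_words, the lowercasing step, and the sample text are mine, not from the question.

```python
import re
from collections import Counter

def most_frequent_words(text, n=2):
    """Hypothetical helper: return the n most common words in text."""
    # Collapse every run of non-word characters into one space.
    # In Python 3, \W is Unicode-aware for str, so accents survive.
    cleaned = re.sub(r'\W+', ' ', text)
    # Lowercase so that "Alma" and "alma" count as the same word.
    words = cleaned.lower().split()
    return Counter(words).most_common(n)

sample = "Alma, körte; alma! Körte és szőlő... alma?"
top = most_frequent_words(sample)
print(top)
```

When reading from a file, open it with an explicit encoding (for example open(path, encoding="utf-8")) so the text arrives as a properly decoded str before the regular expression runs.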
