How do I get this regular expression to ignore accented characters?

Question

I have a text file that my script is reading and getting the most frequent words from. However, at one point in the process of doing that, during the clean-up of the source text, it cannot handle accented characters (in this case, they are áéíóöőúüű).

This is what I have at the moment.

str = re.sub(r'\W+', ' ', str)

This simply deletes the accented characters. I have tried adding flags=re.U, but it just messed up the result in a different way. I suspect there is a simple way to solve my problem and I have looked for it, but haven't been successful and so I turn to you. Thanks in advance.

Answer 1

You need to use the right modifier:

s = re.sub(ur'\W+', u' ', s, flags=re.UNICODE)
                                   ^^^^^^^^^^

See the Python 2.x docs:

Make the \w, \W, \b, \B, \d, \D, \s and \S sequences dependent on the Unicode character properties database. Also enables non-ASCII matching for IGNORECASE.
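
To make the effect of the flag concrete, here is a small sketch written for Python 3, which is an assumption on my part since the question targets Python 2. In Python 3, str patterns are already Unicode-aware by default, so the sketch uses re.ASCII to reproduce the ASCII-only behaviour of \W and compares it with the default. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text containing the accented letters from the question.
text = "árvíztűrő tükörfúrógép"

# ASCII-only matching: accented letters do not count as word characters,
# so \W+ matches them and they are replaced by spaces.
ascii_only = re.sub(r'\W+', ' ', text, flags=re.ASCII)

# Unicode-aware matching (the Python 3 default for str patterns, and what
# re.UNICODE enables in Python 2): accented letters are word characters.
unicode_aware = re.sub(r'\W+', ' ', text)

print(ascii_only)
print(unicode_aware)
```

The first result loses every accented letter, while the second keeps the text intact apart from normalising the whitespace.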

Here is an online Python 2.7 demo:

import re
s = u"characters (in this case, they are áéíóöőúüű)."
res = re.sub(ur'\W+', u' ', s, flags=re.UNICODE).encode("utf8")
print(res) # => characters in this case they are áéíóöőúüű
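
For the word-frequency use case described in the question, the clean-up step could be combined with collections.Counter roughly as follows. This is only a sketch under a few assumptions: it is written for Python 3, the helper name most_frequent_words is hypothetical, and lowercasing the text before counting is my choice rather than something the question asks for.

```python
import re
from collections import Counter


def most_frequent_words(text, n=3):
    """Return the n most common words in text as (word, count) pairs.

    Hypothetical helper: assumes text is already a decoded (Unicode) string.
    """
    # Replace every run of non-word characters with a single space.
    # In Python 3 this is Unicode-aware, so accented letters are kept.
    cleaned = re.sub(r'\W+', ' ', text.lower())
    # split() with no arguments also drops leading and trailing whitespace.
    return Counter(cleaned.split()).most_common(n)


sample = "Az alma piros. Az alma édes, az őz fut."
print(most_frequent_words(sample, 2))
```

When the text comes from a file, it has to be decoded before the regex sees it, for example with open(path, encoding="utf-8") in Python 3 or io.open(path, encoding="utf-8") in Python 2. Both path and the UTF-8 encoding are assumptions here; the question does not say how the file is encoded.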

solution1
3 ACCEPTED 2017-06-12 13:20:11
