I have a text file that my script is reading to extract the most frequent words. However, during the clean-up of the source text, the script cannot handle accented characters (in this case, they are áéíóöőúüű).
This is what I have at the moment.
str = re.sub(r'\W+', ' ', str)
This simply deletes the accented characters. I have tried adding flags=re.U, but that just messed up the result in a different way. I suspect there is a simple way to solve my problem; I have looked for it without success, so I turn to you. Thanks in advance.
You need to use the right modifier:
str = re.sub(ur'\W+', u' ', s, flags=re.UNICODE)
                               ^^^^^^^^^^^^^^^^
See the Python 2.x docs on re.UNICODE:
Make the \w, \W, \b, \B, \d, \D, \s and \S sequences dependent on the Unicode character properties database. Also enables non-ASCII matching for IGNORECASE.
Here is an online Python 2.7 demo:
import re
s = u"characters (in this case, they are áéíóöőúüű)."
res = re.sub(ur'\W+', u' ', s, flags=re.UNICODE).encode("utf8")
print(res) # => characters in this case they are áéíóöőúüű
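A side note beyond the original answer, in case you ever move off Python 2: in Python 3, patterns built from str are Unicode-aware by default, so \W already treats accented letters as word characters and no flag is needed (re.ASCII is what you would pass to get the old ASCII-only behaviour). A minimal sketch:

```python
import re

# Python 3: str patterns are Unicode-aware by default, so áéíóöőúüű
# count as word characters and survive the \W+ clean-up unchanged.
s = "characters (in this case, they are áéíóöőúüű)."
res = re.sub(r'\W+', ' ', s)
print(res)  # the accented characters are preserved; only punctuation
            # is collapsed to spaces (note the trailing space left by ".")
```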