I found this excellent guide, http://www.regular-expressions.info/unicode.html#category, which gives some hints on how to match non-letters with the following regex:
\P{L}
But this regex also treats the combining accent of à as a non-letter when the character is encoded as U+0061 U+0300 (if I understood correctly). For example, using the regex module in Python, the following snippet:
all_letter_doc = regex.sub(r'\P{L}', ' ', doc)
will transform purè into pur
The guide shows how to match all letters, together with any combining marks that follow them, with the following:
\p{L}\p{M}*+
In practice I need the negation of that, but I do not know how to obtain it.
Since you are using Python 2.x, your r'\P{L}' is a byte string, while the input you have is Unicode. You need to make your pattern a Unicode string. See the PyPI regex reference:
If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it's a bytestring.
Thus, you need to use ur'\P{L}' and a u' ' replacement pattern.
In case you want to match one or more characters other than letters and diacritics, you will need the ur'[^\p{L}\p{M}]+' regex.
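To illustrate what the negated class [^\p{L}\p{M}]+ is meant to do, here is a minimal sketch that emulates it with the standard library only. It is not the regex module itself: it assumes Python 3 (where every str is already Unicode, so no ur prefix is needed) and uses unicodedata.category in place of \p{L} and \p{M}. The helper names is_letter_or_mark and replace_non_letters are made up for this example.

```python
import unicodedata
from itertools import groupby


def is_letter_or_mark(ch):
    # General categories starting with "L" are letters (\p{L});
    # those starting with "M" are combining marks (\p{M}).
    return unicodedata.category(ch)[0] in ("L", "M")


def replace_non_letters(text, replacement=" "):
    # Replace each run of characters that are neither letters nor marks
    # with one replacement string, roughly what
    # regex.sub(r'[^\p{L}\p{M}]+', ' ', text) is expected to do.
    parts = []
    for keep, run in groupby(text, key=is_letter_or_mark):
        parts.append("".join(run) if keep else replacement)
    return "".join(parts)


# "è" written in decomposed form: U+0065 followed by U+0300
decomposed = "pure\u0300, 42 volte!"
cleaned = replace_non_letters(decomposed)
print(cleaned)
```

The combining grave accent (category Mn) is kept next to its base letter, while the punctuation, digits and spaces collapse into single spaces. With \P{L} alone the accent would be replaced as well, which is the problem described in the question.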