Regex match all non letters excluding diacritics (python)

Question

I have found this excellent guide: http://www.regular-expressions.info/unicode.html#category that gives some hints on how to match non letters with the following regex:

\P{L}

But this regex will also treat the combining mark as a non-letter when à is encoded as U+0061 U+0300 (if I understood correctly). For example, using the regex module in Python, the following snippet:

all_letter_doc = regex.sub(r'\P{L}', ' ', doc)

will turn purè into pur (the è is replaced by a space).
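To make the decomposed case concrete, here is a small sketch using only the standard library's unicodedata module. It assumes Python 3 and uses the string "pure" followed by U+0300 as a stand-in for a decomposed purè; the variable names are illustrative. It shows that the accent is a separate code point in category Mn (a mark, not a letter), which is why \P{L} matches it on its own.

```python
import unicodedata

# "purè" written with a decomposed è: 'e' followed by U+0300 COMBINING GRAVE ACCENT
decomposed = "pure\u0300"

# Print code point, general category, and name of every character
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# The four base characters are lowercase letters (Ll); the accent is a mark (Mn)
categories = [unicodedata.category(ch) for ch in decomposed]

# NFC normalization composes e + U+0300 into the single letter U+00E8
composed = unicodedata.normalize("NFC", decomposed)
```

Normalizing to NFC before matching is one possible workaround for characters that have a precomposed form, but it does not help for combinations that exist only in decomposed form.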

The guide also shows how to match all letters with the following:

\p{L}\p{M}*+

In practice I need the negation of that, but I do not know how to obtain it.

Answer 1

Since you are using Python 2.x, your r'\P{L}' is a byte string, while the input you have is Unicode. You need to make your pattern a Unicode string. See the PyPI regex reference:

If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it's a bytestring.

Thus, you need to use ur'\P{L}' as the pattern and u' ' as the replacement string.

If you want to match one or more characters other than letters and diacritics, use the regex ur'[^\p{L}\p{M}]+'.
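For readers who want to check what this character class does without installing anything, here is a rough standard-library sketch of the same idea, assuming Python 3. It does not use the regex module: it approximates [^\p{L}\p{M}]+ by testing each character's general category with unicodedata.category. The helper names is_letter_or_mark and replace_non_letters are hypothetical, and results may differ slightly from regex if the two rely on different Unicode versions.

```python
import unicodedata
from itertools import groupby


def is_letter_or_mark(ch: str) -> bool:
    """True if ch has a general category starting with L (letter) or M (mark)."""
    return unicodedata.category(ch)[0] in ("L", "M")


def replace_non_letters(text: str, replacement: str = " ") -> str:
    """Replace each run of non-letter, non-mark characters with `replacement`."""
    parts = []
    for keep, run in groupby(text, key=is_letter_or_mark):
        # Runs of letters/marks are kept verbatim; any other run collapses
        parts.append("".join(run) if keep else replacement)
    return "".join(parts)


# Decomposed è (e + U+0300) and precomposed é (U+00E9) both survive
doc = "pure\u0300, 123 caf\u00e9!"
cleaned = replace_non_letters(doc)

# With the regex module on Python 3, where str patterns are Unicode,
# the corresponding call would be:
#   regex.sub(r'[^\p{L}\p{M}]+', ' ', doc)
```

Because marks are kept together with letters, the combining accent stays attached to its base letter instead of being replaced by a space.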

solution1
5 ACCEPTED 2016-07-24 21:52:32
