
python regex with unicode to match a city name

I have a city name as a Unicode string and I want to match it with a regex, but the same regex should also validate plain ASCII names such as "New York". I searched a bit and tried the code attached below, but could not get it to work.

I tried the regex "([\u0000-\uFFFF]+)" on http://regex101.com/#python and it works there, but I could not get it working in Python.

Thanks in advance!!

import re

city = u"H\u0101na"
mcity = re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
mcity.group(0)
# Python 2 result: u'H' (only the first letter is matched)

Unlike \x, \u is not a special sequence in the regex syntax of Python 2's re module, so your character group matches a literal backslash, the letter u, and so on.

To refer to non-ASCII characters in a regex, you have to include them as raw characters in a Unicode string, for example:

mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)

(If you don't want to double-backslash the \s, you could also use a ur string, in which \u still works as an escape but other escapes such as \x don't. This is a bit confusing, though.)
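To make the difference concrete, here is a small sketch that only inspects the two pattern strings, without running any regex. The variable names are illustrative. Note one assumption about versions: on Python 3.3 and later the re module interprets \uXXXX itself, so the raw pattern behaves differently there than in the Python 2 session above; the string lengths shown here are the same either way.

```python
# Raw string: the backslash, the "u" and the hex digits reach re as-is.
raw_pattern = r"[\u0000-\uFFFF]"

# Non-raw Unicode string: Python expands each \uXXXX into a single
# character before the re module ever sees the pattern.
cooked_pattern = u"[\u0000-\uFFFF]"

print(len(raw_pattern))     # 15: "[", 6 chars, "-", 6 chars, "]"
print(len(cooked_pattern))  # 5: "[", U+0000, "-", U+FFFF, "]"
```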

This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s, and indeed the whole Basic Multilingual Plane, including control characters. On a narrow build of Python (including Windows Python 2 builds), where characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+ would be a simpler way of putting it.)
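A quick way to see how little this class filters is to feed it a string of control characters. This is a minimal sketch assuming Python 3.4 or later (for fullmatch); the names BMP_CLASS and accepted are made up for illustration.

```python
import re

# Non-raw string: Python expands \u0000 and \uFFFF before re sees the pattern.
# The \\s is double-backslashed so that re still receives \s.
BMP_CLASS = re.compile(u"[\u0000-\uFFFFA-Za-z\\s]+")

def accepted(text):
    """Return True if the whole of `text` is matched by the character class."""
    return BMP_CLASS.fullmatch(text) is not None

# An ordinary name, a name with a non-ASCII letter, and pure control characters.
for sample in (u"New York", u"H\u0101na", u"\x07\x1b\x00"):
    print(repr(sample), accepted(sample))
```

All three samples are accepted, including the one made up entirely of control characters; only the empty string is rejected.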

Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.
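That looser check could be sketched roughly as follows. The function name clean_city_name is hypothetical, and treating "control characters" as Unicode category Cc is an assumption about what should be stripped, not something the answer specifies.

```python
import unicodedata

def clean_city_name(raw):
    """Drop control characters and surrounding whitespace.

    Returns the cleaned name, or None if nothing is left.
    """
    # Category "Cc" covers the C0/C1 control characters (including tab and newline).
    without_controls = u"".join(
        ch for ch in raw if unicodedata.category(ch) != "Cc"
    )
    trimmed = without_controls.strip()
    return trimmed or None

print(clean_city_name(u"  New York \n"))
print(clean_city_name(u"H\u0101na"))
print(clean_city_name(u"\t \x00"))
```

Because tab and newline are themselves control characters, they are removed rather than turned into spaces; a stricter or more lenient policy would need a different rule.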
