I have a city name in Unicode and I want to match it with a regex, but the same pattern should also validate a plain string such as "New York". I searched a bit and tried the code below, but could not figure out how to do it.
I tried the regex "([\ -\]+)" on http://regex101.com/#python and it works there, but I could not get it working in Python.
Thanks in advance!!
>>> import re
>>> city = u"H\u0101na"
>>> mcity = re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
>>> mcity.group(0)
u'H'
mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
Unlike \x, \u is not a special sequence in Python 2's regex syntax, so your character group matches a literal backslash, letter U, and so on.
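To see why the raw-string pattern never contains the character U+0101, it may help to compare the two kinds of literal directly. This is only a small sketch with illustrative variable names; it runs on Python 2.7 and on Python 3.3+ (where, as a side note, the re module does interpret \uXXXX itself).

```python
raw_pattern = r"\u0101"      # raw literal: backslash, 'u', '0', '1', '0', '1'
unicode_pattern = u"\u0101"  # Unicode literal: the single character U+0101

raw_length = len(raw_pattern)          # 6
unicode_length = len(unicode_pattern)  # 1
```

In the raw literal the escape is left for the regex engine to interpret; in the Unicode literal Python has already replaced it before the pattern is compiled.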
To refer to non-ASCII in a regex you have to include them as raw characters in a Unicode string, for example as:
mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)
(If you don't want to double-backslash the \s, you could also use a ur string, in which \u still works as an escape but the other escapes like \x don't. This is a bit confusing though.)
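As a quick check of the non-raw Unicode pattern, the following sketch runs it against the city from the question. The names pattern and matched are illustrative; the u prefix keeps the snippet valid on both Python 2.7 and Python 3.3+.

```python
import re

city = u"H\u0101na"

# Non-raw Unicode literal: Python converts \u0000 and \uFFFF into real
# characters before the regex is compiled; \\s reaches the engine as \s.
pattern = u"([\u0000-\uFFFFA-Za-z\\s]+)"

match = re.search(pattern, city, re.U)
matched = match.group(0)  # the whole city name, not just u'H'
```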
This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s, and indeed the whole Basic Multilingual Plane including control characters. On a narrow build of Python (including Windows Python 2 builds), where the characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+ would be a simpler way of putting it.)
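The redundancy can be checked with a rough sketch like the one below, which compares the full character group against the bare U+0000 to U+FFFF range on a few BMP-only samples. The sample list and variable names are made up for illustration.

```python
import re

full_group = re.compile(u"^[\u0000-\uFFFFA-Za-z\\s]+$", re.U)
bmp_range = re.compile(u"^[\u0000-\uFFFF]+$", re.U)

# Ordinary names, digits, and even raw control characters (BEL, ESC).
samples = [u"New York", u"H\u0101na", u"12345", u"\x07\x1b"]

results = [
    (bool(full_group.match(s)), bool(bmp_range.match(s))) for s in samples
]
# Both patterns accept every sample, control characters included.
```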
Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.
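One way that looser check might be sketched is shown below. The helper names clean_city_name and is_valid_city_name are hypothetical, and treating "control characters" as Unicode category Cc is an assumption; note that Cc includes tab and newline, so those are dropped as well. Input is assumed to be a Unicode string.

```python
import unicodedata


def clean_city_name(raw):
    """Hypothetical helper: drop control characters, then trim whitespace."""
    # Assumption: "control characters" means Unicode general category Cc.
    kept = u"".join(ch for ch in raw if unicodedata.category(ch) != "Cc")
    return kept.strip()


def is_valid_city_name(raw):
    """Accept any name that is non-empty after cleaning."""
    return clean_city_name(raw) != u""
```

With this sketch, u"New York" and u"H\u0101na" are accepted, while a string made up only of whitespace and control characters is rejected.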