
Python raw strings and unicode : how to use Web input as regexp patterns?

EDIT : This question doesn't really make sense once you have picked up what the "r" flag means. More details here. For people looking for a quick answer, I added one below.

If I enter a regexp manually in a Python script, I can use 4 combinations of flags for my pattern strings :

  • p1 = "pattern"
  • p2 = u"pattern"
  • p3 = r"pattern"
  • p4 = ru"pattern"

I have a bunch of Unicode strings coming from a Web form input and want to use them as regexp patterns.

I want to know what processing I should apply to the strings so that I can expect the same results as with the manual forms above. Something like :

import re
assert re.match(p1, some_text) == re.match(someProcess1(web_input), some_text)
assert re.match(p2, some_text) == re.match(someProcess2(web_input), some_text)
assert re.match(p3, some_text) == re.match(someProcess3(web_input), some_text)
assert re.match(p4, some_text) == re.match(someProcess4(web_input), some_text)

What would be someProcess1 to someProcessN and why ?

I suppose that someProcess2 doesn't need to do anything while someProcess1 should do some unicode conversion to the local encoding. For the raw string literals, I am clueless.

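Apart from possibly having to encode or decode Unicode properly (in Python 2.*), no processing is needed, because there is no specific type for "raw strings": it is just a syntax for literals, i.e. for string constants, and the strings in question do not come from string constants in your code, so there is nothing to "process".

The "r" prefix just prevents Python from interpreting "\" as an escape character in a string literal. Since the Web doesn't care what kind of data it carries, your web input will be a bunch of bytes that you are free to interpret the way you want.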

So to address this problem :

  • be sure you use Unicode (e.g. decoded from UTF-8) all along the way
  • when you get the string, it will be Unicode, and "\n", "\t" and "\a" will be literal two-character sequences (a backslash followed by a letter), so you don't need to worry about whether to escape them or not.
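As a minimal sketch of the second point (written for Python 3, where str is already Unicode; the variable names are made up for illustration), text that arrives at runtime can be handed to re as is:

```python
import re

# Simulate form input: the user typed the seven characters  \d+\n\t
# chr(92) is a backslash; building the string this way sidesteps any
# literal-escaping question inside this example itself.
backslash = chr(92)
web_input = backslash + "d+" + backslash + "n" + backslash + "t"

# The backslashes are ordinary characters in the received text.
assert len(web_input) == 7

# It is the same value a programmer would get from a raw literal.
assert web_input == r"\d+\n\t"

# The re module itself interprets \d, \n and \t inside the pattern.
match = re.match(web_input, "2024\n\tend")
assert match.group() == "2024\n\t"
```

Nothing resembling a "raw conversion" happens here: escape processing of literals is done by the Python parser, and runtime input never goes through the parser.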

Note the following in your first example:

>>> p1 = "pattern"
>>> p2 = u"pattern"
>>> p3 = r"pattern"
>>> p4 = ur"pattern" # it's ur"", not ru"" btw
>>> p1 == p2 == p3 == p4
True

While these constructs look different, they all do the same thing: they create a string object (p1 and p3 a str, p2 and p4 a unicode object in Python 2.x) containing the value "pattern". The u, r and ur prefixes just tell the parser how to interpret the quoted string that follows, namely as Unicode text (u) and/or as raw text (r), in which backslash escape sequences are not interpreted. In the end it doesn't matter how a string was created, whether from a raw literal or not; internally it is stored the same way.
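The literal forms only start to differ once the quoted text contains a backslash. The following sketch illustrates this, assuming Python 3 (where the u prefix is accepted but redundant and the ur combination is no longer valid syntax); the sample text is invented for the example:

```python
import re

# Without backslashes, the prefix makes no difference to the value.
assert "pattern" == r"pattern"

plain = "\bcat\b"    # the parser turns each \b into a backspace (0x08)
raw = r"\bcat\b"     # the parser keeps the backslash and the letter b

assert len(plain) == 5
assert len(raw) == 7
assert plain != raw

text = "a cat sat"
plain_match = re.search(plain, text)
raw_match = re.search(raw, text)

assert plain_match is None          # text contains no backspace characters
assert raw_match.group() == "cat"   # \b reached re as a word boundary
```

Both values are ordinary strings; the only difference is which characters ended up in them after parsing.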

When you receive text as input in Python 2.x, you have to distinguish whether it is a unicode object or a str object. If you want to work with Unicode content, you should work only with unicode objects internally and convert all str objects to unicode (either with str.decode() or with the u'text' syntax for hard-coded text). If you instead encode it to your local encoding, you will run into problems with characters that encoding cannot represent.
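A rough sketch of that normalisation step follows. It is written for Python 3, where the Python 2 pair str/unicode corresponds to bytes/str; the helper name to_text and the UTF-8 default are assumptions made for this example, not part of any framework:

```python
import re

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return Unicode text for bytes or text input."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# UTF-8 bytes for the pattern: na + U+00EF + ve\s+\w+
pattern_bytes = b"na\xc3\xafve\\s+\\w+"
pattern_text = to_text(pattern_bytes)

# Input that is already text passes through unchanged.
assert to_text(pattern_text) == pattern_text

# The decoded pattern matches non-ASCII text directly.
match = re.match(pattern_text, "na\u00efve caf\u00e9")
assert match.group() == "na\u00efve caf\u00e9"
```

Decoding once at the boundary and keeping text everywhere else avoids mixing the two types later on.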

A different approach would be to use Python 3, whose str type supports Unicode directly and stores all text as Unicode, so you simply don't need to care about the encoding once the input has been decoded.
