What exactly do "u" and "r" string prefixes do, and what are raw string literals?

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does, since I understand what Unicode is.

  • But what does r'' do exactly? What kind of string does it result in?

  • And above all, what the heck does ur'' do?

  • Finally, is there any reliable way to go back from a Unicode string to a simple raw string?

  • Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
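A small sketch of that difference, assuming Python 3 (the backslash rule is the same in Python 2); the variable names are only illustrative:

```python
# The same two source characters, backslash + n, in both kinds of literal.
normal = '\n'    # escape sequence: one newline character
raw = r'\n'      # raw literal: a backslash followed by the letter n

print(len(normal))  # 1
print(len(raw))     # 2

# Doubling the backslash in a normal literal gives the same value as the raw one.
doubled = '\\n'
print(raw == doubled)  # True
```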

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
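To illustrate the regular expression case, here is a minimal sketch (Python 3 assumed; the pattern and the sample text are made up for the example):

```python
import re

# Word boundary, one or more digits, word boundary.
raw_pattern = r'\b\d+\b'
doubled_pattern = '\\b\\d+\\b'   # same value, written without the r prefix

# Both literals produce the identical pattern string.
same = raw_pattern == doubled_pattern

numbers = re.findall(raw_pattern, 'order 66 shipped 3 items')
print(same, numbers)  # True ['66', '3']
```

The raw form is simply easier to read; the regex engine receives the same string either way.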

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by "going back" - there are no intrinsic back and forward directions, because there's no raw string type; it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).

There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front, you get the old str type, which stores 8-bit characters; with the u in front, you get the newer unicode type, which can store any Unicode character.

The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.

ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
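A quick check that r leaves the type alone. This sketch assumes Python 3, where text literals are '...' and byte literals are b'...'; the combined raw form for bytes there is rb'...' (Python 3 does not accept ur'...'):

```python
text_plain, text_raw = 'a\\b', r'a\b'
bytes_plain, bytes_raw = b'a\\b', rb'a\b'

# Same type and same value with or without the r prefix.
print(type(text_plain) is type(text_raw), text_plain == text_raw)      # True True
print(type(bytes_plain) is type(bytes_raw), bytes_plain == bytes_raw)  # True True
```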

You can try to convert a Unicode string to an old string using the str() function, but if it contains any characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course those characters would then be unreadable. Using the str type is not recommended if you want to handle Unicode characters correctly.
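The answer above describes Python 2's str(). As a rough Python 3 analogue (an assumption of this sketch, not the code from the answer), encoding text to ASCII bytes shows both the exception and the question-mark replacement:

```python
text = 'caf\u00e9'   # 'café', written with an escape so the source stays ASCII

try:
    text.encode('ascii')          # strict error handling is the default
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors='replace' substitutes '?' for every character ASCII cannot represent.
replaced = text.encode('ascii', errors='replace')
print(strict_ok, replaced)  # False b'caf?'
```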

'raw string' means it is stored as it appears. For example, a \ is just a backslash instead of an escape character.

A "u" prefix denotes the value has type unicode rather than str.

Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
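One way to see this without crashing the script is to compile the literal's source text at run time. A sketch, assuming Python 3; is_valid_literal is a hypothetical helper written for this example:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

single = is_valid_literal('r"\\"')     # source text is r"\"   (unterminated)
double = is_valid_literal('r"\\\\"')   # source text is r"\\"  (two backslashes)
print(single, double)  # False True
```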

"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.

You can have unicode raw string literals:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:

Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.

Let me explain it simply: in Python 2, you can store a string in 2 different types.

The first one is the byte string, which is the str type in Python 2. It uses 1 byte of memory per character (256 possible values, so it mostly stores English letters and simple symbols; ASCII itself covers only the first 128).

The 2nd type is Unicode, which is the unicode type in Python 2. Unicode can store characters from all languages.

By default, Python 2 will prefer the str type, but if you want to store a string in the unicode type you can put u in front of the text, like u'text', or you can call unicode('text').

So u is, in effect, a short way to get a unicode object instead of a str (it is part of the literal syntax rather than an actual function call). That's it!

Now the r part: you put it in front of the text to tell the computer that the text is raw, so a backslash should not be treated as an escape character. r'\n' will not create a newline character. It's just plain text containing 2 characters.

If you want a unicode string that also holds raw text, use ur, because ru will raise an error.

NOW, the important part:

You cannot store a single backslash by using r; it's the only exception. So this code will produce an error: r'\'

To store a backslash (only one) you need to use '\\'

If you want to store more than 1 character you can still use r: r'\\' will produce 2 backslashes, as you would expect.

I don't know why r was designed so that it cannot store a single backslash. It is not a bug, though: even in a raw literal, a backslash still pairs with the character after it when the end of the literal is located, so r'\' never reaches a closing quote.
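In practice the limitation is easy to work around. A short sketch, assuming Python 3 (the path is a made-up example):

```python
one = '\\'                   # normal literal: a single backslash
two = r'\\'                  # raw literal: two backslashes
folder = r'C:\temp' + '\\'   # raw literal plus a normal one for the trailing backslash

print(len(one), len(two), folder)  # 1 2 C:\temp\
```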

Unicode string literals

Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.

Raw string literals
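For instance, on Python 3.3 or later (an assumption of this sketch; the prefix was not accepted in Python 3.0 to 3.2), the prefix changes nothing:

```python
plain = 'ciao'
prefixed = u'ciao'   # accepted only for Python 2 compatibility

print(type(prefixed) is str)  # True
print(plain == prefixed)      # True
```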

If you want to create a string literal consisting of only easily typable characters like English letters or numbers, you can simply type them: 'hello world'. But if you also want to include some more exotic characters, you'll have to use a workaround. One of the workarounds is escape sequences. This way you can, for example, represent a new line in your string simply by adding two easily typable characters, \n, to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!

On the other hand, there are some situations when you want to create a string literal that contains escape sequences but you don't want them to be interpreted by Python. You want them to be raw. Look at these examples:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.

Raw string literals are not completely "raw"?
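Here is a minimal sketch of what goes wrong without the prefix, assuming Python 3. It uses a shorter, made-up path, because a sequence such as \u inside a normal Python 3 literal would not even compile:

```python
normal = 'c:\temp\new'   # \t becomes a tab, \n becomes a newline
raw = r'c:\temp\new'     # every backslash is kept

print(len(normal), len(raw))  # 9 11
print('\t' in normal)         # True: the path was silently altered
print(raw)                    # c:\temp\new
```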

Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.

From the lexical definition of a string literal:

string     ::=  "'" stringitem* "'"
stringitem ::=  stringchar | escapeseq
stringchar ::=  <any source character except "\" or newline or the quote>
escapeseq  ::=  "\" <any source character>

It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.

Maybe this is obvious, maybe not, but you can make the string '\\' by calling x=chr(92)
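The same grammar explains what happens to a backslash-quote pair inside a raw literal: it counts as one escapeseq, so the quote does not end the literal, yet the backslash is left in the value. A sketch, assuming Python 3:

```python
raw_quoted = r'it\'s'     # the \' pair does not end the literal; the backslash stays
normal_quoted = 'it\'s'   # here the escape is interpreted and the backslash disappears

print(raw_quoted, len(raw_quoted))        # it\'s 5
print(normal_quoted, len(normal_quoted))  # it's 4
```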

x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y   # True
x is y # False
