How to replace a double backslash with a single backslash in Python?

I have a string. In that string are double backslashes. I want to replace the double backslashes with single backslashes, so that Unicode character codes can be parsed correctly.

(Pdb) p fetched_page
'<p style="text-align:center;" align="center"><strong><span style="font-family:\'Times New Roman\', serif;font-size:115%;">Chapter 0<\\/span><\\/strong><\\/p>\n<p><span style="font-family:\'Times New Roman\', serif;font-size:115%;">Chapter 0 in \\u201cDreaming in Code\\u201d give a brief description of programming in its early years and how and why programmers are still struggling today...'

Inside this string, you can see escaped Unicode character codes, such as:

\\u201c

I want to turn this into:

\u201c

Attempt 1:

fetched_page.replace('\\\\', '\\')

But this doesn't work: it searches for quadruple backslashes.

Attempt 2:

fetched_page.replace('\\', '\')

But this results in an end-of-line error.

Attempt 3:

fetched_page.decode('string_escape')

But this had no effect on the text. All the double backslashes remained as double backslashes.

You can try codecs.escape_decode, which should decode the escape sequences.

Python 3:

>>> b'\\u201c'.decode('unicode_escape')
'“'

or

>>> '\\u201c'.encode().decode('unicode_escape')
'“'
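One caveat with the second form: unicode_escape reads the bytes as Latin-1, so a string that already contains non-ASCII characters gets mangled when it is first encoded as UTF-8. A minimal sketch of a commonly used workaround follows; the helper name decode_escapes is made up for illustration, and it assumes every backslash sequence in the input is a valid Python escape.

```python
def decode_escapes(text):
    """Decode backslash escapes in a str (hypothetical helper, Python 3).

    Encoding with latin-1 plus 'backslashreplace' turns any character
    outside Latin-1 into a \\uXXXX escape, so the unicode_escape step
    restores it instead of corrupting it.
    """
    raw = text.encode('latin-1', 'backslashreplace')
    return raw.decode('unicode_escape')


sample = 'caf\u00e9 \\u201cDreaming in Code\\u201d'

# Plain UTF-8 round trip: escapes are decoded, but the e-acute is corrupted.
naive = sample.encode().decode('unicode_escape')

# Latin-1 round trip: escapes are decoded and existing characters survive.
safe = decode_escapes(sample)

print(naive)
print(safe)
```

The naive result starts with 'cafÃ©', while the helper keeps 'café' intact.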

I'm not getting the behaviour you describe:

>>> x = "\\\\\\\\"
>>> print x
\\\\
>>> y = x.replace('\\\\', '\\')
>>> print y
\\

When you see '\\\\\\\\' in your output, you're seeing twice as many backslashes as there are in the string, because each one is escaped. The code you wrote should work fine. Try printing out the actual values, instead of only looking at how the REPL displays them.
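The difference between what the REPL shows and what the string holds can be checked directly. This is a small sketch in Python 3 syntax; the sample text is shortened from the question and the variable names are only illustrative.

```python
# In source code, '\\' is an escape for ONE backslash character.
s = 'in \\u201cDreaming in Code\\u201d'

print(repr(s))  # REPL / pdb `p` view: every backslash is shown doubled
print(s)        # actual contents: one backslash before each 'u'

# Counting confirms there are only two backslashes in the whole string.
backslash_count = s.count('\\')

# A string that really does contain two consecutive backslashes.
doubled = 'a' + '\\' * 2 + 'b'
fixed = doubled.replace('\\\\', '\\')  # collapses the pair to one

print(len(doubled), len(fixed))
```

If the count matches the number of escape sequences, the string never contained double backslashes and replace has nothing to do.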

To expand on Jeremy's answer, your problem is that '\' is an illegal string, because \' escapes the quote, so your string never terminates.
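For reference, here is a short sketch of ways to write one or two backslashes in Python source without hitting that error. The variable names and the sample path are illustrative only.

```python
single = '\\'        # escaped form: exactly one backslash
also_single = chr(92)  # same character, built from its code point
raw_double = r'\\'   # raw string: two backslashes, no escaping applied

# Text that genuinely contains a doubled backslash (C:\\temp when printed).
text = 'C:\\\\temp'
collapsed = text.replace('\\\\', '\\')  # search for two, replace with one

print(text)
print(collapsed)
```

Note that a raw string cannot end in an odd number of backslashes either, so r'\' fails for the same reason as '\'.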

It may be slightly overkill, but...

>>> import re
>>> a = '\\u201c\\u3012'
>>> re.sub(r'\\u[0-9a-fA-F]{4}', lambda x:eval('"' + x.group() + '"'), a)
'“〒'
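The same substitution can be done without eval, which is worth avoiding on fetched text. This is a sketch of an alternative, not the answerer's code: the function name replace_u_escapes is hypothetical, it handles only four-digit \u escapes, and it does not combine surrogate pairs.

```python
import re

# Matches a backslash, 'u', and exactly four hex digits (captured).
U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def replace_u_escapes(text):
    """Replace each \\uXXXX sequence with the character it names."""
    return U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


a = '\\u201c\\u3012'
result = replace_u_escapes(a)
print(result)
```

Because it only converts hex digits with int and chr, nothing in the input is ever executed as code.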

So yeah, the simplest solution would be ms4py's answer: calling codecs.escape_decode on the string and taking the result (or the first element of the result, if escape_decode returns a tuple, as it seems to in Python 3). In Python 3 you'd want to use codecs.unicode_escape_decode when working with strings (as opposed to bytes objects), though.
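A rough sketch of both calls in Python 3 follows. It assumes the text is pure ASCII apart from the escapes, and note that codecs.escape_decode is not listed in the official codecs documentation, so treat it as an implementation detail. Both functions return a (decoded, length_consumed) tuple.

```python
import codecs

# Byte-oriented escapes such as \n or \t: escape_decode returns bytes.
raw = b'line1\\nline2'
decoded_bytes = codecs.escape_decode(raw)[0]

# \uXXXX escapes: unicode_escape_decode returns a str.
text = '\\u201cDreaming in Code\\u201d'
decoded_text = codecs.unicode_escape_decode(text.encode('ascii'))[0]

print(decoded_bytes)
print(decoded_text)
```

The documented equivalent of the second call is codecs.decode(data, 'unicode_escape'), which returns the decoded string directly rather than a tuple.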

Interesting question, but in reality you have only one backslash character. That is just how Python represents it. If you make a list of the characters the string contains, like:

[s for s in string_object]

it shows every character and represents "\" as "\\", but you don't have to be confused by that. It is actually a single character. So, in the case of my example, it's just not a double backslash.

Real example:

>>> [s for s in 'usnDu\\NgAnA{I']
['u', 's', 'n', 'D', 'u', '\\', 'N', 'g', 'A', 'n', 'A', '{', 'I']

Just print it:

>>> a = '\\u201c'
>>> print a
\u201c
