How do I unescape characters in Python 3.6?

I'm a little confused about how to unescape characters in Python. I am parsing some HTML using BeautifulSoup, and when I retrieve the text content it looks like this:

\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support

I'd like it to look like this:

State-of-the-art security and 100% uptime SLA. Outstanding support

Here is my code:

    self.__page = requests.get(url)
    self.__soup = BeautifulSoup(self.__page.content, "lxml")
    self.__page_cleaned = self.__removeTags(self.__page.content) #remove script and style tags
    self.__tree = html.fromstring(self.__page_cleaned) #contains the page html in a tree structure
    page_data = {}
    page_data["content"] =  self.__tree.text_content()

How do I remove those encoded backslashed characters? I've looked everywhere and nothing has worked for me.

You can convert those escape sequences to proper text using the codecs module.

import codecs

s = r'\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'

# Convert the escape sequences
z = codecs.decode(s, 'unicode-escape')
print(z)
print('- ' * 20)

# Remove the extra whitespace
print(' '.join(z.split()))       

Output

    [several blank lines here]
 



State-of-the-art security and 100% uptime SLA. 



Outstanding support
- - - - - - - - - - - - - - - - - - - - 
State-of-the-art security and 100% uptime SLA. Outstanding support
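The whitespace step can also be tried on its own. This is a minimal sketch under an assumption the question does not confirm: that the text returned by the parser already holds a real non-breaking space and real line breaks, and the `\u00a0` and `\n` only appear because the string was viewed through its repr(). The variable names `text` and `cleaned` are illustrative.

```python
# Assumption: `text` stands in for what text_content() returns, i.e. it already
# holds a real U+00A0 and real line breaks, not literal backslash sequences.
text = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'

# str.split() with no arguments splits on any Unicode whitespace,
# which includes U+00A0 as well as \r and \n.
cleaned = ' '.join(text.split())
print(cleaned)
```

If that assumption holds, no decoding is needed at all; if the string really contains literal backslashes, the codecs approach above is still required first.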

The codecs.decode(s, 'unicode-escape') function is quite versatile. It can handle simple backslash escapes, like the newline and carriage return sequences (\n and \r), but its main strength is handling Unicode escape sequences, like the \u00a0, which is just a non-breaking space character. If your data had other Unicode escapes in it, like those for non-ASCII letters or emoji, it would handle them too.
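As a small illustration of that versatility, the sketch below decodes a few other escape forms in one call. The sample string and the names `escaped` and `decoded` are made up for the example; the input is deliberately ASCII-only.

```python
import codecs

# ASCII-only input containing literal backslash escapes (note the raw string):
# a \u escape, a \t escape, a \x escape and a \U escape.
escaped = r'caf\u00e9\tna\xefve \U0001F600'

decoded = codecs.decode(escaped, 'unicode-escape')
print(decoded)
```

Each escape becomes the single character it names, so the decoded string is shorter than the raw input.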


As Evpok mentions in a comment, this won't work if the text string contains actual non-ASCII Unicode characters as well as Unicode \u or \U escape sequences.

From the codecs docs:

unicode_escape

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

Also see the docs for codecs.decode.
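The caveat about mixing real Unicode characters with escape sequences can be sketched as follows. The workaround shown (encoding with Latin-1 and the `backslashreplace` error handler before decoding) is a commonly used trick, not something taken from this answer, and the sample string is invented for the example.

```python
import codecs

# A real 'é' and a real CJK character, followed by a literal backslash + 'n'.
mixed = 'caf\u00e9 \u65e5\\n'

# Naive decoding: the non-ASCII characters come out damaged, because the
# codec reads the string's UTF-8 bytes as if they were Latin-1.
naive = codecs.decode(mixed, 'unicode-escape')

# Workaround (assumption: every backslash in the text starts a valid escape).
# Latin-1 characters are kept as single bytes, anything else is turned into
# a \uXXXX escape, and unicode-escape then restores both.
fixed = mixed.encode('latin-1', 'backslashreplace').decode('unicode-escape')

print(repr(naive))
print(repr(fixed))
```

The workaround still interprets every backslash in the input, so it is only appropriate when stray backslashes are not expected in the data.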

You could use regular expressions:

import re

s = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'
s = ' '.join(re.findall(r"[\w%\-.']+", s))

print(s) #output: State-of-the-art security and 100% uptime SLA. Outstanding support

re.findall("exp", s) returns a list of all substrings of s that match the pattern "exp". In the case of "[\w]+", that is every run of letters, digits, or underscores (so no escaped character like "\u00a0"):

['State', 'of', 'the', 'art', 'security', 'and', '100', 'uptime', 'SLA', 'Outstanding', 'support'] 

You can include additional characters by adding them to the expression like so:

re.findall(r"[\w%.\-']+", s)    # added "%", "." and "-" ("-" needs to be escaped with "\")

' '.join(...) returns a string of all elements of the list it is given, separated by the string in the quotes (in this case a space).
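The pieces described above can be put side by side in one short sketch. The variable names are illustrative, and the last part only demonstrates why the hyphen has to be escaped inside the character class: left bare between "." and "'", it is read as a range whose end comes before its start.

```python
import re

s = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'

# Word characters only: punctuation and the non-breaking space are dropped.
words_only = re.findall(r"\w+", s)

# Character class extended with "%", ".", an escaped "-" and "'".
with_punct = re.findall(r"[\w%.\-']+", s)
print(' '.join(with_punct))

# An unescaped "-" between "." and "'" is parsed as a (reversed) range.
try:
    re.compile(r"[\w%.-']+")
    unescaped_ok = True
except re.error:
    unescaped_ok = False
print(unescaped_ok)
```

Placing the hyphen first or last in the class would be another way to avoid the range interpretation.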
