How do I convert characters in Python 3.6?

Question

I'm confused about how to convert characters in Python. I'm parsing some HTML with BeautifulSoup, and when I retrieve the text content it looks like this:

\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support

I would like it to look like this:

State-of-the-art security and 100% uptime SLA. Outstanding support

Here is my code:

    self.__page = requests.get(url)
    self.__soup = BeautifulSoup(self.__page.content, "lxml")
    self.__page_cleaned = self.__removeTags(self.__page.content) #remove script and style tags
    self.__tree = html.fromstring(self.__page_cleaned) #contains the page html in a tree structure
    page_data = {}
    page_data["content"] =  self.__tree.text_content()

How do I remove those encoded backslash characters? I've looked everywhere, and nothing has worked for me.

Answer 1

You can use the codecs module to convert those escape sequences into proper text.

import codecs

s = r'\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'

# Convert the escape sequences
z = codecs.decode(s, 'unicode-escape')
print(z)
print('- ' * 20)

# Remove the extra whitespace
print(' '.join(z.split()))

Output

    [several blank lines here]
 



State-of-the-art security and 100% uptime SLA. 



Outstanding support
- - - - - - - - - - - - - - - - - - - - 
State-of-the-art security and 100% uptime SLA. Outstanding support

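The codecs.decode(s, 'unicode-escape') function is very versatile. It can handle simple backslash escapes, such as the newline and carriage return sequences (\n and \r), but its main strength is handling Unicode escape sequences such as \u00a0, which is just a non-breaking space character. If your data contains other Unicode escapes, such as those for foreign alphabetic characters or emoji, it will handle them too.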
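To make the distinction concrete, here is a minimal sketch of what the decode step changes. The sample string is a shortened, hypothetical version of the text in the question; note that it is a raw string, so it holds literal backslashes rather than real control characters.

```python
import codecs

# Raw string: the backslashes are literal characters, not escapes.
escaped = r'\u00a0\r\nOutstanding'

# Decoding turns each escape sequence into the single character it names.
decoded = codecs.decode(escaped, 'unicode-escape')

print(repr(escaped))
print(repr(decoded))
print(len(escaped), len(decoded))
```

If repr() of your own text already shows single characters such as '\xa0', the text does not contain escape sequences at all, and only the whitespace cleanup step is needed.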

As Evpok mentions in the comments, this won't work if the text string contains actual Unicode characters as well as Unicode \u or \U escape sequences.
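The limitation can be reproduced with a short sketch. The sample string below is hypothetical, and the workaround shown (encoding to Latin-1 with the backslashreplace error handler before decoding) is a commonly used idiom, not something taken from this answer; treat it as one possible approach.

```python
import codecs

# Hypothetical sample: a real non-ASCII character ('é') next to a literal \u escape.
mixed = 'café' + r'\u00a0' + 'bar'

# Direct decoding mangles the real non-ASCII character, because the string
# is first converted to UTF-8 bytes and those bytes are then read as Latin-1.
naive = codecs.decode(mixed, 'unicode-escape')

# Possible workaround: encode to Latin-1 first. Characters outside Latin-1
# become backslash escapes, which the unicode-escape decoder turns back.
fixed = mixed.encode('latin-1', 'backslashreplace').decode('unicode-escape')

print(repr(naive))
print(repr(fixed))
```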

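From the codecs docs:

unicode_escape

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

Also see the docs for codecs.decode.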

Answer 2

You can use a regular expression:

import re

s = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'
s = ' '.join(re.findall(r"[\w%\-.']+", s))

print(s) #output: State-of-the-art security and 100% uptime SLA. Outstanding support

re.findall("exp", s) returns a list of all substrings of s that match the pattern "exp". In the case of "[\w]+", that means all runs of letters or digits (and no characters such as "\u00a0"):

['State', 'of', 'the', 'art', 'security', 'and', '100', 'uptime', 'SLA', 'Outstanding', 'support']

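You can include further characters by adding them to the expression, like this:

re.findall(r"[\w%\-.']+", s)    # added "%", "-", "." and "'" ("-" needs to be escaped with "\")

' '.join(s) returns a single string made of all the elements of the list, separated by the string in the quotes (in this case, a space).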
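Note that the character class above works as an allow-list, so any character not listed in it (a comma, for example) is silently dropped. If the goal is only to collapse whitespace, a pattern that targets the whitespace itself may be a safer starting point. The helper name normalize_whitespace below is made up for illustration; the sketch relies on \s matching Unicode whitespace, including the non-breaking space, for str patterns in Python 3.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

s = '\u00a0\n\n\n\r\nState-of-the-art security and 100% uptime SLA.\u00a0\r\n\n\n\r\nOutstanding support'
print(normalize_whitespace(s))
```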
