BeautifulSoup prettify encoding non-English (Cyrillic) characters strangely
I have HTML with Cyrillic characters. I am using BeautifulSoup4 to process it. It works great, but when I call prettify, it converts all the Cyrillic characters to something else. Here is a dummy example using Python 3:
from bs4 import BeautifulSoup
hello = '<span>Привет, мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("Before prettify:\n{}".format(soup))
soup = soup.prettify(formatter='html')
print("\nafter prettify:\n{}".format(soup))
Here is the output it generates:
Before prettify:
<span>Привет, мир</span>
after prettify:
<span>
Привет, мир
</span>
It's formatting the HTML properly (putting the tags on their own lines), but it's converting the Cyrillic characters to something else (I'm not even certain what encoding that is, to be honest).
I have tried various things to prevent this: prettify(encoding=None, formatter='html') and prettify(encoding='utf-8', formatter='html'). I have also tried changing the way I create the soup object: soup = BeautifulSoup(hello.encode('utf-8'), 'html.parser') and soup = BeautifulSoup(hello, 'html.parser', from_encoding='utf-8'). Nothing seems to change what happens to the Cyrillic characters during prettify.
I figure this must be a very simple mistake I am making with encoding parameters somewhere, but after searching the internet and the BS4 documentation, I am unable to figure this out. Is there a way to use BeautifulSoup's prettify but keep the Cyrillic characters as they were originally, or is this not possible?
EDIT: I have realized now (thanks to DYZ's answer) that removing formatter='html' from the call to prettify stops BeautifulSoup from converting the Cyrillic chars. Unfortunately, this also removes any &nbsp; chars in the document. After having a look at BS4's output-formatters documentation, it seems the solution is likely to create a custom formatter using BS's Formatter class and pass it in the call to prettify: soup.prettify(formatter=my_formatter). I'm not sure yet what that would entail, though.
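One possible shape for such a custom formatter is sketched below. It assumes Beautiful Soup 4.8 or later, where bs4.formatter.HTMLFormatter accepts an entity_substitution function. The function name keep_unicode_escape_nbsp and the sample markup are hypothetical, chosen only for illustration. The idea is to escape just the characters that must be escaped, then turn non-breaking spaces back into &nbsp;, leaving Cyrillic text untouched.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution
from bs4.formatter import HTMLFormatter


def keep_unicode_escape_nbsp(text):
    """Hypothetical substitution function: escape only &, <, > and U+00A0."""
    # Escape the characters that would otherwise break the markup.
    escaped = EntitySubstitution.substitute_xml(text)
    # Afterwards, write non-breaking spaces as &nbsp; (doing this first
    # would let the step above escape the ampersand of &nbsp; itself).
    return escaped.replace('\u00a0', '&nbsp;')


# Assumption: HTMLFormatter takes the function via entity_substitution.
my_formatter = HTMLFormatter(entity_substitution=keep_unicode_escape_nbsp)

# Sample markup (made up): Cyrillic text, an &nbsp; and an &amp;.
soup = BeautifulSoup('<span>Привет,&nbsp;мир &amp; все</span>', 'html.parser')
pretty = soup.prettify(formatter=my_formatter)
print(pretty)
```

Any other character that should come out as an entity would need its own replace call (or a lookup table) inside the function.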
From the documentation:
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
If this is not desirable, do not use the HTML formatter:
soup.prettify()
#'<span>\n Привет, мир\n</span>'
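To see the difference side by side, here is a small sketch that prettifies the same soup with the default formatter and with formatter='html'. Whether Cyrillic letters actually become named entities depends on the installed Beautiful Soup version, so the exact entity output is not shown here. The check at the end relies only on the standard library's html.unescape, which decodes any such entities back to the original characters.

```python
import html

from bs4 import BeautifulSoup

soup = BeautifulSoup('<span>Привет, мир</span>', 'html.parser')

# Default formatter ("minimal"): Unicode text is left as it is.
default_output = soup.prettify()

# HTML formatter: characters with a known entity name may be replaced.
entity_output = soup.prettify(formatter='html')

print(default_output)
print(entity_output)

# The entity form is the same text in a different spelling:
# decoding the entities gives back the default output.
print(html.unescape(entity_output) == default_output)
```

So the "strange" characters are not a different encoding. They are HTML character references, and a browser renders them as the original Cyrillic text.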