BeautifulSoup prettify encoding non-English (Cyrillic) characters strangely

I have HTML with Cyrillic characters, and I am using BeautifulSoup4 to process it. It works great, but when I call prettify, it converts all the Cyrillic characters to something else. Here is a dummy example using Python 3:

from bs4 import BeautifulSoup

hello = '<span>Привет, мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("Before prettify:\n{}".format(soup))
soup = soup.prettify(formatter='html')
print("\nafter prettify:\n{}".format(soup))

Here is the output it generates:

Before prettify:
<span>Привет, мир</span>

after prettify:
<span>
 &Pcy;&rcy;&icy;&vcy;&iecy;&tcy;, &mcy;&icy;&rcy;
</span>

It formats the HTML properly (putting the tags on their own lines), but it converts the Cyrillic characters to something else (I'm not even certain what encoding that is, to be honest).
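For reference, the `&Pcy;`-style sequences in the output above are HTML named character references rather than a text encoding. A minimal sketch using only the standard library's `html.unescape` (not part of BeautifulSoup; the variable names are illustrative) shows that they decode back to the original text:

```python
import html

# The text node exactly as prettify(formatter='html') printed it above.
escaped = "&Pcy;&rcy;&icy;&vcy;&iecy;&tcy;, &mcy;&icy;&rcy;"

# html.unescape resolves HTML5 named character references to Unicode.
decoded = html.unescape(escaped)
print(decoded)
```

So the characters are not lost or corrupted; a browser would render both forms identically.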

I have tried various things to prevent this: prettify(encoding=None, formatter='html') and prettify(encoding='utf-8', formatter='html'). I have also tried changing the way I create the soup object: soup = BeautifulSoup(hello.encode('utf-8'), 'html.parser') and soup = BeautifulSoup(hello, 'html.parser', from_encoding='utf-8'). Nothing seems to change what happens to the Cyrillic characters during prettify.

I figure this must be a very simple mistake I am making with encoding parameters somewhere, but after searching the internet and the BS4 documentation, I am unable to figure it out. Is there a way to use BeautifulSoup's prettify but keep the Cyrillic characters as they were originally, or is this not possible?

EDIT: I have now realized (thanks to DYZ's answer) that removing formatter='html' from the call to prettify stops BeautifulSoup from converting the Cyrillic characters. Unfortunately, this also removes any &nbsp; entities in the document. After having a look at BS4's output-formatters documentation, it seems the likely solution is to create a custom formatter using BS's Formatter class and specify it in the call to prettify: soup.prettify(formatter=my_formatter). I'm not sure yet what that would entail, though.
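One possible shape for such a custom formatter is sketched below. It assumes Beautiful Soup 4.8.0 or later, where `bs4.formatter.HTMLFormatter` accepts an `entity_substitution` callable. The helper names `escape_keep_nbsp` and `prettify_keep_nbsp` are hypothetical, invented for this illustration. The idea is to escape only `&`, `<` and `>` (as the default formatter does) and then write the non-breaking space, which the parser stores as the character U+00A0, back out as `&nbsp;`:

```python
import html


def escape_keep_nbsp(text):
    """Escape only &, < and >, then write U+00A0 back out as &nbsp;."""
    escaped = html.escape(text, quote=False)
    return escaped.replace("\xa0", "&nbsp;")


def prettify_keep_nbsp(markup):
    """Prettify markup with a formatter built from escape_keep_nbsp."""
    # Imported here so the helper above can be used without bs4 installed.
    from bs4 import BeautifulSoup
    from bs4.formatter import HTMLFormatter

    # Assumption: bs4 >= 4.8.0, which provides bs4.formatter.HTMLFormatter.
    my_formatter = HTMLFormatter(entity_substitution=escape_keep_nbsp)
    soup = BeautifulSoup(markup, "html.parser")
    return soup.prettify(formatter=my_formatter)
```

Calling `prettify_keep_nbsp('<span>Привет,&nbsp;мир</span>')` should leave the Cyrillic text untouched while emitting `&nbsp;` between the words. Note that without `formatter='html'` the non-breaking space is not actually dropped: it is written as a literal U+00A0 character, which is easy to mistake for an ordinary space.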

From the documentation:

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible.

If this is not desirable, do not use the HTML formatter:

soup.prettify()
#'<span>\n Привет, мир\n</span>'

