
Beautiful Soup 3: convert two-byte Unicode sequences to actual Unicode characters

I'm using Beautiful Soup 3 and Python 2.7 to scrape UTF-8 encoded web pages that contain non-ASCII characters (umlauts). I'm getting the text that I want, but all Unicode characters are returned as two-byte character sequences instead of the actual Unicode character. (The string is obtained by using soup.find() and converting the NavigableString results into a string with str().)

For example: I get FahrvergnÃ¼gen instead of Fahrvergnügen.

I've tried pretty much all permutations of encode('utf-8'), decode('utf-8') and unicode(), but nothing returns the umlaut instead of the weird two-byte sequence.

I'm pretty sure that there's a simple solution; I just can't figure out what command to use to convert a BS NavigableString or a plain old string that contains FahrvergnÃ¼gen to Fahrvergnügen, or to ensure that the weird two-byte sequences aren't returned in the first place.

BTW, Ã¼ is C3BC; however, the code for a lowercase u umlaut is 00FC.

The characters you are looking at look like double-encoded UTF-8. If the input is hosed, there really isn't anything BeautifulSoup can do to rectify it.

BeautifulSoup basically always returns Unicode, which is just as it should be (unless you are actually into manipulating encodings, in which case it's a hopeless hassle).

It is possible, though unlikely, that BeautifulSoup is the source of the double-encoding. You can override the character set of the scraped page if you are certain that it is properly UTF-8; use BeautifulSoup(..., fromEncoding='utf-8') when creating the BeautifulSoup object.

"Fahrvergnügen" in UTF-8 is represented by the bytes 46 61 68 72 76 65 72 67 6e c3 bc 67 65 6e (hex), where c3 bc is the UTF-8 encoding of U+00FC.

When this string is incorrectly converted to UTF-8 as if it were in a legacy 8-bit encoding such as ISO-8859-1 (where 0xc3 is Ã and 0xbc is ¼), the result is 46 61 68 72 76 65 72 67 6e c3 83 c2 bc 67 65 6e, which is presumably what you are looking at.
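
To make the byte sequences above concrete, here is a minimal sketch that reproduces both of them. The helper name hex_bytes is made up for this illustration, and the code is written so that it should run on Python 2.7 as well as Python 3; it only shows how the error can arise, not where in your pipeline it actually happens.

```python
import binascii


def hex_bytes(data):
    """Illustrative helper: format bytes as space-separated lowercase hex pairs."""
    digits = binascii.hexlify(data).decode("ascii")
    return " ".join(digits[i:i + 2] for i in range(0, len(digits), 2))


correct = u"Fahrvergn\u00fcgen"            # U+00FC is the lowercase u umlaut
utf8_bytes = correct.encode("utf-8")       # the properly encoded page content

# The mistake: reading those UTF-8 bytes as if they were ISO-8859-1 text.
# The single umlaut turns into two characters, U+00C3 and U+00BC.
misread = utf8_bytes.decode("iso-8859-1")

# Encoding the misread text as UTF-8 again gives the double-encoded bytes.
double_encoded = misread.encode("utf-8")

print(hex_bytes(utf8_bytes))       # 46 61 68 72 76 65 72 67 6e c3 bc 67 65 6e
print(hex_bytes(double_encoded))   # 46 61 68 72 76 65 72 67 6e c3 83 c2 bc 67 65 6e
```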

You can revert this double-encoding if you know precisely the nature of the error, but this is not (straightforwardly) automatable: you need to examine every encoding error and figure out (or guess) which characters it is properly supposed to represent.
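
If you are confident that the damage is exactly one round of "UTF-8 bytes read as ISO-8859-1", reverting it could look roughly like the sketch below. The function name undo_double_utf8 is hypothetical, and the ISO-8859-1 assumption is baked in; if the wrong decoding step actually used a different code page (Windows-1252, say), this will not undo it correctly. Pass it Unicode text (a unicode object on Python 2), not a byte string.

```python
def undo_double_utf8(text):
    """Hypothetical helper: undo one round of UTF-8 misread as ISO-8859-1.

    Assumes `text` is Unicode text. Returns it unchanged when it does not
    fit this specific error pattern.
    """
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character lies outside ISO-8859-1 or the recovered bytes
        # are not valid UTF-8, so this is not the error we know how to fix.
        return text


garbled = u"Fahrvergn\u00c3\u00bcgen"        # the two-character sequence from the question
repaired = undo_double_utf8(garbled)
print(repaired == u"Fahrvergn\u00fcgen")     # True

# Text that is already correct is left alone.
print(undo_double_utf8(repaired) == repaired)  # True
```

Note that this still guesses: text that legitimately contains Ã followed by ¼ would be "repaired" too, which is why the fix cannot be applied blindly.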
