Converting unknown string to Unicode - Python

I want to use this dictionary file, which is supposed to contain Japanese characters, but for some reason it shows nonsense characters such as "ä¹™ 勹 月 ç”° 亀".

The introduction of the file states: "The encoding scheme now in use is no longer EUC-JP and the convenient 2 bytes for the JIS x 208 and 3 bytes for the JIS x 0212. The encoding of this file is now UTF-8, and as such, the byte length of each character is highly variable. Processing Unicode properly requires that your software does not rely on a fixed byte length. The primary reason for the change of encoding method is that the JIS x 0213 standard kanji are not defined in the Extended Unix Code Japanese encoding scheme which predates it (EUC-JP)."

I tried, without success, to decode it using Python 3:

unknown_string = "𪚲 : ä¹™ 勹 月 ç”° 亀"
decoded_string = unknown_string.decode('unicode_escape').encode('latin-1').decode('utf8')
print(decoded_string)

(results in printing 𪚲 : ä¹™ 勹 月 ç”° 亀)

unknown_string = "𪚲 : ä¹™ 勹 月 ç”° 亀"
decoded_string = unknown_string.encode('latin1').decode('utf-8')
print(decoded_string)

(results in UnicodeEncodeError: 'latin-1' codec can't encode character '\u0161' in position 2: ordinal not in range(256))

I also tried looking at the bytes, but I see no connection. For instance, 化's hex value is \xE5\x8C\x96, but in the file it is replaced with åŒ–, which has the value \xC3\xA5\xC5\x92\xE2\x80\x93.

How could I retrieve the original Japanese characters?

You can use this file from jpnetkit to prepare the dict:

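# Kradfile is the class provided by the jpnetkit file mentioned above
krad = Kradfile()
krad.get_kradfile()
kanji_dict = krad.prepare_radikals()
for w in kanji_dict[u'㔟']:
    print(w)  # print() call so this also runs on Python 3
# 丶 力 九 生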

If you want to parse the file from scratch, you can check this link to get an unbroken version.
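If you do parse the file yourself, a minimal sketch could look like the following. It assumes each entry has the form shown in the question (a kanji, a colon, then space-separated components) and that lines starting with `#` are comments. `parse_krad_lines` is a hypothetical helper name and is not part of jpnetkit.

```python
def parse_krad_lines(lines):
    """Map each kanji to its list of components.

    Assumes entries look like "kanji : comp1 comp2 ..." and that lines
    beginning with '#' are comments (an assumption about the file format).
    """
    kanji_dict = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        kanji, sep, components = line.partition(":")
        if not sep:
            continue  # no colon, so not an entry line
        kanji_dict[kanji.strip()] = components.split()
    return kanji_dict


# Small in-memory sample using the entry quoted in the question.
sample = ["# comment line", "𪚲 : 乙 勹 月 田 亀"]
print(parse_krad_lines(sample))
```

With the real file, you could pass an open file object instead of the sample list, opening it with an explicit `encoding="utf-8"` so that Python does not fall back to a platform default.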

Or just download and unzip it. I think the cause is that the link is meant to be downloaded, not opened in the browser. When the browser opens this gz file, it doesn't know the charset and therefore displays nonsense characters.
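The byte values quoted in the question fit this explanation. As a sketch, assuming the charset that was wrongly applied is Windows-1252 (an assumption, since the page does not say which one was used), the following reproduces the garbled bytes for 化 and then reverses them. It would also explain why encoding with latin-1 failed: characters such as š exist in Windows-1252 but not in Latin-1.

```python
# Assumption: the UTF-8 bytes were misread as Windows-1252 (cp1252).
original = "化"
utf8_bytes = original.encode("utf-8")      # the three UTF-8 bytes of 化

# Simulate the misreading: interpret the UTF-8 bytes as cp1252 text.
garbled = utf8_bytes.decode("cp1252")
garbled_bytes = garbled.encode("utf-8")    # bytes of the garbled text

# Reverse it: recover the raw bytes via cp1252, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes)
print(garbled_bytes)
print(repaired)
```

This round trip is not reliable for every character, because cp1252 leaves a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined and Python's codec raises an error on them. Downloading the original file and reading it as UTF-8 avoids the problem entirely.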
