
Wrong Cyrillic characters in a Russian file - can't decode/encode properly

I've been struggling with encoding for a while as I'm building a multilingual database with sqlite3 in Python. So far, I've solved everything, thanks to Google and articles on Stack Overflow. I had problems with Russian, Slovenian, Polish, Spanish, French... but it's all solved, apart from this ONE file I can't fix.

I thought I had found a possible solution on this website: http://www.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/ . I even found a decoder, which got me really close to solving the problem, but it only produced partially understandable Russian. (I'm sure it can help in other cases though: http://2cyr.com/decode/?lang=fr ; it also exists in English.)

But this last file is gonna be the end of me. Here's the major issue: I KNOW it's Russian, because the linguist who gave it to me built it and knows it's in Russian. BUT the file itself looks like this:

£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÇÏ    UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÊ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍÕ    UNK £ËÁÀÝÉÊ UNKA

According to my shell, it's encoded in UTF-8. I've therefore been trying to decode it as UTF-8 and encode it into every Russian encoding I could find (ISO-8859-5, koi8_r, koi8_u, cp1252, cp1251...). It never worked. I also tried saving the file in each of these encodings and decoding the other way around, without much success...

It has to go into a database (SQLite), and I know the required encoding for this is UTF-8. The previous Russian file I dealt with was "properly" written (in Cyrillic), and I just had to figure out which encoding to use. But here, I feel like I've tried everything and I'm just not getting any results...

I'm actually wondering if decoding such a file is even possible, since it's not Cyrillic to start with.

Any suggestion would be welcome :)

The first and foremost problem: the text is not in UTF-8, it is in KOI8-R. So if you need to decode it in Python, you may refer to this answer, "string encode / decode"; it might give you some clue.
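As a rough sketch of what that diagnosis implies: the sample in the question looks like KOI8-R bytes that were displayed as if they were Latin-1. Assuming that is what happened, encoding the garbled text back to Latin-1 recovers the original bytes, which can then be decoded as KOI8-R. The function name `repair_koi8r_mojibake` is made up for this illustration, and the code assumes Python 3.

```python
def repair_koi8r_mojibake(text: str) -> str:
    """Undo KOI8-R bytes that were mistakenly decoded as Latin-1.

    Assumption: every character in `text` fits in Latin-1; otherwise
    encode() raises UnicodeEncodeError and a different repair is needed.
    """
    raw = text.encode("latin-1")   # back to the original byte values
    return raw.decode("koi8_r")    # interpret those bytes as KOI8-R


# First line of the sample from the question
sample = "£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA"
repaired = repair_koi8r_mojibake(sample)
print(repaired)
```

If the file on disk still holds the raw KOI8-R bytes (rather than the garbled text re-saved as UTF-8), the round trip is unnecessary and opening the file with `encoding="koi8_r"` should be enough.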

I have decoded the text you posted - enjoy:

ёкающее UNK ёкающий UNKA
ёкающего    UNK ёкающий UNKA
ёкающей UNK ёкающий UNKA
ёкающем UNK ёкающий UNKA
ёкающему    UNK ёкающий UNKA
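Once the text is decoded into a Python string, getting it into SQLite needs no further manual encoding step, because Python 3's `sqlite3` module stores `str` values as UTF-8 by default. The sketch below is only an illustration: the table name, the column names, and the reading of each line as "form, tag, lemma, lemma tag" are assumptions, and the byte string stands in for the contents of the real file.

```python
import sqlite3

# Stand-in for the raw file contents: one line of the sample as KOI8-R bytes
koi8_bytes = "ёкающее\tUNK\tёкающий\tUNKA\n".encode("koi8_r")

# Decode once, at the boundary; from here on everything is a Python str
text = koi8_bytes.decode("koi8_r")

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(
    "CREATE TABLE forms (form TEXT, tag TEXT, lemma TEXT, lemma_tag TEXT)"
)
for line in text.splitlines():
    form, tag, lemma, lemma_tag = line.split()
    # Parameter binding passes the str through; sqlite3 stores it as UTF-8
    conn.execute(
        "INSERT INTO forms VALUES (?, ?, ?, ?)",
        (form, tag, lemma, lemma_tag),
    )
conn.commit()

rows = conn.execute("SELECT form, lemma FROM forms").fetchall()
print(rows)
```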
