Erroneous Cyrillic in Russian file - can't decode/encode properly

Question

I've been struggling with encoding for a while as I'm building a multilingual database with sqlite3 in Python. So far, I've solved everything, thanks to Google and articles on Stack Overflow. I had problems with Russian, Slovenian, Polish, Spanish, French... but it's all solved, apart from this ONE file I can't fix.

I thought I had found a possible solution on this website: http://www.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/ , and I even found a decoder, which got me really close to solving the problem. But it only produced partially understandable Russian... (I'm sure it can help in other cases though: http://2cyr.com/decode/?lang=fr and it also exists in English).

But this last file is gonna be the end of me. Here's the major issue: I KNOW it's Russian because the linguist who gave it to me built it, and knows it's in Russian. BUT, the file itself looks like this:

£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÇÏ    UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÊ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍÕ    UNK £ËÁÀÝÉÊ UNKA

According to my shell, it's encoded in UTF-8. I've therefore been trying to decode it as UTF-8 and encode it into all the Russian encodings I could find (ISO-8859-5, koi8_r, koi8_u, cp1252, cp1251...). It never worked. I also tried saving the file in all these encodings and decoding the other way around, without much success...

It has to go in a database (sqlite), and I know the required encoding for this is UTF-8. The previous Russian file I dealt with was "properly" written (in Cyrillic), and I just had to figure out which encoding to use. But here, I feel like I've tried everything, and I'm just not getting any results...

I'm actually wondering if decoding such a file is even possible, since it's not Cyrillic to start with.

Any suggestion would be welcome :)

Answer 1

The first and foremost problem: the text is not in UTF-8, it is in KOI8-R. So if you need to decode it via Python, you may refer to this answer - string encode / decode - it might give you some clue.

I have decoded the text you posted - enjoy:

ёкающее UNK ёкающий UNKA
ёкающего    UNK ёкающий UNKA
ёкающей UNK ёкающий UNKA
ёкающем UNK ёкающий UNKA
ёкающему    UNK ёкающий UNKA
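
For reference, here is a minimal Python 3 sketch of one way to get from the garbled lines to the text above. It assumes the original KOI8-R bytes were at some point misread as Latin-1 and re-saved as UTF-8, which would explain why the shell reports UTF-8. The function name `recover_koi8r` is made up for this example, and the fallback logic is only a heuristic, not a general encoding detector.

```python
def recover_koi8r(raw: bytes) -> str:
    """Best-effort recovery of Russian text that was originally KOI8-R.

    Hypothetical helper for illustration only.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the bytes are plain KOI8-R.
        return raw.decode("koi8_r")
    try:
        # Valid UTF-8 made of Latin-1 look-alikes: turn each character
        # back into its single original byte, then read those bytes
        # as KOI8-R.
        return text.encode("latin-1").decode("koi8_r")
    except UnicodeEncodeError:
        # Characters outside Latin-1: probably genuine UTF-8 text already.
        return text


garbled = "£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA"
print(recover_koi8r(garbled.encode("utf-8")))  # ёкающее UNK ёкающий UNKA
```

The returned value is an ordinary Python `str`, so it can be passed to sqlite3 as a normal query parameter or written back out with `open(path, "w", encoding="utf-8")`.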

Solution 1
0 accepted 2014-05-06 15:00:39
