
UnicodeEncodeError when reading a file

I am trying to read from the rockyou wordlist and write all words that are >= 8 chars to a new file.

Here is the code -

def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file = out_file, end = '')
        print("done")

if __name__ == '__main__':
    main()

Some words are not utf-8.

Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 6, in main
    print(line, file = out_file, end = '')
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>

Update

def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            out_file.write(line)
        print("done")

if __name__ == '__main__':
    main()

Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 3, in main
    for line in in_file:
  File "C:\Python\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte

Your UnicodeEncodeError: 'charmap' error occurs during writing to out_file (in print()).

By default, open() uses locale.getpreferredencoding(), which is the ANSI codepage on Windows (such as cp1252) and can't represent all Unicode characters, the '\u0e45' character in particular. cp1252 is a one-byte encoding that can represent at most 256 different characters, but there are about a million (1114111) Unicode characters. It can't represent them all.
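As a quick check (not part of the original answer), you can see which encoding open() and print() will use by default on your system; on many Windows setups this prints an ANSI codepage such as cp1252:

```
import locale
import sys

# Encoding open() falls back to when no encoding= argument is given.
print(locale.getpreferredencoding(False))

# Encoding used by print() when writing to the console.
print(sys.stdout.encoding)
```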

Pass an encoding that can represent all the desired data, e.g. encoding='utf-8' must work (as @robyschek suggested). If your code reads the utf-8 data without any errors, then it should be able to write that data using utf-8 too.


Your UnicodeDecodeError: 'utf-8' error occurs during reading in_file (for line in in_file). Not all byte sequences are valid utf-8, e.g. os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
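To illustrate (again, not from the original answer), arbitrary bytes usually fail to decode as utf-8 and raise exactly this kind of error:

```
import os

data = os.urandom(100)   # random bytes, very unlikely to all form valid UTF-8
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid utf-8:", exc)
```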

If you expect the file to be encoded as utf-8, you could pass the errors="ignore" parameter to open(), to ignore occasional invalid byte sequences. Or you could use some other error handler depending on your application.
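Putting the suggestions together, a minimal sketch of the script (assuming the file is meant to be UTF-8 and that silently dropping the occasional invalid byte sequence is acceptable; errors="replace" would keep a replacement marker instead):

```
def main():
    with open("rockyou.txt", encoding="utf-8", errors="ignore") as in_file, \
         open("rockout.txt", "w", encoding="utf-8") as out_file:
        for line in in_file:
            if len(line.rstrip()) >= 8:   # keep only candidates of 8+ characters
                out_file.write(line)
    print("done")

if __name__ == "__main__":
    main()
```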

If the actual character encoding used in the file is different, then you should pass that encoding instead. Bytes by themselves do not have any encoding; that metadata should come from another source (though some encodings are more likely than others: chardet can guess). For example, if the file content is an HTTP body, see "A good way to get the charset/encoding of an HTTP response in Python".
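If the encoding is unknown, a hedged sketch of guessing it with the third-party chardet package (the sample size here is arbitrary, and chardet's result is only a heuristic guess):

```
import chardet   # third-party: pip install chardet

with open("rockyou.txt", "rb") as f:    # read raw bytes, no decoding yet
    sample = f.read(100_000)            # a sample is usually enough for a guess

guess = chardet.detect(sample)          # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess)

# The guessed name can then be passed back to open(), e.g.:
# open("rockyou.txt", encoding=guess["encoding"])
```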

Sometimes broken software can generate mostly utf-8 byte sequences with some bytes in a different encoding. bs4.BeautifulSoup can handle some special cases. You could also try the ftfy utility/library and see if it helps in your case, e.g. ftfy may fix some utf-8 variations.
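For completeness, a small illustrative sketch of the two libraries mentioned (both third-party; the byte string and mojibake sample below are made up for the example):

```
from bs4 import UnicodeDammit   # third-party: pip install beautifulsoup4
import ftfy                     # third-party: pip install ftfy

raw = b"Sacr\xe9 bleu!"                      # latin-1 bytes, not valid UTF-8

# UnicodeDammit tries several encodings and reports the one it settled on.
dammit = UnicodeDammit(raw)
print(dammit.original_encoding, dammit.unicode_markup)

# ftfy works on text that was already decoded the wrong way (mojibake).
print(ftfy.fix_text("cafÃ©"))                # typically recovers "café"
```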

Hey, I had a similar problem. In the case of the rockyou.txt wordlist, I tried a few of the encodings Python has to offer, and I found that encoding = 'kio8_u' worked to read the file.
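Note (not in the original answer): the codec name registered in Python's codecs module is 'koi8_u' (KOI8-U); the spelling 'kio8_u' above would raise a LookupError. Assuming KOI8-U is what was meant, the idea is sketched below. It avoids decode errors because KOI8-U is a single-byte encoding in which every byte value is defined, though non-ASCII bytes will come out as Cyrillic characters rather than the originally intended ones.

```
# Assumes the codec the answer meant is Python's "koi8_u".
with open("rockyou.txt", encoding="koi8_u") as in_file:
    for line in in_file:
        ...  # every byte decodes, but non-ASCII bytes appear as Cyrillic characters
```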
