
Change to recognized encoding when reading a text file?

When a text file is open for reading using (say) UTF-8 encoding, is it possible to change the encoding during reading?

Motivation: It happens that you need to read a text file that was written using a non-default encoding. The text format may itself contain information about the encoding used. Take an HTML file as an example, or XML, AsciiDoc, and many others. In such cases, the lines above the encoding information are allowed to contain only ASCII or some default encoding.

In Python, it is possible to read the file in binary mode and convert the lines of type bytes to str on your own. When the information about the encoding is found on some line, you just switch the encoding used when converting the lines to Unicode strings.
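The manual approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the name decode_lines and the pattern ENCODING_RE are hypothetical, the declaration is assumed to look like encoding="..." or charset=..., and the target encoding is assumed to be ASCII-compatible so that splitting the bytes on newlines is safe (it would not be for UTF-16).

```python
import io
import re

# Hypothetical pattern: matches declarations such as encoding="..." or charset=...
ENCODING_RE = re.compile(rb'(?:encoding|charset)\s*=\s*["\']?([A-Za-z0-9._-]+)')


def decode_lines(raw_lines, default="ascii"):
    """Yield str lines from an iterable of bytes lines.

    Lines are decoded with `default` until a declaration is found; the
    declaring line and every line after it use the declared encoding.
    """
    encoding = default
    found = False
    for raw in raw_lines:
        if not found:
            match = ENCODING_RE.search(raw)
            if match:
                # The encoding name itself is plain ASCII.
                encoding = match.group(1).decode("ascii")
                found = True
        yield raw.decode(encoding)


# Usage with an in-memory "file"; a real file opened with mode "rb" works the same way.
sample = '<?xml version="1.0" encoding="iso-8859-2"?>\n<p>Příklad</p>\n'
lines = list(decode_lines(io.BytesIO(sample.encode("iso-8859-2"))))
```

Only the first declaration is honoured here; whether later declarations should override it depends on the file format.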

In Python 3, text files are implemented using TextIOBase, which also defines the encoding attribute, the buffer, and other things.
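For illustration, the two attributes mentioned can be inspected on an io.TextIOWrapper, the concrete TextIOBase subclass that open() returns in text mode. The in-memory stream below is just a stand-in for a real file, and the variable names are made up for this sketch.

```python
import io

# An in-memory binary stream standing in for a file opened with mode "rb".
raw = io.BytesIO("first line\nsecond line\n".encode("utf-8"))

# Wrapping it gives the same kind of object that open() returns in text mode.
text = io.TextIOWrapper(raw, encoding="utf-8")

declared = text.encoding      # the encoding name given at construction
underlying = text.buffer      # the binary stream that supplies the bytes
first = text.readline()       # decoded using `declared`
```

The buffer attribute is what makes it possible to get at the undecoded bytes of a file that was opened in text mode.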

Is there a nice way to change the encoding information (used for decoding the bytes) so that the following lines are decoded in the desired way?

Classic usage is:

  • Open the file in binary mode (bytes string)
  • Read a chunk and guess the encoding (for instance with a simple scan or a regular expression)

Then:

  • close the file and re-open it in text mode with the found encoding, or
  • move back to the beginning with seek(0), read the whole content as a bytes string, then decode the content using the found encoding.
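A minimal sketch of both variants listed above, assuming the declaration appears within the first kilobyte and looks like encoding="..." or charset=... . The function names, the chunk size, and the regular expression are all hypothetical choices for this example, not part of any library.

```python
import re

# Hypothetical pattern for declarations such as encoding="..." or charset=...
DECL_RE = re.compile(rb'(?:encoding|charset)\s*=\s*["\']?([A-Za-z0-9._-]+)')


def sniff_encoding(chunk, default="utf-8"):
    """Guess the encoding from a bytes chunk, falling back to `default`."""
    match = DECL_RE.search(chunk)
    return match.group(1).decode("ascii") if match else default


def read_text_with_declared_encoding(path, chunk_size=1024):
    """Variant 2: sniff, seek(0), read everything as bytes, then decode."""
    with open(path, "rb") as f:
        encoding = sniff_encoding(f.read(chunk_size))
        f.seek(0)
        return f.read().decode(encoding)


def open_with_declared_encoding(path, chunk_size=1024):
    """Variant 1: sniff in binary mode, close, re-open in text mode."""
    with open(path, "rb") as f:
        encoding = sniff_encoding(f.read(chunk_size))
    return open(path, encoding=encoding)
```

The second function returns an ordinary text file object, so the caller is responsible for closing it, for example by using it in a with statement.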

See this example: Detect character encoding in an XML file (Python recipe). Note: the code is a little old, but still useful.
