
Change to recognized encoding when reading a text file?

When a text file is open for reading using (say) UTF-8 encoding, is it possible to change encoding during the reading?

Motivation: It happens that you need to read a text file that was written using a non-default encoding. The text format may itself contain the information about the encoding that was used; HTML, XML, ASCIIDOC and many other formats are examples. In such cases, the lines above the encoding declaration are allowed to contain only ASCII or some default encoding.

In Python, it is possible to read the file in binary mode and translate the lines of bytes type to str on your own. When the information about the encoding is found on some line, you simply switch the encoding used when converting the following lines to Unicode strings.
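A minimal sketch of that manual approach, assuming the declaration looks like an XML/HTML-style encoding="..." attribute (the function name and the regular expression are only illustrative):

```python
import re

def read_lines_switching_encoding(path, default="ascii"):
    """Decode lines with a default encoding until a declaration is found."""
    encoding = default
    lines = []
    with open(path, "rb") as f:
        for raw in f:
            # Illustrative pattern for a declaration such as encoding="utf-8"
            match = re.search(rb'encoding=["\']([\w.-]+)["\']', raw)
            if match:
                # Switch the encoding used for this and the following lines
                encoding = match.group(1).decode("ascii")
            lines.append(raw.decode(encoding))
    return lines
```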

In Python 3, text files are implemented using TextIOBase, which also defines the encoding attribute, the buffer attribute, and other things.

Is there any nice way to change the encoding (used for decoding the bytes) so that the following lines are decoded in the wanted way?
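One possibility, shown here only as a hedged sketch: since Python 3.7 a text file object has a reconfigure() method that accepts an encoding argument, and the underlying buffer can be inspected with peek() before any text has been decoded (the file name and the regular expression below are assumptions about the input):

```python
import re

with open("example.xml", encoding="ascii") as f:      # hypothetical file
    head = f.buffer.peek(1024)                        # raw bytes; position is not advanced
    match = re.search(rb'encoding=["\']([\w.-]+)["\']', head)
    if match:
        # reconfigure() is available since Python 3.7; here it is called
        # before any text has been read through the wrapper.
        f.reconfigure(encoding=match.group(1).decode("ascii"))
    text = f.read()                                   # decoded with the detected encoding
```

Calling reconfigure() after some text has already been read may not be allowed, depending on the Python version, which is why this sketch peeks at the raw buffer instead of reading through the text wrapper first.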

Classic usage is:

  • Open the file in binary mode (bytes strings)
  • Read a chunk and guess the encoding (for instance with a simple scan or a regular expression)

Then:

  • close the file and re-open it in text mode with the found encoding, or
  • move back to the beginning with seek(0), read the whole content as a bytes string, then decode the content using the found encoding (see the sketch after this list).
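A short sketch of that second variant, assuming an XML/HTML-style declaration near the top of the file (the regular expression, the default encoding and the file name are illustrative):

```python
import re

def read_text_autodetect(path, default="utf-8", probe_size=1024):
    """Guess the encoding from the first bytes, then decode the whole file."""
    with open(path, "rb") as f:
        head = f.read(probe_size)
        # Illustrative pattern for a declaration such as encoding="iso-8859-1"
        match = re.search(rb'encoding=["\']([\w.-]+)["\']', head)
        encoding = match.group(1).decode("ascii") if match else default
        f.seek(0)                        # back to the beginning
        return f.read().decode(encoding)

content = read_text_autodetect("page.html")   # hypothetical file name
```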

See this example: Detect character encoding in an XML file (Python recipe). Note: the code is a little old, but still useful.
