
Python unicode errors reading files generated by other apps

I'm getting decoding exceptions when reading export files from multiple applications. I've been running into this for a month, as I learn far more about Unicode than I ever wanted to know, and some fundamentals are still missing. I understand UTF, I understand codepages, and I understand how they tend to be used in practice (e.g. a single codepage per document, though I can't imagine that's still true today--see the back page of a health statement with 15 languages).

  1. Is it true that utf-8 can and does encode every possible Unicode character? If so, how is it possible for one application to write a utf-8 file and another to be unable to read it? (See the quick check after this list.)
  2. When UTF is used, codepages are NOT used--is that correct? As I think it through, the codepage is the older style and is made obsolete by UTF. I'm sure there are some exceptions.
  3. UTF could also be looked at as a data compression scheme rather than an encoding one.
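For concreteness, here is a quick check in a Python shell (just a sketch, not code from any of the apps involved) that illustrates points 1 and 2: utf-8 will happily encode any code point, while a codepage such as cp1252 maps at most 256 byte values.

    # utf-8 can represent any Unicode code point...
    "\U0001F600".encode("utf-8")     # b'\xf0\x9f\x98\x80' (an emoji, four bytes)

    # ...while cp1252, like any single-byte codepage, covers at most 256 characters,
    # so most code points simply cannot be represented in it:
    "\U0001F600".encode("cp1252")    # raises UnicodeEncodeError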

But there I'm stuck, because in practice I have 6 different applications, made in different countries, which can create export files--3 in utf-8, 3 in cp1252--yet Python 3.7 cannot read them without error:

'charmap' codec can't decode byte 0x9d in position 1555855: character maps to <undefined>
'charmap' codec can't decode byte 0x81 in position 4179683: character maps to <undefined>
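For reference, a minimal sketch of the kind of call that produces this error (the filename is a placeholder): on a Western-locale Windows machine, open() without an explicit encoding falls back to cp1252 via the 'charmap' codec, and 0x9d and 0x81 are two of the byte values cp1252 leaves undefined.

    # "export.csv" is a placeholder name; the point is the missing encoding argument.
    with open("export.csv") as f:    # on Windows this defaults to the ANSI codepage (cp1252)
        data = f.read()              # -> UnicodeDecodeError on the first 0x9d byte

    # The byte on its own shows the same thing:
    b"\x9d".decode("cp1252")         # 'charmap' codec can't decode byte 0x9d ...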

I use Edit Pro to examine the files, and it reads them successfully. It points to a line that contains an extra pair of special double quotes: "Metro Exodus review: “Not only the best Metro yet, it's one of the best shooters in years” | GamesRadar+"

Removing that ” allows Python to continue reading the file, up to the next error.

Python reports it as char 0x9d, but a really old editor (Codewright, I believe) reports it as 0x94. Verified on the internet that it is an 0x94 and 0x93 pair, so it must be true. ;-)

It is very troublesome that I don't know for sure what the actual bytes are, as there are so many layers of translation, interpretation, formatting for display, etc.

So the Visual Studio debugger's report of 0x9d is a misdirect. What's going on in the Python library that it would report this?
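One way to see how both reports can be right (a sketch using only the standard library): the curly quote U+201D is a three-byte sequence in utf-8 whose last byte happens to be 0x9d, while the very same character is the single byte 0x94 in cp1252.

    s = "\u201d"               # ” RIGHT DOUBLE QUOTATION MARK
    s.encode("utf-8")          # b'\xe2\x80\x9d' -- the file's raw bytes end in 0x9d
    s.encode("cp1252")         # b'\x94'         -- the single-byte cp1252 value
    "\u201c".encode("cp1252")  # b'\x93'         -- the matching left quote

One plausible reading: Python, decoding the raw bytes as cp1252, chokes on the trailing 0x9d of the utf-8 sequence, while an editor that recognizes the utf-8 and thinks of the decoded characters in cp1252 terms reports 0x94/0x93 for the pair of quotes.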

How is this possible? I can find no info about how characters from one codepage can be invalid under UTF (if that's even the problem). What would I search under?

It should not be this hard. I have 30 years' experience programming in C++, SQL, you name it; learning new libraries and languages is just breakfast.

I also do not understand why the information needed to handle this is so hard to find. Surely numerous other programmers doing data conversions and imports/exports between applications have run into this for decades.

The files I'm importing are csv files from the 6 apps and json files from another. The 6 apps export in utf-8 and cp1252 (as reported by Edit Pro), and the other app exports json in utf-8, though I could also choose csv.

The 6 apps run on an iPhone and export files that I'm attempting to read on Windows 10. I'm running Python 3.7.8, though this problem has persisted since 3.6.3.

Thanks in advance

Dan

The error 'charmap' codec can't decode byte... shows that you are not using utf-8 to read the file. That's the source of your struggles on this one. Unless the file starts with a BOM (byte order mark), you kinda have to know how the file was encoded to decode it correctly.
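A minimal sketch of what that looks like in practice, assuming the files really are utf-8 and cp1252 as the editor reports (the paths are placeholders):

    # the utf-8 exports
    with open("utf8_export.csv", encoding="utf-8") as f:
        text = f.read()

    # the cp1252 exports
    with open("cp1252_export.csv", encoding="cp1252") as f:
        text = f.read()

    # "utf-8-sig" additionally strips the BOM (EF BB BF) if an app happens to write one
    with open("utf8_export.csv", encoding="utf-8-sig") as f:
        text = f.read()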

  1. utf-8 encodes all Unicode characters and Python should be able to read them all. Displaying them is another matter; you need fonts that cover those characters to do that part. You were reading with the 'charmap' (cp1252) codec, not 'utf-8', and that's why you had the error.

  2. "when utf is used"... there are several UTF encodings. utf-8, utf-16-be (big endian), utf-16-le (little endian), utf-16 (synonym for utf-16-le), utf-32 variants (I've never seen this in the wild) and variants that include the BOM (byte order mark) which is an optional set of characters at the start of the file describing utf encoding type.

But yes, UTF encodings are meant to replace the older codepage encodings.

  3. No, it's not compression. The encoded stream can be larger than the bytes needed to hold the string in memory. This is especially true of utf-8, less so of utf-16 (that's why Microsoft went with utf-16). But utf-8, as a superset of ASCII that does not have the byte-order issues utf-16 has, brings many other advantages (that's why all the sane people chose it). I can't think of a case where a UTF encoding would ever be smaller than the count of its characters; see the size comparison below.
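A quick size comparison (just a sketch with arbitrary strings) shows why it isn't compression: the encoded byte count is never smaller than the character count, and for ASCII-heavy text utf-16 is roughly double the size.

    ascii_text = "hello" * 1000       # 5000 characters
    cjk_text = "简体字" * 1000          # 3000 characters

    len(ascii_text.encode("utf-8"))   # 5000   (1 byte per ASCII character)
    len(ascii_text.encode("utf-16"))  # 10002  (2 bytes per character + 2-byte BOM)
    len(cjk_text.encode("utf-8"))     # 9000   (3 bytes per CJK character)
    len(cjk_text.encode("utf-16"))    # 6002   (2 bytes per character + 2-byte BOM)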


 