
Python open encoding failure

I have a script that logs data on a Windows machine (Win 7) using Python 2.7. I want to read these files on my RHEL machine using Python 3.5. I keep getting the following error (or similar):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 825929: ordinal not in range(128)

To make matters worse, the data are passed to the computer in hex/ASCII format (why the manufacturer did this, I do not know), so the integer 27035 shows up in the text file as the hex string 699b. The data will look something like this:

0001100011000190001600011000110001300013000120001200013000140001a0002
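The integer-to-hex framing described above can be sketched as follows (a minimal illustration of the conversion; the fixed 4-digit width is an assumption based on the sample line):

```python
# Each integer is stored as its 4-digit lowercase hex string (assumed
# framing, inferred from the example: 27035 <-> '699b').
value = 27035
encoded = format(value, '04x')   # hex string as it appears in the file
decoded = int(encoded, 16)       # parse it back to an integer

assert encoded == '699b'
assert decoded == 27035
```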

To write the data in Python 2.7, I simply do:

with open('dst.txt', 'w') as fid:
    fid.write(data_stream)
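For context: under Python 2, `open(..., 'w')` performs no encoding step at all, so whatever bytes are in `data_stream` land in the file verbatim, stray non-ASCII bytes included. A sketch of the Python 3 equivalent, using a hypothetical sample with one stray byte:

```python
import os
import tempfile

# Python 2's text mode wrote byte strings verbatim; the Python 3
# equivalent is binary mode. data_stream is a made-up sample containing
# one non-ASCII byte (0xe6), like the one from the traceback.
data_stream = b'0001100011000190001600011\xe6000'

path = os.path.join(tempfile.mkdtemp(), 'dst.txt')
with open(path, 'wb') as fid:
    fid.write(data_stream)

with open(path, 'rb') as fid:
    raw = fid.read()

assert raw == data_stream  # round-trips byte for byte, no codec involved
```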

I had no problem reading these files when using 2.7 on my office computer, but after switching to 3.5 I do.

This used to work under 2.7:

with open('src.txt', 'r') as tmp:
    data = tmp.read().split('\n')

Using the same script under 3.5 caused errors (as above), so I defined the encoding:

with open('src.txt', 'r', encoding='latin-1') as tmp:
    data = tmp.read().split('\n')
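One reason latin-1 "works" here: it maps every byte value 0x00–0xFF to a code point, so decoding can never raise, no matter how corrupt the input is. A quick demonstration:

```python
# latin-1 assigns a character to all 256 byte values, so decoding
# always succeeds -- even on bytes that are garbage in context.
raw = bytes(range(256))
text = raw.decode('latin-1')
assert len(text) == 256

# ascii, by contrast, only covers 0x00-0x7F:
try:
    raw.decode('ascii')
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```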

This works most of the time (strange, because open under Python 2.7 should default to ASCII; note that specifying encoding='ascii' explicitly still raises errors). At least I can read the file this way. The problem now is that not all of the lines contain the same number of characters (they should!). Infrequently, a line will be missing one or two characters. I find the shorter lines via:

for r in data:
    if len(r) < 7721:
        print(r)

Within these lines I find strange characters like:

Ö\221Á
Ö\231\231Ù

where \221 and \231 each show up as a single character (i.e. not four, as you would expect).
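Those escapes are octal byte values: `\221` is Python 2's repr of the single byte 0x91, a C1 control character in latin-1 with no printable glyph, which is why it echoes as an escape rather than four literal characters. A small check:

```python
# '\221' (octal) is the single byte 0x91. Decoded as latin-1 it becomes
# one unprintable control character, not a four-character string.
ch = b'\x91'.decode('latin-1')

assert len(ch) == 1        # one character
assert ord(ch) == 0o221    # octal 221 == hex 0x91 == decimal 145
```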

I guess my question is: what is going on here? I could throw away rows that do not have enough characters (this would be less than 1% of the data), but it just irks me that this does not work.

Is this caused by the data being converted to hex first, then written with ASCII encoding, then decoded via latin-1? (That is a lot going on.) If that is the case, why can I not decode the data by specifying ASCII encoding?
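To the last question: the ASCII codec only covers bytes 0x00–0x7F, so the byte 0xe6 from the traceback fails regardless of the hex-string framing of the payload:

```python
# 0xe6 (from the original traceback) lies outside ASCII's 7-bit range,
# so decoding fails no matter what the surrounding bytes look like.
try:
    b'\xe6'.decode('ascii')
    raised = False
except UnicodeDecodeError:
    raised = True

assert raised
```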

EDIT: I loaded the data in different ways:

open('src.txt', 'rU', encoding='latin-1')
open('src.txt', 'rb')
open('src.txt', 'rU', encoding='Windows-1252')

The data remain the same, but the mis-translated portions changed:

fÖ\211Áffe7700
f\xd6\x89\xc1ffe7700
fÖ‰Áffe7700

Whatever is between the "f" and "ffe7700" is what is not working.
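The three dumps above are consistent with one underlying byte run, reconstructed here as an assumption from the `'rb'` output: the bytes do not change, only the codec's interpretation of them does. 0x89 in particular is an unprintable C1 control in latin-1 but the per-mille sign (‰) in Windows-1252, which is why the garbage changes shape while the surrounding hex digits stay put:

```python
# Bytes reconstructed from the 'rb' dump above (an assumption).
raw = b'f\xd6\x89\xc1ffe7700'

# latin-1: 0x89 maps to an unprintable control character.
assert raw.decode('latin-1') == 'f\xd6\x89\xc1ffe7700'

# Windows-1252: 0xd6 -> 'Ö', 0x89 -> '‰', 0xc1 -> 'Á'.
assert raw.decode('windows-1252') == 'fÖ‰Áffe7700'
```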

Perhaps the file is not latin-1.

I would use chardet to detect the file encoding.

$ chardetect src.txt
