
Output is formatted incorrectly

I am trying to run some code in PowerShell from a book I am reading to learn Python (3.7), but my output is not as expected and I can't see where I am going wrong.

This is the code:

from sys import argv

script, input_file = argv

def print_all(f):
    print(f.read())

def rewind(f):
    f.seek(0)

def print_a_line(line_count, f):
    print(line_count, f.readline())

current_file = open(input_file)

print("First let's print the whole file:\n")

print_all(current_file)

print("Now let's rewind, kind of like a tape.")

rewind(current_file)

print("Let's print three lines:")

current_line = 1
print_a_line(current_line, current_file)

current_line = current_line + 1
print_a_line(current_line, current_file)

current_line = current_line + 1
print_a_line(current_line, current_file)

The formatting of the output is where things seem to go wrong. (Screenshot: PowerShell output.)

As you can see, there is a ÿ added at the beginning of every line, and in the part where the lines should be printed one at a time, the second line is skipped.

The file test.txt contains:

this is line 1
this is line 2
this is line 3

P.S. I know there are more effective ways of doing some of these operations, but that is not the point here.

The first two bytes of your file are 0xFF and 0xFE. This is a "byte order mark" that indicates that the encoding of the file is Unicode 16-bit little-endian. Take a look at the third row of the table on the Wikipedia page; it shows the same two characters, ÿþ, that you see in your output.
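
If you want to confirm this yourself, open the file in binary mode and look at the first two raw bytes (a quick sketch, assuming the file is the test.txt from your question):

with open('test.txt', 'rb') as f:      # 'rb' = binary mode, no decoding at all
    first_two = f.read(2)

print(first_two)                       # b'\xff\xfe' for UTF-16 little-endian
print(first_two == b'\xff\xfe')        # True when the UTF-16-LE BOM is present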

To read the file, give the argument encoding='UTF-16' in the open call:

current_file = open(input_file, encoding='UTF-16')

The problem is that you're trying to treat UTF-16-LE data (from files, PowerShell pipes, or something else) as UTF-8, Latin-1, cp1252, or similar.

The solution is probably something like this:

current_file = open(input_file, encoding='utf-16')

More generally, you're supposed to know what kind of files you're reading. A UTF-16-with-BOM text file, a UTF-8 text file, and a whatever-my-OEM-code-page-is text file are all different things, and you need to pass the right encoding. Otherwise, you're just asking Python to pick a default and crossing your fingers.
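
To make this concrete, here is a small sketch (the byte string is typed in by hand rather than read from your file) showing how the same bytes come out completely differently depending on the encoding you decode them with, and what open() falls back to when you don't pass one:

import locale

# A BOM followed by 'this is line 1' in UTF-16-LE, like the start of the file.
raw = b'\xff\xfe' + 'this is line 1'.encode('utf-16-le')

print(raw.decode('utf-16'))          # 'this is line 1'  (correct)
print(repr(raw.decode('latin-1')))   # 'ÿþt\x00h\x00i\x00s\x00...'  (ÿþ plus a NUL after every character)

# The default open() uses when you don't pass encoding= (often a legacy
# code page such as cp1252 on Windows, not UTF-8).
print(locale.getpreferredencoding(False))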


To understand why this happens:

You only have plain English characters, which are all encodable in ASCII.

In UTF-16, each of those characters takes two bytes. One byte is the same as the ASCII value of that character, the other is 0.

In UTF-8, Latin-1, or another ASCII-compatible encoding, each of these characters takes one byte, the same one byte as in ASCII.

So, if you try to read the UTF-16 as if it were UTF-8 or Latin-1, every even byte is the character you want, and every odd byte is a 0, which means the NUL character. Depending on how you print things out, these NUL characters may be invisible, or print as spaces, or even truncate the string.
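
For example, using the first line of the file as a literal (a sketch, not your actual file):

text = 'this is line 1'

as_ascii = text.encode('ascii')          # one byte per character
as_utf16 = text.encode('utf-16-le')      # two bytes per character: the ASCII byte, then 0

print(len(as_ascii), len(as_utf16))      # 14 28
print(as_utf16[:8])                      # b't\x00h\x00i\x00s\x00'
print(repr(as_utf16.decode('latin-1')))  # 't\x00h\x00i\x00s\x00 ...' -- a NUL after every character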

The extra two characters at the start are the two bytes of the BOM (which is how you're supposed to distinguish UTF-16-LE from UTF-16-BE) being read as Latin-1 characters. The BOM is a special character, U+FEFF, which shows up as the two bytes \xFF \xFE in UTF-16-LE, but \xFE \xFF in UTF-16-BE. Those same bytes, in Latin-1, are the y-with-umlaut and thorn characters that you're seeing.
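
You can see exactly where those two characters come from with a couple of lines (a sketch):

bom = '\ufeff'                   # the BOM code point

le = bom.encode('utf-16-le')     # b'\xff\xfe'
be = bom.encode('utf-16-be')     # b'\xfe\xff'

print(le.decode('latin-1'))      # 'ÿþ'  (y-with-umlaut, thorn) -- what shows up in your output
print(be.decode('latin-1'))      # 'þÿ'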
