Python filename/path parsing wrong hebrew encoding (using optparse library)

Question

I have a problem with this snippet of code:

import optparse
parser = optparse.OptionParser(version=__version__,
    usage="%prog [options] file1 ... host[:dest]",
    description=main.__doc__)
parser.add_option("-c", "--config", help="Specify an alternate config "
    "file.  Default = '%s'" % config_file)
parser.add_option('-l', '--log-level', type="choice",
    choices=LOG_LEVELS.keys(),
    help="Override the default logging level. Choices=%s, Default=%s" %
        (",".join(LOG_LEVELS.keys()), LOG_LEVEL))
parser.add_option("-o", "--overwrite", action="store_true",
    help="If specified, overwrite existing files at destination.  If "
    "not specified, throw an exception if you try to overwrite a file")
parser.add_option('-s', "--speed", action="store_true",
    help="If specified, print the data transfer rate for each file "
        "that is uploaded (implies verbose option)")
parser.add_option('-v', '--verbose', action="store_true",
    help="If specified, print every file that is being uploaded and every "
        "directory that is being created")
parser.add_option("-u", "--user", help="The username to use for "
    "authentication.  Not needed if you have set up a config file.")
parser.add_option("-p", "--password", help="The password to use for "
    "authentication.  Not needed if you have set up a config file.")

parser.set_defaults(config=config_file, log_level=LOG_LEVEL)
options, args = parser.parse_args()
print (args)

When I print the args for a test with a Hebrew-named file, the output is: ['/root/mezeo_sdk/1/\\xfa\\xe5\\xeb\\xf0\\xe9\\xfa \\xf2\\xe1\\xe5\\xe3\\xe4.xlsx', 'hostname'] instead of /root/mezeo_sdk/1/"תוכנית עבודה.xlsx"

Also, the end result once the script uploads the file to the server (the way the filename was passed) is: http://i.imgur.com/pP6fA.png

The filename itself is fine on the Linux source machine: if I SCP the file to my own computer it looks OK. It only breaks when I transfer it to the file server using the Python script.

I also don't believe the problem is on the file server side, because Hebrew-named files uploaded through the web interface come out OK.

I think the problem is in how the optparse library is being used.

Answer 1

As always, I'll start with the Unicode suggested reading: you should really read either or both of

Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)

In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character¹. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. That is fine for plain English, but doesn't work for anything else. There's a global list of code points -- over a hundred thousand of them -- which are meant to hold every possible character, but clearly they don't fit in a byte.

The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in. To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding.

¹ Sort of.
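
To make the code point versus bytes distinction concrete, here is a minimal sketch. It is written for Python 3 (where `str` holds code points and `bytes` holds encoded data) rather than the Python 2 used in the question, and the variable names are only illustrative:

```python
# One Hebrew letter, written as an escape so the source stays ASCII.
tav = u"\u05ea"  # HEBREW LETTER TAV

# The same code point becomes different bytes under different encodings.
as_utf8 = tav.encode("utf-8")           # two bytes
as_cp1255 = tav.encode("windows-1255")  # one byte

# One code point, but a different number of bytes per encoding.
print(len(tav), len(as_utf8), len(as_cp1255))

# Decoding only round-trips with the encoding that produced the bytes.
assert as_utf8.decode("utf-8") == tav
assert as_cp1255.decode("windows-1255") == tav
```

The bytes on their own carry no record of which encoding produced them, which is why the encoding has to be known or agreed on separately.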

OK, now that that's out of the way, let's look at what you have. You have given a (raw) string -- a sequence of bytes:

\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4

which you would like to be the encoding of

תוכנית עבודה

A little bit of Googling tells me that you are using the Windows-1255 encoding, which extends ASCII by using the byte values above 127 to hold Hebrew letters. You want the string as Unicode, because that is the form in which text should be handled inside a program. So you should decode the sequence of bytes using the encoding "Windows-1255":

>>> s
'\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4'
>>> s.decode("Windows-1255")
u'\u05ea\u05d5\u05db\u05e0\u05d9\u05ea \u05e2\u05d1\u05d5\u05d3\u05d4'

Now you have the right sort of data. Next, you need to send the data to the server, which means encoding it in a standard encoding, namely "UTF-8":

>>> s.decode("Windows-1255").encode("utf-8")
'\xd7\xaa\xd7\x95\xd7\x9b\xd7\xa0\xd7\x99\xd7\xaa \xd7\xa2\xd7\x91\xd7\x95\xd7\x93\xd7\x94'
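
Applied to the whole argument list from the question, the same decode-then-encode step might look like the sketch below. The helper name `reencode` is hypothetical, and both the source encoding (Windows-1255) and the assumption that the server wants UTF-8 are guesses taken from this answer, not something optparse provides. The sketch uses explicit bytes literals so that it runs on Python 3, where `sys.argv` would already arrive as decoded `str` objects:

```python
def reencode(raw, source_encoding="windows-1255", target_encoding="utf-8"):
    """Decode raw bytes from source_encoding, then encode as target_encoding.

    Hypothetical helper: both encodings are assumptions about the
    environment and have to be confirmed for a real deployment.
    """
    return raw.decode(source_encoding).encode(target_encoding)


# The argument list from the question, as raw bytes.
raw_args = [
    b"/root/mezeo_sdk/1/\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4.xlsx",
    b"hostname",
]

# Pure-ASCII arguments such as b"hostname" pass through unchanged,
# because Windows-1255 and UTF-8 agree on the ASCII range.
utf8_args = [reencode(arg) for arg in raw_args]
```

Decoding once at the boundary, working with Unicode internally, and encoding once on the way out keeps the guesswork about encodings in a single place.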

Finally, you may wonder where the server went wrong. If you don't specify an encoding for your data, whoever receives it has to guess, which is an enterprise doomed to failure. In your case, it looks like you sent the raw bytes to the server, which then decoded them as latin-1. That gives the weird accented letters you see, because latin-1 uses the byte values above 127 not for Hebrew characters but for accented Western European letters.
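
The wrong guess is easy to reproduce. The sketch below (Python 3, illustrative variable names) decodes the same bytes twice. That the server used latin-1 is this answer's guess from the screenshot, not something confirmed:

```python
raw = b"\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4"

# Decoded with the encoding that actually produced the bytes.
as_hebrew = raw.decode("windows-1255")

# Decoded with the wrong guess. latin-1 maps every byte value N to code
# point U+00NN, so it never raises an error; it silently yields accented
# Latin letters instead: u'úåëðéú òáåãä'
as_latin1 = raw.decode("latin-1")

# Same bytes, same number of characters, completely different text.
assert len(as_hebrew) == len(as_latin1) == len(raw)
assert as_hebrew != as_latin1
```

Because latin-1 accepts every possible byte, a wrong guess never fails loudly, which is why the corruption only becomes visible once the name is displayed.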

Moral of the story: understand Unicode!

Answer 2

Printing a list shows the repr() of each string in it, which is why you see the escapes; if you print the strings themselves they should render correctly in your terminal emulator.

As for your imgur link, if that is what is shown on a webpage, you need to set the right encoding in the HTML.

>>> a=['/root/mezeo_sdk/1/\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4.xlsx', 'hostname']
>>> print a[0].decode('windows-1255')
/root/mezeo_sdk/1/תוכנית עבודה.xlsx
