How can I convert a str representation of a unicode string to unicode?

Question

I'm running a Python program on a user's computer in Portugal, where the user's username contains unicode characters. I would like to have os.path.expanduser('~') return something functional, since I use the resulting path for some file operations, but it currently returns a Python str representation of a unicode string:

>>> import os
>>> os.path.expanduser('~')
'C:\\Users\\V\xe2nia'

But this is a Python str ... how can I convert this to an actual unicode string that Windows will recognize as a valid filepath?

Answer 1

The function returned a byte string, not a unicode string. You need to decode it, given the encoding used for the string.

import os, sys
os.path.expanduser('~').decode(sys.getfilesystemencoding())

I'm presuming here that the encoding used was the filesystem encoding, which is available via sys.getfilesystemencoding(). It looks like latin-1 from here, but you can't be certain.
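
As a rough illustration of the decode step, the sketch below uses a hard-coded byte string as a stand-in for what os.path.expanduser('~') returned in the question, and it assumes those bytes are in code page 1252. On a real system the encoding should come from sys.getfilesystemencoding() instead. The names raw_home and assumed_encoding are made up for this example.

```python
import sys

# Stand-in for the byte string from the question (hypothetical, hard-coded).
raw_home = b'C:\\Users\\V\xe2nia'

# Assumption: the bytes are in the Western European ANSI code page.
assumed_encoding = 'cp1252'

# Decoding turns the bytes into a real Unicode string.
home = raw_home.decode(assumed_encoding)

print(repr(home))
# What this interpreter reports as its filesystem encoding:
print(sys.getfilesystemencoding())
```

Decoding with the wrong encoding may still succeed without an error and silently produce the wrong characters, which is why guessing from the look of the bytes is risky.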

You can also try to pass in a unicode path to os.path.expanduser() and have Python do the decoding for you:

os.path.expanduser(u'~')

Please read up on this and other Unicode issues in the Python Unicode HOWTO. If you don't understand the difference between an encoded bytestring and a Unicode string, please do read this excellent article as well.

Answer 2

Decoding the bytestring to Unicode with the filesystemencoding will only work if the path of the home directory is actually expressible in the filesystemencoding.

On Windows, the filesystemencoding used for byte-string-file-path I/O is the locale-dependent 'ANSI code page', which is, unfortunately, never a UTF, so there are always characters that can't be represented in byte-string-file-path functions. So for example, if the user's name contained a Japanese character but it was a Western European Windows install (using code page 1252, similar to ISO-8859-1), Martijn's example would fail.
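
To make the limitation concrete, here is a small sketch that tries to push a Japanese name through code page 1252. The variable names are hypothetical, and the codec is used directly rather than through any file API, so this only demonstrates the encoding step and not Windows itself.

```python
# A Western European name fits in code page 1252 ...
western = u'V\xe2nia'
western_bytes = western.encode('cp1252')

# ... but Japanese characters have no slot in that code page.
japanese = u'\u65e5\u672c'
try:
    japanese.encode('cp1252')
    representable = True
except UnicodeEncodeError:
    representable = False

# Forcing the conversion replaces each character with '?', so the
# original name cannot be recovered from the bytes.
lossy = japanese.encode('cp1252', 'replace')
round_trip = lossy.decode('cp1252')

print(representable)
print(round_trip == japanese)
```

Once the name has been mangled like this, no choice of decoding can bring it back, which is the failure mode described above.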

In most languages that use the C standard library byte-string-based file I/O functions, that's the end of it: in Java et al, you simply can't access files whose names have characters outside the ANSI code page.

Luckily, Python has specific support for Windows's Unicode filenames, using the native Win32 API calls instead of the C standard library. Using these, you can get the real Unicode filename as Windows understands it, avoiding the lossy mangling involved in converting it to a byte string and back.

In general you trigger Unicode filename support in Python 2 simply by passing a Unicode string into the function you're calling. Python will return Unicode strings in response:

>>> import os
>>> os.path.expanduser(u'~')
u'C:\\Users\\V\xe2nia'
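
The same "Unicode in, Unicode out" pattern applies to other path functions such as os.listdir(). The following is only a sketch: the answer describes Python 2 on Windows, and behaviour on other platforms or versions may differ (Python 3, for instance, treats plain string literals as Unicode already). text_type is a helper name invented here so the check works whichever string type the interpreter calls Unicode.

```python
import os

# The interpreter's Unicode string type (unicode on Python 2, str on Python 3).
text_type = type(u'')

# Passing Unicode arguments asks for Unicode results.
home = os.path.expanduser(u'~')
entries = os.listdir(u'.')

print(isinstance(home, text_type))
print(all(isinstance(name, text_type) for name in entries))
```

Keeping paths as Unicode from the first call onward avoids having to guess an encoding later.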

solution1
7 ACCEPTED 2012-11-07 17:52:51

solution2
1 2012-11-11 01:21:34