Converting Non-UTF-8 characters to UTF-8

Question

I have some files on my Linux system whose names may be in an encoding other than UTF-8. I want to convert these names from non-UTF-8 characters to UTF-8. How can I do that using a C library function or a Python script?

Answer 1

If you know the character encoding that is used to encode the filenames:

unicode_filename = bytestring_filename.decode(character_encoding)
utf8filename = unicode_filename.encode('utf-8')

If you don't know the character encoding, then in the general case there is no way to do the conversion without losing data: "non-utf8" is not specific enough. For example, if a filename contains the byte b'\xae', it can be interpreted differently depending on the filename encoding: it is u'®' in cp1252, but the same byte represents u'«' in cp437. There are modules such as chardet that let you guess the character encoding, but it is only a guess. "There Ain't No Such Thing as Plain Text."
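
The ambiguity is easy to check with the standard codecs. The following is only a small illustration in Python 3 syntax; the variable names are made up for the example.

```python
# The same single byte decodes to different characters
# depending on which encoding is assumed.
raw = b'\xae'

as_cp1252 = raw.decode('cp1252')  # registered sign, U+00AE
as_cp437 = raw.decode('cp437')    # left double angle quote, U+00AB

# The two interpretations give different UTF-8 byte sequences.
print(as_cp1252.encode('utf-8'))
print(as_cp437.encode('utf-8'))

# On its own, the byte is not valid UTF-8 at all.
try:
    raw.decode('utf-8')
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)
```

Both conversions succeed without error, so nothing in the data itself tells you which result is the intended one.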

Answer 2

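def converttoutf8(a, source_encoding):
    # Decode the raw bytes with their actual encoding, then re-encode as UTF-8.
    return a.decode(source_encoding).encode("utf-8")

Now, for every filename you iterate through, this returns the UTF-8 encoded filename, provided source_encoding is the encoding the name was actually written in. Decoding with "utf-8" itself would fail on names that are not already valid UTF-8.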

Or, even better, use convmv. It converts filenames from one encoding to another and takes a directory as an argument. Sounds perfect.
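
For a pure-Python alternative to convmv, the per-filename conversion can be applied to a whole directory roughly as sketched below. This is a minimal Python 3 sketch, not a tested tool: the function name rename_to_utf8 is hypothetical, it assumes every non-UTF-8 name in the directory uses the same known source encoding, and it assumes a filesystem that stores names as raw bytes (as typical Linux filesystems do).

```python
import os

def rename_to_utf8(directory, source_encoding):
    """Rename entries in `directory` whose names are not valid UTF-8.

    `source_encoding` is an assumption supplied by the caller; a wrong
    value yields wrong (but still valid UTF-8) names.
    Returns a list of (old_name, new_name) byte-string pairs.
    """
    renamed = []
    # A bytes path makes os.listdir return the raw bytes of each name.
    directory = os.fsencode(directory)
    for raw_name in os.listdir(directory):
        try:
            raw_name.decode('utf-8')
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        new_name = raw_name.decode(source_encoding).encode('utf-8')
        target = os.path.join(directory, new_name)
        if os.path.exists(target):
            continue  # do not overwrite an existing entry
        os.rename(os.path.join(directory, raw_name), target)
        renamed.append((raw_name, new_name))
    return renamed
```

A call might look like rename_to_utf8('/path/to/files', 'cp1252'), where both the path and the encoding are placeholders. Since a wrong encoding guess still renames the files, it is safer to try this on a copy of the directory first.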

2 answers

solution1
2 2015-10-13 11:21:34

solution2
0 2015-10-13 10:30:46
