Converting non-UTF-8 characters to UTF-8

Question

I have some files on my Linux system. Their file names may be in an encoding other than UTF-8. I want to convert these names from non-UTF-8 characters to UTF-8 characters. How can I do that using a C library function or a Python script?

Answer 1

If you know the character encoding that is used to encode the filenames:

unicode_filename = bytestring_filename.decode(character_encoding)
utf8filename = unicode_filename.encode('utf-8')
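As a rough illustration of how those two lines might be applied to real files, here is a minimal Python 3 sketch. The helper names `to_utf8_name` and `rename_to_utf8` are hypothetical, and it assumes every non-UTF-8 name in the directory uses the same, known source encoding.

```python
import os


def to_utf8_name(raw_name: bytes, source_encoding: str) -> bytes:
    """Re-encode one filename from source_encoding to UTF-8 (hypothetical helper)."""
    return raw_name.decode(source_encoding).encode("utf-8")


def rename_to_utf8(directory: bytes, source_encoding: str) -> list:
    """Rename entries in directory whose names are not valid UTF-8.

    Assumes all such names share source_encoding. Returns (old, new) pairs.
    """
    renamed = []
    # Passing a bytes path makes os.listdir return the raw byte names.
    for raw_name in os.listdir(directory):
        try:
            raw_name.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        new_name = to_utf8_name(raw_name, source_encoding)
        target = os.path.join(directory, new_name)
        if os.path.exists(target):
            continue  # do not overwrite an existing file
        os.rename(os.path.join(directory, raw_name), target)
        renamed.append((raw_name, new_name))
    return renamed
```

Calling `rename_to_utf8(os.fsencode("/some/dir"), "cp1252")` would rename only the entries that fail to decode as UTF-8; the path and encoding here are placeholders.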

If you don't know the character encoding, then in the general case there is no way to do the conversion without losing data. "Non-UTF-8" is not specific enough. For example, if a filename contains the byte b'\xae', it can be interpreted differently depending on the filename encoding: it is u'®' in cp1252, but the same byte represents u'«' in cp437. There are modules such as chardet that allow you to guess the character encoding, but it is only a guess: "There Ain't No Such Thing as Plain Text."
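The ambiguity described above can be reproduced with the standard library alone. This is a small Python 3 sketch; the function name `interpretations` is made up for the illustration.

```python
def interpretations(raw: bytes, encodings):
    """Decode the same bytes under several encodings.

    Returns {encoding: (decoded text, UTF-8 bytes of that text)}.
    """
    result = {}
    for encoding in encodings:
        text = raw.decode(encoding)
        result[encoding] = (text, text.encode("utf-8"))
    return result


if __name__ == "__main__":
    # The same input byte yields a different character, and therefore
    # different UTF-8 output, for each assumed source encoding.
    for name, (text, utf8) in interpretations(b"\xae", ["cp1252", "cp437"]).items():
        print(name, repr(text), utf8)
```

Both results are valid UTF-8, so nothing in the output reveals which source encoding was the right one.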

Answer 2

def converttoutf8(a):
    # Python 2 only: unicode() decodes the byte string a as UTF-8
    return unicode(a, "utf-8")

Now, for every filename you iterate through, this will return the UTF-8 formatted filename.

Or, even better, use convmv. It converts filenames from one encoding to another and takes a directory as an argument. Sounds perfect.
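If you want to drive convmv from a script, one possible approach is sketched below in Python 3. The wrapper functions `convmv_command` and `run_convmv` are hypothetical. The sketch assumes convmv is installed and relies on its `-f` (source encoding), `-t` (target encoding), `-r` (recurse) and `--notest` options. Without `--notest`, convmv only prints what it would rename.

```python
import subprocess


def convmv_command(directory: str, source_encoding: str, apply: bool = False) -> list:
    """Build the argument list for a convmv run (hypothetical wrapper)."""
    cmd = ["convmv", "-f", source_encoding, "-t", "utf8", "-r"]
    if apply:
        # Without --notest, convmv performs a dry run and renames nothing.
        cmd.append("--notest")
    cmd.append(directory)
    return cmd


def run_convmv(directory: str, source_encoding: str, apply: bool = False):
    """Run convmv; raises CalledProcessError if it exits with an error."""
    return subprocess.run(convmv_command(directory, source_encoding, apply), check=True)
```

Running it first with `apply=False` and checking the printed plan before repeating with `apply=True` seems the safer order.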

Answer 1: score 2, posted 2015-10-13 11:21:34
Answer 2: score 0, posted 2015-10-13 10:30:46
