Python zipfile module: zipfile.write() with filenames containing Turkish characters

Question

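On my system there are many Word documents, and I want to zip them using the Python module zipfile.

I have already found a solution for this, but on my system there are files whose names contain German umlauts and Turkish characters.

I adapted the method from that solution as follows, so that it can handle German umlauts in filenames: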

def zipdir(path, ziph):
    for root, dirs, files in os.walk(path):
        for file in files:
            current_file = os.path.join(root, file)
            print "Adding to archive -> file: "+str(current_file)
            try:
                #ziph.write(current_file.decode("cp1250")) #German umlauts ok, Turkish chars not ok
                ziph.write(current_file.encode("utf-8")) #both not ok
                #ziph.write(current_file.decode("utf-8")) #both not ok
            except Exception,ex:
                print "exception ---> "+str(ex)
                print repr(current_file)
                raise

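Unfortunately, my attempts to add logic for Turkish characters have not succeeded. Every time a filename contains a Turkish character, the code prints an exception like this: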

exception ---> [Error 123] Die Syntax f³r den Dateinamen, Verzeichnisnamen oder
die Datentrõgerbezeichnung ist falsch: u'X:\\my\\path\\SomeTurk?shChar?shere.doc'

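I have tried several combinations of string encode and decode calls, but none of them worked.

Can someone help me out?

I edited the code above to include the changes mentioned in the comments.

The following errors are now shown: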

...
Adding to archive -> file: X:\\my\path\blabla I blabla.doc
Adding to archive -> file: X:\my\path\bla bla³bla³bla³bla.doc
exception ---> 'ascii' codec can't decode byte 0xfc in position 24: ordinal not
in range(128)
'X:\\my\\path\\bla B\xfcbla\xfcbla\xfcbla.doc'
Traceback (most recent call last):
  File "Backup.py", line 48, in <module>
    zipdir('X:\\my\\path', zipf)
  File "Backup.py", line 12, in zipdir
    ziph.write(current_file.encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 24: ordinal
 not in range(128)

The ³ is actually a German ü.

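Edit

After trying the things suggested in the comments, I could not solve the problem.

So I switched to the Groovy programming language and used its zip capabilities.

Since this has turned into an opinion-based discussion, I decided to vote to close the topic.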

Answer 1

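If you do not need to inspect the ZIP file with an archiver later, you can always encode the filenames as base64 and restore them when extracting with Python.

To any archiver these filenames will look like gibberish, but the encoding will be preserved.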
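The base64 idea could look roughly like the sketch below. It is written for Python 3 (the code in this thread is Python 2), and encode_name and decode_name are hypothetical helper names, not part of zipfile. It uses the URL-safe base64 alphabet on the assumption that a "/" in the token would otherwise be read as a path separator inside the archive.

```python
import base64


def encode_name(name):
    """Turn any filename into an ASCII-only token (hypothetical helper)."""
    # The URL-safe alphabet avoids "/", which ZIP treats as a path separator.
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")


def decode_name(token):
    """Restore the original filename from a token (hypothetical helper)."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


original = "Türkçe_şğı.doc"
token = encode_name(original)      # ASCII-only name to store in the archive
restored = decode_name(token)      # original name, recovered on extraction
```

Note that this encodes a single name; for nested paths each path component would need to be encoded separately.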

Either way, to get a byte string (str in Python 2, a bytes object in Python 3) you have to encode(), not decode().

encode() serializes a unicode string into a byte string:

>>> u"\u0161blah".encode("utf-8")
'\xc5\xa1blah'

decode() goes the other way, from the byte string back to unicode:

>>> "\xc5\xa1blah".decode("utf-8")
u'\u0161blah'

The same goes for any other code page.

Sorry for stressing this, but people sometimes get confused about encoding and decoding.
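To make the code page point concrete, here is a small Python 3 sketch (the thread itself uses Python 2). The filename is made up for illustration. It round-trips one name through UTF-8 and through cp1254, the Windows Turkish code page, and shows that the same name does not fit into cp1252, the Western European one.

```python
name = "ağaç_für.doc"  # made-up name with Turkish and German letters

# UTF-8 can represent every character; non-ASCII ones take several bytes.
utf8_bytes = name.encode("utf-8")
utf8_roundtrip = utf8_bytes.decode("utf-8")

# cp1254 (Turkish) is a single-byte code page that contains all of them.
cp1254_bytes = name.encode("cp1254")
cp1254_roundtrip = cp1254_bytes.decode("cp1254")

# cp1252 (Western European) has ü and ç but no ğ, so encoding fails.
try:
    name.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False
```

This is one plausible reason a single legacy code page handles the umlauts but not the Turkish letters.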

If you need the files but do not care much about preserving umlauts and other special characters, you can use:

u"üsdlakui".encode("utf-8", "replace")

or:

u"üsdlakui".encode("utf-8", "ignore")

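This replaces characters that cannot be converted with a substitute character, or drops them entirely, instead of raising an encoding or decoding error. Note that every character can be encoded to UTF-8, so these error handlers only change the result for narrower codecs such as ASCII or a single-byte code page.

If the error being raised is something like UnicodeDecodeError: cannot decode character..., this may get you past it.

However, it will cause trouble for filenames that consist only of non-Latin characters.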
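A short Python 3 sketch of the two error handlers, using ASCII as the target so that they actually have something to do. The second string is a made-up example of a name consisting only of Turkish letters.

```python
mixed = "üsdlakui"

# "replace" substitutes a question mark for each unencodable character.
replaced = mixed.encode("ascii", "replace")   # b'?sdlakui'

# "ignore" silently drops each unencodable character.
ignored = mixed.encode("ascii", "ignore")     # b'sdlakui'

# A name made up only of non-ASCII letters loses everything.
all_turkish = "şğı"
emptied = all_turkish.encode("ascii", "ignore")  # b''

# With UTF-8 as the target the handlers change nothing.
unchanged = mixed.encode("utf-8", "replace") == mixed.encode("utf-8")
```

The empty result in the third case is the problem mentioned above: such a file would end up with no usable name at all.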

Now for something that might actually work.

Well,

'Sömethüng'.encode("utf-8")

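is bound to raise an ASCII codec error in Python 2. The literal is a plain byte string, not a unicode string, so calling encode() on it first triggers an implicit decode with the default ASCII codec, and that fails on the non-ASCII bytes. On top of that, the source file does not declare its encoding as UTF-8.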

whereas:

# -*- coding: UTF-8 -*-
u'Sömethüng'.encode("utf-8")

or

# -*- coding: UTF-8 -*-
unicode('Sömethüng', "utf-8").encode("utf-8")  # decode explicitly; the ASCII default would fail

with the encoding declared at the top of the file and the file saved as UTF-8, should work.
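The implicit step that Python 2 performs can be spelled out explicitly. The sketch below is Python 3, where bytes have no encode() method at all, so it imitates the hidden ASCII decode by hand. It is an illustration of the mechanism, not the original poster's code.

```python
# The bytes a UTF-8 encoded source file would contain for the literal.
raw = "Sömethüng".encode("utf-8")

# What Python 2 does behind the scenes before encoding a byte string:
# decode it with the default ASCII codec, which fails on bytes >= 0x80.
try:
    raw.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# Decoding with the codec the bytes are really in works, and encoding
# the resulting text again gives back the same bytes.
text = raw.decode("utf-8")
reencoded = text.encode("utf-8")
```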

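Yes, your strings (the filenames) come from the OS, but that has been the problem from the start of this story.

Even if the encoding goes through correctly, there is still ZIP itself to deal with.

According to the specification, ZIP should store filenames in CP437, but this is rarely the case.

Most archivers use the default OS encoding (MBCS in Python).

And most archivers do not support UTF-8. So what I propose here should work, but not with every archiver.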
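To see why the encoding matters even after the bytes are stored correctly, the Python 3 sketch below shows what a reader that assumes CP437 would make of a UTF-8 encoded name. The name is made up; no real archiver is involved, this only decodes the same bytes with two codecs.

```python
name = "für.doc"

# Bytes as they would be stored for a UTF-8 filename.
stored = name.encode("utf-8")

# A reader that assumes CP437 turns the two bytes of "ü" into two
# unrelated box-drawing characters.
seen_as_cp437 = stored.decode("cp437")

# Nothing is lost: CP437 maps every byte, so the mistake can be undone.
recovered = seen_as_cp437.encode("cp437").decode("utf-8")
```

This is the kind of garbled name that shows up when writer and reader disagree about the filename encoding.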

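To tell a ZIP archiver that the archive uses UTF-8 filenames, bit 11 of flag_bits (mask 0x800) has to be set. As I said, some archivers do not check that bit. It is a fairly recent addition to the ZIP specification. (Well, it was added a few years ago, really.)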
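For comparison, in Python 3 the zipfile module sets this bit on its own when a name is not pure ASCII. The sketch below, which uses an in-memory archive and made-up member names, writes two entries and then reopens the archive to read the flag back from each entry.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose flags

# Write one ASCII-named and one non-ASCII-named member into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"ascii name")
    zf.writestr("şğı_ü.txt", b"non-ASCII name")

# Reopen the archive and check the flag as stored for each member.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in zf.infolist()
    }
```

Here flags maps each stored name to whether the UTF-8 bit was set for it.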

I will not write the complete code here, only the part you need to understand the idea.

# -*- coding: utf-8 -*-
# Cannot hurt to have default encoding set to UTF-8 all the time. :D

import os, time, zipfile
zip = zipfile.ZipFile(...)
# Careful here, origname is the full path to the file you will store into ZIP
# filename is the filename under which the file will be stored in the ZIP
# It'll probably be better if filename is not a full path, but relative, not to introduce problems when extracting. You decide.
filename = origname = os.path.join(root, filename)
# Filenames from OS can be already UTF-8, but they can be a local codepage.
# I will use MBCS here to decode from it, so that we can encode to UTF-8 later.
# I recommend getting codepage from OS (from kernel32.dll on Windows) manually instead of using MBCS, but for now:
if isinstance(filename, str): filename = filename.decode("mbcs")
# Else, assume it is already a decoded unicode string.
# Prepare the filename for archive:
filename = os.path.normpath(os.path.splitdrive(filename)[1])
while filename[0] in (os.sep, os.altsep):
    filename = filename[1:]
filename = filename.replace(os.sep, "/")
filename = filename.encode("utf-8") # Get what we need
zinfo = zipfile.ZipInfo(filename, time.localtime(os.path.getmtime(origname))[0:6])  # getmtime lives in os.path
# Here you should set zinfo.external_attr to store Unix permission bits and set the zinfo.compression_type
# Both are optional and not a subject to your problem. But just as notice.
zinfo.flag_bits |= 0x800 # Set 11th bit to 1, announce the UTF-8 filenames.
with open(origname, "rb") as f:  # closes the file even if writestr() fails
    zip.writestr(zinfo, f.read())

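I have not tested this, I just wrote the code down, but it conveys the idea even if a bug has crept in somewhere.

If this does not work, I do not know what will.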
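For readers on Python 3, where paths are unicode strings by default, the zipdir function from the question can be sketched as below. This is an assumption-laden illustration rather than a tested fix for the original Python 2 setup: the directory layout and file names are made up, and storing names relative to the walked directory is a design choice, not something the question asked for.

```python
import os
import tempfile
import zipfile


def zipdir(path, ziph):
    """Add every file under path to the open ZipFile ziph."""
    for root, dirs, files in os.walk(path):
        for file in files:
            current_file = os.path.join(root, file)
            # Store a name relative to the walked directory, not the full path.
            arcname = os.path.relpath(current_file, path)
            ziph.write(current_file, arcname)


# Small demonstration in a temporary directory (all names are made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "docs")
    os.mkdir(src)
    with open(os.path.join(src, "Türkçe_şğı_ü.doc"), "wb") as f:
        f.write(b"dummy content")

    archive = os.path.join(tmp, "backup.zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zipdir(src, zf)

    # Read the stored names back while the archive still exists.
    with zipfile.ZipFile(archive) as zf:
        stored_names = zf.namelist()
```

No explicit encode() or decode() call is needed in this version, because the names stay unicode strings from os.walk all the way into the archive.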

Answer 1
0 2015-10-24 11:53:58