Read Unicode characters from command-line arguments in Python 2.x on Windows

I want my Python script to be able to read Unicode command-line arguments in Windows. But it appears that sys.argv holds byte strings encoded in some local encoding, rather than Unicode strings. How can I read the command line in full Unicode?

Example code: argv.py

import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)

On my PC, which is set up for the Japanese code page, I get:

C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>

That's Shift-JIS encoded, I believe, and it "works" for that filename. But it breaks for filenames with characters that aren't in the Shift-JIS character set, and the final "open" call fails:

C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'
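
The two results above can be reproduced on any platform with Python's own codecs. This is only a sketch under a few assumptions: `cp932` is Python's name for the Windows Japanese code page, the hex string is copied from the first run above, and the snippet is written to run under both Python 2.x and 3.x. Python's `replace` error handler emits `?` for a character the code page lacks, whereas Windows apparently applied a best-fit substitution (`o`) in the output shown above.

```python
import binascii

# Hex dump printed by argv.py for the first file name above.
raw = binascii.unhexlify(
    "50438145835c83748367905c90bf8f9130382e30392e32342e646f63")

# Every character of that name exists in cp932, so decoding and
# re-encoding gives back exactly the same bytes.
decoded = raw.decode("cp932")
round_trip = decoded.encode("cp932")
print(round_trip == raw)

# o-umlaut (U+00F6) has no cp932 byte sequence, so the conversion is lossy
# and the original file name cannot be recovered from the bytes.
lossy = u"J\u00f6rgen.txt".encode("cp932", "replace")
print(repr(lossy))
```

Once the character has been substituted, no amount of decoding on the Python side can bring it back, which is why the `open` call fails.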

Note: I'm talking about Python 2.x, not Python 3.0. I've found that Python 3.0 gives sys.argv as proper Unicode. But it's a bit early yet to transition to Python 3.0 (due to lack of 3rd party library support).

Update:

A few answers have said I should decode according to whatever encoding sys.argv is in. The problem with that is that the encoding doesn't cover all of Unicode, so some characters are not representable.

Here's the use case that gives me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names with all sorts of characters, including some not in the system default code page. When the characters aren't representable in the current code page encoding, my Python script doesn't get the right Unicode filenames passed to it via sys.argv.

There is certainly some Windows API to read the command line with full Unicode (and Python 3.0 does it). I assume the Python 2.x interpreter is not using it.

Here is a solution that is just what I'm looking for, making calls to the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)

But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
      http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""


import sys

def win32_unicode_argv():
    """Uses shell32.CommandLineToArgvW to get sys.argv as a list of Unicode
    strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing characters
    that the current code page cannot represent with '?' or a similar
    substitute.
    """

    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()
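
The only non-obvious step in the module is the `start = argc.value - len(sys.argv)` arithmetic. A minimal sketch of the idea, runnable without Windows: the helper name `trim_to_script_args` and both argument lists are made up for illustration, standing in for what CommandLineToArgvW and Python 2's sys.argv might contain in one run.

```python
def trim_to_script_args(full_argv, script_argv_len):
    """Hypothetical helper: keep only the trailing entries that
    correspond to the script name and its arguments."""
    start = len(full_argv) - script_argv_len
    return full_argv[start:]

# Assumed full command line, including the interpreter and its options.
full = [u"C:\\Python27\\python.exe", u"-u", u"argv.py", u"J\u00f6rgen.txt"]

# What Python 2 might put in sys.argv for the same run: the interpreter
# and its options are already gone, and the o-umlaut has been degraded.
narrow = ["argv.py", "Jorgen.txt"]

trimmed = trim_to_script_args(full, len(narrow))
print(len(trimmed))
```

Because the interpreter always strips its own entries from the front, counting back from the end lines the Unicode list up with the original sys.argv.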

Now, the way I use it is simply to do:

import sys
import win32_unicode_argv

and from then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
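
As a rough illustration of the optparse remark, the sketch below feeds a list of Unicode strings to `OptionParser`. The `-o`/`--output` option and the argument values are invented for the example, and the list is a stand-in for the Unicode sys.argv, so the snippet also runs where the module above is unavailable.

```python
from optparse import OptionParser

parser = OptionParser()
# Hypothetical option, only here to show a Unicode value passing through.
parser.add_option("-o", "--output", dest="output")

# Stand-in for sys.argv after importing win32_unicode_argv.
unicode_argv = [u"script.py", u"-o", u"J\u00f6rgen.txt", u"input.txt"]

# Skip element 0 (the script name), as optparse does with sys.argv itself.
options, args = parser.parse_args(unicode_argv[1:])

print(options.output == u"J\u00f6rgen.txt")
print(args == [u"input.txt"])
```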

Dealing with encodings is very confusing.

I believe that if you're inputting data via the command line, it will be encoded in whatever your system encoding is, not as unicode. (Even copy/paste should do this.)

So it should be correct to decode into unicode using the system encoding:

import codecs
import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)

first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)

f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())

Running the following:

Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

will output:

PC・ソフト申請書08.09.24.txt
<type 'str'>
<type 'unicode'>
PC・ソフト申請書08.09.24.txt
<type 'unicode'>
?日本語

Where the file "PC・ソフト申請書08.09.24.txt" contained the text "日本語". (I encoded the file as UTF-8 using Windows Notepad. I'm a little stumped as to why there's a '?' at the beginning when printing. Something to do with how Notepad saves UTF-8?)
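
The stray '?' is most likely the UTF-8 byte order mark (BOM) that Notepad writes at the start of the file: it decodes to U+FEFF, which the console code page cannot represent. A small sketch, assuming the file really does start with a BOM and using `cp932` as a stand-in for the Japanese system encoding; Python's `utf-8-sig` codec strips a leading BOM if one is present.

```python
import codecs

# Bytes as Notepad would presumably have saved them: BOM + UTF-8 text.
data = codecs.BOM_UTF8 + u"\u65e5\u672c\u8a9e".encode("utf-8")

with_bom = data.decode("utf-8")         # U+FEFF survives as first character
without_bom = data.decode("utf-8-sig")  # leading BOM is dropped

print(with_bom[0] == u"\ufeff")

# U+FEFF has no cp932 equivalent, so it turns into '?' when encoded.
first_byte = with_bom.encode("cp932", "replace")[:1]
print(repr(first_byte))
```

Passing 'utf-8-sig' instead of 'utf-8' to codecs.open in the script above should therefore make the '?' disappear.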

A string's decode method or the unicode() builtin can be used to convert an encoded byte string into unicode.

unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')

Also, if you're dealing with encoded files, you may want to use the codecs.open() function in place of the built-in open(). It allows you to define the encoding of the file, and will then use the given encoding to transparently decode the content to unicode.

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be in unicode.

codecs.open: http://docs.python.org/library/codecs.html?#codecs.open
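
A self-contained sketch of the codecs.open() round trip described above. The temporary directory and the sample text are only for the demonstration; "myfile.txt" is the name used in the sentence above.

```python
import codecs
import os
import tempfile

# Work in a throwaway directory so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

# Writing: unicode in, UTF-8 bytes on disk.
with codecs.open(path, "w", "utf8") as f:
    f.write(u"\u65e5\u672c\u8a9e")

# The built-in open() in binary mode shows the raw encoded bytes.
with open(path, "rb") as f:
    raw = f.read()

# Reading: UTF-8 bytes on disk, unicode out.
with codecs.open(path, "r", "utf8") as f:
    content = f.read()

print(len(raw), len(content))
```

Three characters come back from nine bytes, since each of these characters takes three bytes in UTF-8.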

If I'm misunderstanding something, please let me know.

If you haven't already, I recommend reading Joel's article on unicode and encoding: http://www.joelonsoftware.com/articles/Unicode.html

Try this:

import sys
print repr(sys.argv[1].decode('UTF-8'))

Maybe you have to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP
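
To see why the choice of code page matters, here is a small sketch. The byte string is an assumed example of what an argument might look like if it arrived in CP1252; it is not taken from the question's output.

```python
# "J<o-umlaut>rgen.txt" encoded in CP1252 (assumed sample input).
raw = b"J\xf6rgen.txt"

# The byte 0xF6 is not valid UTF-8, so decoding as UTF-8 raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

as_cp1252 = raw.decode("cp1252")  # o-umlaut, U+00F6
as_cp437 = raw.decode("cp437")    # division sign, U+00F7

print(utf8_ok)
print(as_cp1252 == as_cp437)
```

A wrong single-byte code page does not fail, it silently yields a different character, so the encoding name has to match what the system actually used.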

The command line might be in Windows encoding. Try decoding the arguments into unicode objects:

args = [unicode(x, "iso-8859-9") for x in sys.argv]
