
Python, Windows console and encodings (cp850 vs cp1252)

I thought I knew everything about encodings and Python, but today I came across a weird problem: although the console is set to code page 850, and Python reports it correctly, parameters I put on the command line seem to be encoded in code page 1252. If I try to decode them with sys.stdin.encoding, I get the wrong result. If I assume 'cp1252', ignoring what sys.stdout.encoding reports, it works.

Am I missing something, or is this a bug in Python? In Windows? Note: I am running Python 2.6.6 on Windows 7 EN, locale set to French (Switzerland).

In the test program below, I check that literals are correctly interpreted and can be printed. This works. But all values I pass on the command line seem to be encoded wrongly:

#!/usr/bin/python
# -*- encoding: utf-8 -*-
import sys

literal_mb = 'utf-8 literal:   üèéÃÂç€ÈÚ'
literal_u = u'unicode literal: üèéÃÂç€ÈÚ'
print "Testing literals"
print literal_mb.decode('utf-8').encode(sys.stdout.encoding,'replace')
print literal_u.encode(sys.stdout.encoding,'replace')

print "Testing arguments ( stdin/out encodings:",sys.stdin.encoding,"/",sys.stdout.encoding,")"
for i in range(1,len(sys.argv)):
    arg = sys.argv[i]
    print "arg",i,":",arg
    for ch in arg:
        print "  ",ch,"->",ord(ch),
        if ord(ch)>=128 and sys.stdin.encoding == 'cp850':
            print "<-",ch.decode('cp1252').encode(sys.stdout.encoding,'replace'),"[assuming input was actually cp1252 ]"
        else:
            print ""

In a newly created console, when running

C:\dev>test-encoding.py abcé€

I get the following output

Testing literals
utf-8 literal:   üèéÃÂç?ÈÚ
unicode literal: üèéÃÂç?ÈÚ
Testing arguments ( stdin/out encodings: cp850 / cp850 )
arg 1 : abcÚÇ
   a -> 97
   b -> 98
   c -> 99
   Ú -> 233 <- é [assuming input was actually cp1252 ]
   Ç -> 128 <- ? [assuming input was actually cp1252 ]

whereas I would expect the 4th character to have an ordinal value of 130 instead of 233 (see code pages 850 and 1252).

Notes: the value of 128 for the euro symbol is a mystery, since cp850 does not have it. Otherwise, the '?' characters are expected: cp850 cannot print those characters and I have used 'replace' in the conversions.
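The byte values discussed above can be checked directly against Python's own codec tables. This is a minimal sketch written for Python 3 (not the Python 2.6 used in the question). The helper names byte_values and misread are made up for illustration. It suggests that 233 and 128 are simply the cp1252 values of 'é' and '€', and that a cp850 console displays those same bytes as 'Ú' and 'Ç'.

```python
def byte_values(text, encoding):
    """Return the byte values of `text` in `encoding` (63, i.e. '?', where unencodable)."""
    return list(text.encode(encoding, "replace"))


def misread(text, written_as, shown_as):
    """Encode `text` with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(shown_as)


E_ACUTE = "\u00e9"  # é
EURO = "\u20ac"     # €

print(byte_values(E_ACUTE, "cp850"))   # [130]
print(byte_values(E_ACUTE, "cp1252"))  # [233]
print(byte_values(EURO, "cp1252"))     # [128]
print(byte_values(EURO, "cp850"))      # [63]  (cp850 has no euro sign)

# cp1252 bytes shown by a cp850 console: 'abcé€' becomes 'abcÚÇ'
print(misread("abc" + E_ACUTE + EURO, "cp1252", "cp850") == "abc\u00da\u00c7")  # True
```

Under that reading, the 128 is not a cp850 value at all: it is the cp1252 byte for the euro sign, which the console then renders with its cp850 glyph.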

If I change the code page of the console to 1252 by issuing chcp 1252 and run the same command, I (correctly) obtain

Testing literals
utf-8 literal:   üèéÃÂç€ÈÚ
unicode literal: üèéÃÂç€ÈÚ
Testing arguments ( stdin/out encodings: cp1252 / cp1252 )
arg 1 : abcé€
   a -> 97
   b -> 98
   c -> 99
   é -> 233
   € -> 128

Any ideas what I'm missing?

Edit 1: I've just tested by reading sys.stdin. This works as expected: in cp850, typing 'é' results in an ordinal value of 130. So the problem really affects the command line only. So, is the command line treated differently from the standard input?

Edit 2: It seems I had the wrong keywords. I found another very close topic on SO: "Read Unicode characters from command-line arguments in Python 2.x on Windows". Still, if the command line is not encoded like sys.stdin, and since sys.getdefaultencoding() reports 'ascii', it seems there is no way to know its actual encoding. I find the answer using win32 extensions pretty hacky.

Replying to myself:

On Windows, the encoding used by the console (thus, that of sys.stdin/out) differs from the encoding of various OS-provided strings, obtained through e.g. os.getenv(), sys.argv, and certainly many more.

The encoding provided by sys.getdefaultencoding() is really that: a default, chosen by Python developers to match the "most reasonable encoding" the interpreter uses in extreme cases. I get 'ascii' on my Python 2.6, and tried with portable Python 3.1, which yields 'utf-8'. Neither is what we are looking for: they are merely fallbacks for encoding conversion functions.
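To see how these values differ on a given machine, they can be listed side by side. This is a small sketch for Python 3. The function name report_encodings and the dictionary keys are illustrative, and sys.getfilesystemencoding() is an extra data point that the answer itself does not discuss. locale.getpreferredencoding() is the call this answer turns to further down.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports (keys are illustrative labels)."""
    return {
        # Console / stream encodings; a stream may be missing or redirected.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        # Fallback used by implicit str/bytes conversions.
        "default": sys.getdefaultencoding(),
        # Encoding Python uses for file names (not covered in the answer).
        "filesystem": sys.getfilesystemencoding(),
        # Locale-based preference; the ANSI code page on Windows.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in sorted(report_encodings().items()):
    print("%-10s %s" % (name, value))
```

On the setup described in the question, one would expect the stream entries to follow the console code page and the preferred entry to follow the ANSI code page, but the exact values depend on the Python version and on how the script is launched.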

As this page seems to state, the encoding used by OS-provided strings is governed by the Active Code Page (ACP). Since Python does not have a native function to retrieve it, I had to use ctypes:

from ctypes import cdll
os_encoding = 'cp' + str(cdll.kernel32.GetACP())

Edit: But as Jacek suggests, there actually is a more robust and Pythonic way to do it (the semantics would need validation, but until proven wrong, I'll use this):

import locale
os_encoding = locale.getpreferredencoding()
# This returns 'cp1252' on my system, yay!

and then

u_argv = [x.decode(os_encoding) for x in sys.argv]
u_env = os.getenv('myvar').decode(os_encoding)

On my system, os_encoding = 'cp1252', so it works. I am quite certain this would break on other platforms, so feel free to edit and make it more generic. We would certainly need some kind of translation table between the ACP reported by Windows and the Python encoding name, something better than just prepending 'cp'.
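One way to make the 'cp' prefix less fragile is to validate the candidate name against Python's codec registry and fall back to the locale's preferred encoding when no codec matches. This is a sketch for Python 3, not a complete translation table. The helper names acp_to_codec_name, os_encoding and _normalize are hypothetical, and the special case for 65001 rests on the assumption that this code page should be treated as UTF-8.

```python
import codecs
import locale
import sys


def _normalize(name):
    """Return Python's canonical codec name, or None if no codec matches."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def acp_to_codec_name(acp):
    """Map a Windows code page number to a Python codec name (hypothetical helper)."""
    # Assumption: code page 65001 is treated as UTF-8.
    candidate = "utf-8" if acp == 65001 else "cp%d" % acp
    fallback = locale.getpreferredencoding(False)
    return _normalize(candidate) or _normalize(fallback) or fallback


def os_encoding():
    """Best guess at the encoding of OS-provided strings on this platform."""
    if sys.platform == "win32":
        import ctypes  # only needed for the Windows branch
        return acp_to_codec_name(ctypes.windll.kernel32.GetACP())
    fallback = locale.getpreferredencoding(False)
    return _normalize(fallback) or fallback


print(acp_to_codec_name(1252))   # cp1252
print(acp_to_codec_name(850))    # cp850
print(acp_to_codec_name(65001))  # utf-8
print(os_encoding())
```

Note that in Python 3, sys.argv and os.getenv() already return text, so an explicit decode like the one above is mainly relevant to Python 2 or to raw bytes obtained from other sources.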

This is unfortunately a hack, although I find it a bit less intrusive than the one suggested by this ActiveState Code Recipe (linked to by the SO question mentioned in Edit 2 of my question). The advantage I see here is that this can be applied to os.getenv(), and not only to sys.argv.

I tried the solutions. They may still leave some encoding problems. We need to use TrueType fonts. Fix:

  1. Run chcp 65001 in cmd to change the encoding to UTF-8.
  2. Change the cmd font to a TrueType one, like Lucida Console, that supports the code pages in use before 65001.

Here's my complete fix for the encoding error:

def fixCodePage():
    import os  # needed for the os.system() calls below
    import sys
    import ctypes
    if sys.platform == 'win32':
        if sys.stdout.encoding != 'cp65001':
            os.system("echo off")
            os.system("chcp 65001") # Change active page code
            sys.stdout.write("\x1b[A") # Removes the output of chcp command
            sys.stdout.flush()
        LF_FACESIZE = 32
        STD_OUTPUT_HANDLE = -11
        class COORD(ctypes.Structure):
            _fields_ = [("X", ctypes.c_short), ("Y", ctypes.c_short)]

        class CONSOLE_FONT_INFOEX(ctypes.Structure):
            _fields_ = [("cbSize", ctypes.c_ulong),
            ("nFont", ctypes.c_ulong),
            ("dwFontSize", COORD),
            ("FontFamily", ctypes.c_uint),
            ("FontWeight", ctypes.c_uint),
            ("FaceName", ctypes.c_wchar * LF_FACESIZE)]

        font = CONSOLE_FONT_INFOEX()
        font.cbSize = ctypes.sizeof(CONSOLE_FONT_INFOEX)
        font.nFont = 12
        font.dwFontSize.X = 7
        font.dwFontSize.Y = 12
        font.FontFamily = 54
        font.FontWeight = 400
        font.FaceName = "Lucida Console"
        handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        ctypes.windll.kernel32.SetCurrentConsoleFontEx(handle, ctypes.c_long(False), ctypes.pointer(font))

Note: You can see a font change while executing the program.

Well, what worked for me was using the following code snippet:

# -*- coding: utf-8 -*-

import os
import sys

print (f"OS: {os.device_encoding(0)}, sys: {sys.stdout.encoding}")

Comparing both on some Windows systems with Python 3.8 showed that os.device_encoding(0) always reflected the code page setting in the terminal. (Tested with PowerShell and with the old cmd shell on Windows 10 and Windows 7.)

This was true even after changing the terminal's code page with a shell command:

chcp 850

or e.g.:

chcp 1252

Now using os.device_encoding(0) for tasks like decoding a subprocess stdout result from bytes to string worked out even with non-ASCII chars like é, ö, ³, ↓, ...
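The decoding step described above might look like the following sketch (Python 3). The helper names console_encoding and run_and_decode are made up for illustration. os.device_encoding() returns None when the file descriptor is not attached to a terminal, so the sketch assumes that falling back to locale.getpreferredencoding() is acceptable in that case.

```python
import locale
import os
import subprocess
import sys


def console_encoding(fd=0):
    """Encoding of the terminal attached to `fd`, or the locale's preference."""
    # os.device_encoding() yields None for pipes, files, and other non-terminals.
    return os.device_encoding(fd) or locale.getpreferredencoding(False)


def run_and_decode(cmd):
    """Run `cmd`, capture its stdout as bytes, and decode it for display."""
    raw = subprocess.check_output(cmd)
    # 'replace' avoids an exception if the child used a different encoding.
    return raw.decode(console_encoding(), "replace")


print(console_encoding())
print(run_and_decode([sys.executable, "-c", "print('hello')"]).strip())  # hello
```

Whether a child process actually writes in the console code page depends on the program being run, so the result is worth checking against a known non-ASCII string before relying on it.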

So, as others have already pointed out, on Windows the locale setting is really just system information about user preferences, not what the shell may actually be using at the moment.
