
Why can't Python accept Unicode input on the Windows console?

I have a Python file named 'কাজ.py'. The file does nothing fancy, and its contents are not the issue. The problem is that when I try to run the file by copying and pasting its name into the console, the console does not show 'কাজ.py' but some boxes instead:

> python [?][?][?].py

and Python fails with this error:

python: can't open file '???.py': [Errno 22] Invalid argument

But on the same console, if I type git add কাজ.py, it shows

> git add [?][?][?].py

but surprisingly it works and does not give any error.

My question is: how can git take Unicode input on the same console where Python cannot? Please note that I am on Windows and using cmd.exe.

It depends on whether the command internally uses the UNICODE or the MBCS command-line interface. Assuming it is a C (or C++) program, it depends on whether its entry point is main or wmain. If it uses the Unicode interface, it receives the true Unicode characters (even if the console cannot display them and only shows ?) and therefore opens the correct file. But if it uses the so-called MBCS interface, characters with no equivalent in the active code page (typically those with a code above 255) are replaced by a literal ? (character code 0x3F), and the program tries to open the wrong file.

The difference in behaviour simply shows that your git implementation is Unicode compatible while your Python version (I assume 2.x) is not. Untested, but I think that Python 3 is natively Unicode compatible on Windows.

Here is a small C program that demonstrates what happens:

#include <stdio.h>
#include <windows.h>
#include <tchar.h>

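/* _tmain expands to wmain when _UNICODE is defined and to main otherwise.
   LPTSTR follows UNICODE in the same way (wchar_t* or char*), so define
   both macros together for the UNICODE build and neither for the MBCS build. */
int _tmain(int argc, LPTSTR argv[]) {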
    int i;

    _tprintf(_T("Arguments"));
    for(i=0; i<argc; i++) {
        _tprintf(_T(" >%s<"), argv[i]);
    }
    _tprintf(_T("\n"));

    if (argc > 1) {
        LPCTSTR ix = argv[1];
        _tprintf(_T("Dump param 1 :"));
        while (*ix != 0) {
            _tprintf(_T(" %c(%x)"), *ix, ((unsigned int) *ix) & 0xffff);
            ix += 1;
        }
        _tprintf(_T("\n"));
    }
    return 0;
}

If you call it as cmdline abকাজcd (pasting the কাজ characters into the console), you see:

...>cmdline ab???cd
Arguments >cmdline< >ab???cd<
Dump param 1 : a(61) b(62) ?(3f) ?(3f) ?(3f) c(63) d(64)

when built in MBCS mode and

...>cmdline ab???cd
Arguments >cmdline< >ab???cd<
Dump param 1 : a(61) b(62) ?(995) ?(9be) ?(99c) c(63) d(64)

when built in UNICODE mode (the three characters কাজ are U+0995, U+09BE and U+099C respectively in Unicode).

As the information is lost in the C runtime code that processes the command-line arguments, nothing can be done afterwards to recover it. So your only option is to switch to Python 3 if you want to be able to use Unicode names for your scripts.
