
Maximum number of characters output from Win32 ToUnicode()/ToAscii()

What is the maximum number of characters that could be output from the Win32 functions ToUnicode()/ToAscii()?

Surely there is a sensible upper bound on what it can output given a virtual key code, scan key code, and keyboard state?

On my Windows 8 machine, USER32!ToAscii calls USER32!ToUnicode with an internal buffer and cchBuff set to 2. Because the output of ToAscii is an LPWORD and not an LPSTR, we cannot assume anything about the real limits of ToUnicode from this investigation, but we know that ToAscii is always going to output a single WORD. The return value tells you whether 0, 1 or 2 bytes of this WORD contain useful data.

Moving on to ToUnicode, things get a bit trickier. If it returns 0 then nothing was written. If it returns 1 or -1 then one UTF-16 code unit was written. We are then left with the strange "2 ≤ value" case. We can try to dissect the MSDN documentation:

Two or more characters were written to the buffer specified by pwszBuff. The most common cause for this is that a dead-key character (accent or diacritic) stored in the keyboard layout could not be combined with the specified virtual key to form a single character. However, the buffer may contain more characters than the return value specifies. When this happens, any extra characters are invalid and should be ignored.

You could interpret this as "two or more characters were written but only two of them are valid", but then the return value should be documented as 2 and not "2 ≤ value".

I believe there are two things going on in that sentence and we should eliminate what it calls "extra characters":

However, the buffer may contain more characters than the return value specifies.

This just implies that the function may party on your buffer beyond what it is actually going to return as valid. This is confirmed by:

When this happens, any extra characters are invalid and should be ignored.

This just leaves us with the unfortunate opening sentence:

Two or more characters were written to the buffer specified by pwszBuff.

I have no problem imagining a return value of 2; it can be as simple as a base character combined with a diacritic that does not exist as a pre-composed code point.

The "or more" part could come from multiple sources. If the base character is encoded as a surrogate pair, then any additional diacritic/combining character will push you over 2. There could also simply be more than one diacritic/combining character on the base character. There might even be a leading LTR/RTL mark.
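To make the arithmetic concrete, here is a minimal sketch that counts UTF-16 code units for a sequence of code points. The function names (`utf16_encode`, `utf16_length`) and the example sequence are mine, chosen for illustration; nothing here claims that a real keyboard layout produces this exact sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t *out)
{
    /* Must be a valid scalar value (not a lone surrogate, not > U+10FFFF). */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {              /* inside the BMP: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                   /* outside the BMP: surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}

/* Total number of UTF-16 code units needed for n code points. */
size_t utf16_length(const uint32_t *cps, size_t n)
{
    uint16_t tmp[2];
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += utf16_encode(cps[i], tmp);
    return total;
}
```

For a hypothetical sequence of a right-to-left mark (U+200F), a supplementary-plane base character (U+1D400) and two combining marks (U+0301, U+0308), `utf16_length` gives 1 + 2 + 1 + 1 = 5 code units, which is already more than double the "2" in the documentation.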

I don't know if it is possible to end up with all 3 conditions at the same time but I would play it safe and specify a buffer of 10 or so WCHARs. This should be well within the limits of what you can produce on a keyboard with "a single keystroke".
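A rough sketch of that advice, assuming the return-value reading above (0 means nothing, a negative value means one dead-key character, a positive value is the number of valid characters). The names `KEY_BUF_UNITS`, `valid_units` and `translate_key` are hypothetical, and the clamp to the buffer capacity is purely defensive; I have not seen ToUnicode report more than cchBuff. The Win32 part is guarded so the helper can be compiled anywhere.

```c
#include <assert.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Buffer size suggested in the text: "10 or so" WCHARs (an assumption). */
enum { KEY_BUF_UNITS = 10 };

/* Map a ToUnicode() return value to the number of buffer slots worth
 * reading. Anything in the buffer past this count should be ignored. */
int valid_units(int rc, int cap)
{
    assert(cap >= 0);
    if (rc == 0)
        return 0;                    /* no translation */
    if (rc < 0)
        return cap >= 1 ? 1 : 0;     /* dead key: one character written */
    return rc < cap ? rc : cap;      /* rc valid characters, clamped */
}

#ifdef _WIN32
/* Translate one key using the calling thread's current keyboard state.
 * Returns the number of valid WCHARs in out. Link with user32.lib. */
int translate_key(UINT vk, UINT scan, WCHAR out[KEY_BUF_UNITS])
{
    BYTE state[256];
    if (!GetKeyboardState(state))
        return 0;
    int rc = ToUnicode(vk, scan, state, out, KEY_BUF_UNITS, 0);
    return valid_units(rc, KEY_BUF_UNITS);
}
#endif
```

The caller then reads only the first `translate_key(...)` elements of the buffer and ignores the rest, which matches the "extra characters are invalid" remark in the documentation.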

This is by no means a final answer but it might be the best you are going to get unless somebody from Microsoft responds.

In keyboard-layout terminology, a ligature is a single key that outputs two or more UTF-16 code units. Note that some languages use scripts that are outside of the BMP (Basic Multilingual Plane), so their characters have to be realized entirely as ligatures of surrogate pairs (two UTF-16 code units each).

Looking at the practical side of things: here is a list of Windows system keyboard layouts that use ligatures.

51 out of 208 system layouts have ligatures

So from the practical side, we can already get up to 4 wchar_t from one ToUnicode() call (one keypress).

From a theoretical perspective, we can look at kbd.h in the Windows SDK, where the underlying keyboard layout structures are defined:


/*
 * Macro for ligature with "n" characters
 */
#define TYPEDEF_LIGATURE(n) typedef struct _LIGATURE##n {     \
                                    BYTE  VirtualKey;         \
                                    WORD  ModificationNumber; \
                                    WCHAR wch[n];             \
                                } LIGATURE##n, *KBD_LONG_POINTER PLIGATURE##n;

/*
 * Table element types (for various numbers of ligatures), used
 * to facilitate static initializations of tables.
 *
 * LIGATURE1 and PLIGATURE1 are used as the generic type
 */
TYPEDEF_LIGATURE(1) // LIGATURE1, *PLIGATURE1;
TYPEDEF_LIGATURE(2) // LIGATURE2, *PLIGATURE2;
TYPEDEF_LIGATURE(3) // LIGATURE3, *PLIGATURE3;
TYPEDEF_LIGATURE(4) // LIGATURE4, *PLIGATURE4;
TYPEDEF_LIGATURE(5) // LIGATURE5, *PLIGATURE5;

typedef struct tagKbdLayer {
....

    /*
     * Ligatures
     */
    BYTE       nLgMax;
    BYTE       cbLgEntry;
    PLIGATURE1 pLigature;
....
} KBDTABLES, *KBD_LONG_POINTER PKBDTABLES;
  • nLgMax is the size of the LIGATURE##n.wch[n] array, i.e. the maximum number of characters in one ligature (it affects the size of each pLigature entry).
  • cbLgEntry is the size in bytes of each pLigature entry.

So nLgMax is a BYTE value, which means a ligature could theoretically be up to 255 wchar_t's (UTF-16 code units) long.
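To illustrate how those fields fit together, here is a sketch of a lookup over a ligature table. It is not the actual Windows implementation. The typedefs are local stand-ins for the ones in kbd.h, `lookup_ligature` and `sample_table` are made up for this example, and three things are assumptions based on my reading of the header and of layout sources: cbLgEntry is used as the byte stride between entries, short ligatures are padded with WCH_NONE (0xF000), and the table ends with an entry whose VirtualKey is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for the Windows types used by kbd.h. */
typedef uint8_t  BYTE;
typedef uint16_t WORD;
typedef uint16_t WCHAR;
#define WCH_NONE 0xF000   /* assumed padding value for unused slots */

#define TYPEDEF_LIGATURE(n) typedef struct _LIGATURE##n { \
                                BYTE  VirtualKey;         \
                                WORD  ModificationNumber; \
                                WCHAR wch[n];             \
                            } LIGATURE##n;
TYPEDEF_LIGATURE(1)
TYPEDEF_LIGATURE(4)

/* Invented sample data: nLgMax = 4, cbLgEntry = sizeof(LIGATURE4). */
const LIGATURE4 sample_table[] = {
    { 'A', 0, { 0x0061, 0x0301, WCH_NONE, WCH_NONE } },
    { 'B', 1, { 0xD835, 0xDC00, 0x0301, 0x0308 } },
    { 0,   0, { 0 } }     /* assumed terminator */
};

/* Find the entry for (vk, mod) and copy its characters into out.
 * Returns the number of UTF-16 code units copied, or 0 if not found. */
int lookup_ligature(const void *table, BYTE nLgMax, BYTE cbLgEntry,
                    BYTE vk, WORD mod, WCHAR *out, int cap)
{
    assert(table != NULL && out != NULL && cbLgEntry > 0);
    const unsigned char *p = (const unsigned char *)table;

    for (;; p += cbLgEntry) {        /* entries are cbLgEntry bytes apart */
        BYTE entryVk = p[offsetof(LIGATURE1, VirtualKey)];
        WORD entryMod;
        if (entryVk == 0)
            return 0;                /* end of table */
        memcpy(&entryMod, p + offsetof(LIGATURE1, ModificationNumber),
               sizeof entryMod);
        if (entryVk != vk || entryMod != mod)
            continue;

        int n = 0;
        for (int i = 0; i < nLgMax && n < cap; i++) {
            WCHAR w;
            memcpy(&w, p + offsetof(LIGATURE1, wch) + i * sizeof(WCHAR),
                   sizeof w);
            if (w == WCH_NONE)
                break;               /* rest of the entry is padding */
            out[n++] = w;
        }
        return n;
    }
}
```

With the sample table, looking up ('B', 1) yields 4 code units and ('A', 0) yields 2, so the number of characters one key can produce is bounded by nLgMax rather than by anything in the ToUnicode() signature.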
