
How to enter non-BMP Unicode (hexadecimal code points with more than 4 digits) as input to Mathematica

Problem description: Mathematica uses "\:nnnn" as the syntax for Unicode input. For example, if we enter "\:6c34", we get "水" ("water" in Chinese). But what if one wants to enter "\:1f618" (face throwing a kiss)? When I tried this, I got "ὡ8", not a face throwing a kiss. So Mathematica evaluates "\:1f61" before I have entered the "8".
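The behaviour described above can be checked with ToCharacterCode. This is only a small sketch; it assumes nothing beyond the observation in the question that "\:" consumes exactly four hexadecimal digits:

```mathematica
(* "\:nnnn" takes exactly four hexadecimal digits. *)
ToCharacterCode["\:6c34"]
(* {27700}, i.e. the single character U+6C34 *)

(* The fifth digit is left over as an ordinary character "8". *)
ToCharacterCode["\:1f618"]
(* {8033, 56}, i.e. U+1F61 followed by the character "8" *)
```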

Question: How can we delay this evaluation, or more generally, how can we enter any Unicode character whose hexadecimal code has more than 4 digits?

Software and hardware platform: I am running Mathematica 8 on an Intel Mac. I tried both the command-line version of Mathematica and the Mathematica notebook; they behave the same.

Thank you.


Reflections: Unicode is an extensible standard and it can grow (and it does grow :)). Software systems that implement this standard may implement only a subset of it (8-bit, 16-bit or 32-bit encoding) and still be valid and useful. As the user of a software package, one should not assume that because the software says it supports Unicode, it supports the whole of Unicode.

Short answer: You can't do this because Mathematica doesn't support these characters properly. See the end of the post for some workarounds.

Just to clear up some things:

There's no need for a 32-bit encoding to handle more than ~65000 Unicode characters. The most common encodings used for Unicode, UTF-8 and UTF-16, are multibyte encodings, meaning that a variable number of bytes is used to represent each character. UTF-16 uses either 2 or 4 bytes per character. The Mathematica kernel interprets every 2-byte unit as a single character in a string, which occasionally produces invalid characters (whenever it encounters a 4-byte sequence). This may be considered a bug. The front end is quite moody about how it handles 4-byte sequences, which is definitely a bug.
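To make the 2-or-4-byte point concrete, here is a minimal sketch of the standard UTF-16 surrogate-pair computation, applied to the character from the question. The name toSurrogatePair is hypothetical (it is not a built-in); the arithmetic is the one defined by UTF-16 itself:

```mathematica
(* Hypothetical helper, not a built-in: split a code point above U+FFFF
   into its UTF-16 high and low surrogate code units. *)
toSurrogatePair[cp_Integer] /; 16^^10000 <= cp <= 16^^10ffff :=
  With[{v = cp - 16^^10000},
    {16^^d800 + Quotient[v, 2^10], 16^^dc00 + Mod[v, 2^10]}]

(* U+1F618 needs two 16-bit units, i.e. 4 bytes. *)
IntegerString[#, 16] & /@ toSurrogatePair[16^^1f618]
(* {"d83d", "de18"} *)
```

A kernel that treats every 16-bit unit as one character would see these two units as two separate (and individually invalid) characters.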

Limited workaround

When working strictly in the kernel (e.g. reading Unicode data from a file), I sometimes use this function as a workaround to get the actual Unicode code point of a 2-unit (4-byte) UTF-16 sequence:

toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff := (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

You can use

Split[ToCharacterCode[str], If[16^^d800 <= # <= 16^^dbff, True] &]

to split a UTF-16 string into Unicode characters correctly (either length-one or length-two, depending on the character).
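The two pieces can be combined roughly as follows. This is only a sketch: utf16ToCodePoints is a hypothetical name, the input is given as a list of 16-bit code units (what ToCharacterCode returns under the kernel behaviour described here) so that the result does not depend on how a particular version handles strings, and the input is assumed to be well-formed UTF-16 (every high surrogate is followed by a low surrogate).

```mathematica
(* Same definition as in the answer, repeated so the block is self-contained. *)
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff :=
  (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

(* Hypothetical helper: list of UTF-16 code units -> list of code points.
   Split keeps a high surrogate together with the unit that follows it;
   pairs are then decoded and single units are passed through unchanged. *)
utf16ToCodePoints[units : {___Integer}] :=
  Replace[
    Split[units, 16^^d800 <= #1 <= 16^^dbff &],
    {pair : {_, _} :> toCodePoint[pair], {c_} :> c},
    {1}]

(* U+6C34, the surrogate pair for U+1F618, then "A" *)
utf16ToCodePoints[{16^^6c34, 16^^d83d, 16^^de18, 16^^41}]
(* {27700, 128536, 65} *)
```

Here 128536 is 16^^1f618, the code point from the question. The Split test is written as a plain inequality rather than with If; both forms group the same elements.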

This is an ugly and inconvenient workaround, and it won't allow you to display any of these characters in the front end unless you come up with some hack for that as well, e.g. importing the glyph reference images from unicode.org (at least for CJK they have them).

See also

See my earlier question on the same topic: Reading an UTF-8 encoded text file in Mathematica

If you are going to work with Chinese, you may come across this other problem too: Getting the Mathematica front end to obey the FontFamily option

According to this page in the Mathematica 8 help:

Mathematica supports both 8- and 16-bit raw character encodings.

Presumably they are saying that they don't support 32-bit encodings as would be needed to support your desired character.

As further evidence (in the absence of a clear statement in the documentation), the list of supported encodings on the same page has no 32-bit encodings. 32-bit encodings are apparently only supported in MathLink. I suppose there hasn't been enough user demand.


 