
unicode char value

Question: What is the correct order of Unicode extended symbols by value?

If I sort a list of Unicode characters in Excel, the order differs from what I get when I apply Excel's "=CODE()" function and sort by those values. My goal is to measure the distance between characters, for example a to b = 1 and & to % = 1. When sorted with Excel's sort function, two characters that end up within three positions of each other can have code values that are 134 apart.

Also, some characters appear blank in Excel, several are matched twice by 'Find' even though they are two different symbols, and a couple are not found at all. Please explain the details of these 'special' characters.

http://en.wikipedia.org/wiki/List_of_Unicode_characters

sample code:

int charDist = abs(alpha[index] - code[0]);

EDIT: To figure out the Unicode values in C++ (VS2008), I compared each code from 1 to 255 against code 1:

cout << mem << " code " << key << " is " << abs(key[0] - '\x01') << " from " << endl;

The character between the single quotes is code 1, written here as '\x01'. It is a black happy face that this website does not have the font for but the command window does; in VS2008 it looks like a half-post | with the right half of a T. Excel leaves a blank.

With the std library and #include in C++ (VS2008), the following codes are not handled: 9, 10, 13, 26, 34, 44.

Also, the numerical 'distance' is correct for codes 1 through 127, but at 128 the distance skips an extra value and is one further away than expected. Then from 128 to 255 the distance reverses and becomes closer; 255 is 2 away from 1.

It'd be nice if these followed something more logical and were just 1 through 255 without hiccups or skips and reversals, and 255-1 = 254 but hey, what do I know.

EDIT2: I found it. Without the absolute value, the collation for UNIFORMAT is 128 to 255, then 1 to 127, which yields 1 to 255 with the 6 skips for 9, 10, 13, 26, 34, 44 that are garbage. That was not intuitive. In the new order (128->255, 1->127) the strange skip from 127 to 128 is clearer: there is no 0, so a value is missing between 255 and 1.

SOLUTION: Make my own hash table with a value for each symbol, and do not rely on the C++ std library or VS2008 to provide the UNIFORMAT values, since they are not correct for measuring the char distance outside of several specific subsets of UNIFORMAT.
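The reversal described in the edits is what one would expect if char is a signed 8-bit type, which is the default in Visual C++: bytes 128 through 255 are then read as -128 through -1, so 255 becomes -1 and lands 2 away from 1. The sketch below is only an illustration under that assumption. The function names signedDistance and byteDistance are made up for this example and are not part of the original code.

```cpp
#include <cassert>
#include <cstdlib>

// Distance computed on signed 8-bit values, mimicking abs(key[0] - c)
// on a compiler where plain char is signed (an assumption about the setup).
int signedDistance(signed char a, signed char b) {
    return std::abs(static_cast<int>(a) - static_cast<int>(b));
}

// Distance after reinterpreting each byte as an unsigned value in 0..255,
// so the result no longer depends on whether plain char is signed.
int byteDistance(char a, char b) {
    return std::abs(static_cast<int>(static_cast<unsigned char>(a)) -
                    static_cast<int>(static_cast<unsigned char>(b)));
}
```

With the cast to unsigned char, byte 255 is 254 away from byte 1, which is the behaviour the question was hoping for. This only covers single-byte values; it says nothing about characters outside that range.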

Unicode code point values don't define a sort (or collation) order. When Excel sorts, it's using collation tables based on the currently selected language. For example, someone using Excel in English mode may get different sorting results than someone using Excel in Portuguese.
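To see how far apart the two orderings can be, here is a small sketch that sorts the same words once by raw code value and once with a toy comparator. The comparator is only a stand-in for a real collation table (it ignores ASCII case and nothing else), and the names sortByCodeValue and sortDictionaryStyle are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Order by raw code value, the way sorting on CODE() results would.
std::vector<std::string> sortByCodeValue(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    return words;
}

// Toy stand-in for a language-aware collation: compare ASCII text
// case-insensitively. A real collation table does far more than this.
static bool caseInsensitiveLess(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](unsigned char x, unsigned char y) {
            return std::tolower(x) < std::tolower(y);
        });
}

std::vector<std::string> sortDictionaryStyle(std::vector<std::string> words) {
    std::stable_sort(words.begin(), words.end(), caseInsensitiveLess);
    return words;
}
```

For the input banana, Cherry, apple, the code-value sort puts Cherry first because 'C' is 67 and 'a' is 97, while the dictionary-style sort gives apple, banana, Cherry. Two items that are neighbours in one ordering can therefore be far apart in the other.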

There are also issues of normalization. With Unicode, one "character" doesn't necessarily correspond to one value. Some characters can be represented in different ways. For example, a capital omega can be coded as a Greek letter or as a symbol for representing units of electrical resistance. In some languages, a single character may be composed from several consecutive values.
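As a concrete illustration of the omega case, the Greek capital letter omega is U+03A9 and the ohm sign is U+2126. They usually look identical, yet their values are thousands apart. Likewise, "é" can be the single value U+00E9 or the pair U+0065 U+0301. The sketch below just records those values; the constant names and the codePointDistance helper are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Two code points that usually render with the same glyph.
const char32_t kGreekCapitalOmega = U'\u03A9';
const char32_t kOhmSign = U'\u2126';

// The same visible character, precomposed and as base letter + combining accent.
const std::u32string kPrecomposed = U"\u00E9";
const std::u32string kDecomposed = U"e\u0301";

// Numeric distance between two code points.
long codePointDistance(char32_t a, char32_t b) {
    return std::labs(static_cast<long>(a) - static_cast<long>(b));
}
```

Here codePointDistance(kGreekCapitalOmega, kOhmSign) is 7549 even though a reader would call them the same symbol, and kDecomposed holds two values for what looks like one character. Any distance measure based on raw values inherits both effects unless the text is normalized first.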

The blank values probably correspond to glyphs that you don't have any font coverage for. Some systems use so-called "Unicode fonts" which have a large percentage of the glyphs you need for every script. Windows tends to switch fonts on the fly when the current font doesn't have a necessary glyph. Neither approach will have every glyph necessary. Also, some Unicode values don't encode to a visible glyph (e.g., there are many different kinds of spaces in Unicode), some values act more like ASCII-style control codes (e.g., paragraph separator or bidi controls), and some values only make sense when they combine with another character, like many of the "combining" accents.

So there's not an answer you're going to be satisfied with. Perhaps if you gave more information about what you're ultimately trying to do, we could suggest a different approach.

I don't think you can do what you want to do in Excel without limiting your approach significantly.

By experimentation, the Code function will never return a value higher than 255. If you use any Unicode text that cannot be generated via the VBA code below, it will be interpreted as a question mark (?), or 63.

For x = 1 To 255
    Cells(x, 1).Value = Chr(x)
Next

You should be able to determine the difference using Code. But if the character doesn't fall in that range, you'll need to go outside of Excel, because even VBA will convert any other Unicode characters to the question mark (?), or 63.
