
Unicode char value

Question: What is the correct order of Unicode extended symbols by value?

If I sort a list of Unicode chars with Excel's sort, the order is different than if I use the Excel "=CODE()" function and sort by those values. The purpose is that I want to measure the distance between chars, for example a to b = 1 and & to % = 1. When sorted with the Excel sort function, two chars that are ordered within three positions of each other appear to have values that are 134 apart.

Also, some char symbols are blank in Excel, several are found twice with 'find' even though they are two different symbols, and a couple are not found at all. Please explain the details of these 'special' chars.

http://en.wikipedia.org/wiki/List_of_Unicode_characters

Sample code:

int charDist = abs(alpha[index] - code[0]);

EDIT: To figure out the Unicode values in C++ (VS2008), I ran a comparison of each code from 1 to 255 against code 1:

cout << mem << " code " << key << " is " << abs(key[0] - '\x01') << " from " << endl;

Inside the quotes is the character with code 1, a black happy face that this website has no font for but the command window does. In VS2008 it looks like a half-post | with the right half of a T. Excel leaves a blank.

The following codes are not handled in C++ VS2008 with the std library and #include: 9, 10, 13, 26, 34, 44.

Also, the numerical 'distance' for codes 1 through 127 is correct, but at 128 the distance skips an extra step and is one further away for some reason. Then from 128 to 255 the distance reverses and becomes closer; 255 is 2 away from 1.

It'd be nice if these followed something more logical and were just 1 through 255 without hiccups, skips, or reversals, so that 255 - 1 = 254. But hey, what do I know.

EDIT2: I found it. Without the absolute value, the collation for UNIFORMAT is 128 to 255, then 1 to 127, which yields 1 to 255 with 6 skips for 9, 10, 13, 26, 34, 44 that are garbage. That was not intuitive. In the new order 128->255, 1->127 the strange skip from 127 to 128 is clearer: there is no 0, so the value is missing between 255 and 1.

SOLUTION: Make my own hashtable with values for each symbol, and do not rely on the C++ std library or VS2008 to provide the UNIFORMAT values, since they are not correct for measuring the char distance outside of several specific subsets of UNIFORMAT.
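
The ordering described in the edits above (128 to 255 first, then 1 to 127, with 255 being 2 away from 1) is consistent with the bytes being read as a signed char, which is the default for plain char in Visual C++. This is a guess about the cause, not something the post confirms. The sketch below uses two hypothetical helper names, dist_as_signed and dist_as_unsigned, to show how the same subtraction behaves under each interpretation.

```cpp
#include <cstdlib>  // std::abs

// Distance when both codes are first stored in a signed 8-bit char.
// On the usual two's-complement platforms, codes 128..255 become -128..-1.
int dist_as_signed(int code_a, int code_b) {
    signed char a = static_cast<signed char>(code_a);
    signed char b = static_cast<signed char>(code_b);
    return std::abs(a - b);  // operands are promoted to int before subtracting
}

// Distance when both codes are kept as unsigned bytes in the range 0..255.
int dist_as_unsigned(int code_a, int code_b) {
    unsigned char a = static_cast<unsigned char>(code_a);
    unsigned char b = static_cast<unsigned char>(code_b);
    return std::abs(static_cast<int>(a) - static_cast<int>(b));
}
```

With these helpers, dist_as_signed(255, 1) gives 2, while dist_as_unsigned(255, 1) gives 254. If this is indeed the cause, casting to unsigned char before subtracting may be enough for codes 0 to 255, without a hand-built table.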

Unicode code points don't define a single sort (or collation) order. When Excel sorts, it uses tables based on the currently selected language. For example, someone using Excel in English mode may get different sorting results than someone using Excel in Portuguese.
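
To see how far a purely numeric ordering is from what a language-aware sort would give, here is a minimal sketch that sorts characters by code point only. The function name sort_by_code_point is made up for this illustration, and it assumes a C++11 compiler for std::u32string.

```cpp
#include <algorithm>
#include <string>

// Sort the characters of a UTF-32 string by numeric code point only.
// No language rules are applied, so this is not a collation.
std::u32string sort_by_code_point(std::u32string s) {
    std::sort(s.begin(), s.end());
    return s;
}
```

Calling sort_by_code_point(U"aZ\u00E9b") returns the order Z, a, b, é, because Z is 0x5A, a is 0x61, b is 0x62 and é is 0xE9. A language-aware sort would typically place é next to e and would not put every capital letter ahead of every lowercase one, which is why the two orderings disagree.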

There are also issues of normalization. With Unicode, one "character" doesn't necessarily correspond to one value. Some characters can be represented in different ways. For example, a capital omega can be coded as a Greek letter or as a symbol for representing units of electrical resistance. In some languages, a single character may be composed of several consecutive values.
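
As a small illustration of the point above, the sketch below puts the two omega code points and two spellings of é side by side. The constant names and the helper code_point_gap are invented for this example, and it assumes a C++11 compiler.

```cpp
#include <string>

// Two code points that usually render as the same capital omega.
const char32_t kGreekCapitalOmega = U'\u03A9';  // GREEK CAPITAL LETTER OMEGA
const char32_t kOhmSign           = U'\u2126';  // OHM SIGN

// Two spellings of the same visible character.
const std::u32string kPrecomposed = U"\u00E9";   // one code point
const std::u32string kDecomposed  = U"e\u0301";  // 'e' + COMBINING ACUTE ACCENT

// Numeric gap between two code points, always non-negative.
long code_point_gap(char32_t a, char32_t b) {
    return a > b ? static_cast<long>(a - b) : static_cast<long>(b - a);
}
```

Here code_point_gap(kGreekCapitalOmega, kOhmSign) is 7549 even though the glyphs look alike, and kPrecomposed and kDecomposed compare unequal with lengths 1 and 2. A "distance" based on raw values therefore depends on which representation the text happens to use.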

The blank values probably correspond to glyphs that you don't have any font coverage for. Some systems use so-called "Unicode fonts", which have a large percentage of the glyphs you need for every script. Windows tends to switch fonts on the fly when the current font doesn't have a necessary glyph. Neither approach will have every glyph necessary. Also, some Unicode values don't encode to a visible glyph (e.g., there are many different kinds of spaces in Unicode), some values act more like ASCII-style control codes (e.g., paragraph separator or bidi controls), and some values only make sense when they combine with another character, like many of the "combining" accents.

So there's not an answer you're going to be satisfied with. Perhaps if you gave more information about what you're ultimately trying to do, we could suggest a different approach.

I don't think you can do what you want to do in Excel without limiting your approach significantly.

By experimentation, the CODE function never returns a value higher than 255. Any Unicode text that cannot be generated by the VBA code below is interpreted as a question mark (?), i.e. 63.

For x = 1 To 255
    Cells(x, 1).Value = Chr(x)
Next

You should be able to determine the difference using CODE. But if the character doesn't fall in that range, you'll need to go outside of Excel, because even VBA converts any other Unicode characters to the question mark (?), or 63.
