
Character Encoding independent character swap

I like to use this piece of code when I want to reverse a string [when I am not using std::string or other built-in functions in C]. As a beginner, when I first thought of this, I had the ASCII table in mind. I think it can work well with Unicode too. I assumed that it works because the difference between the character values (ASCII etc.) is fixed.

Are there any character encodings in which this code may not work?

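#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[11];
    int len, i;
    strcpy(a, "Particl");
    printf("%s\n", a);
    len = (int)strlen(a);
    /* Swap a[i] with its mirror a[len-1-i] by add/subtract, without a temporary. */
    for (i = 0; i < len / 2; i++)
    {
        a[i] += a[len-1-i];
        a[len-1-i] = a[i] - a[len-1-i];
        a[i] -= a[len-1-i];
    }
    printf("%s\n", a);
    return 0;
}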

Update:

This link is informative in connection with this question.

This will not work with any encoding in which some (not necessarily all) codepoints require more than one char unit to represent, because you are reversing byte by byte instead of codepoint by codepoint. For the usual 8-bit char, this includes every encoding that can represent all of Unicode.

For example, in UTF-16BE the string "hello" maps to the byte sequence 00 68 00 65 00 6c 00 6c 00 6f. Your algorithm applied to this byte sequence will produce the sequence 6f 00 6c 00 6c 00 65 00 68 00, which is the UTF-16BE encoding of the string "漀氀氀攀栀".

It gets worse: a codepoint-by-codepoint reversal of a Unicode string still won't produce the correct result in all cases, because Unicode has many codepoints that act on their surroundings rather than standing alone as characters. As a trivial example, codepoint-reversing the string "Spın̈al Tap", which contains U+0308 COMBINING DIAERESIS, will produce "paT länıpS". See how the diaeresis has migrated from the N to the A? The consequences of codepoint-by-codepoint reversal on a string containing bidirectional overrides or conjoining jamo would be even more dire.
