Why does `strchr` seem to work with multibyte characters, despite the man page disclaimer?

From:

man strchr

char *strchr(const char *s, int c);

The strchr() function returns a pointer to the first occurrence of the character c in the string s.

Here "character" means "byte"; these functions do not work with wide or multibyte characters.

Still, if I try to search for a multi-byte character like é (0xC3 0xA9 in UTF-8):

const char str[] = "This string contains é which is a multi-byte character";
char * pos = strchr(str, (int)'é');
printf("%s\n", pos);
printf("0x%X 0x%X\n", pos[-1], pos[0]); 

I get the following output:

which is a multi-byte character

0xFFFFFFC3 0xFFFFFFA9

Despite the warning:

warning: multi-character character constant [-Wmultichar]

So here are my questions:

  • What does it mean that strchr doesn't work with multi-byte characters? (It seems to work, provided the int type is big enough to contain your multi-byte character, which can be at most 4 bytes.)
  • How do I get rid of the warning, i.e. how do I safely recover the multi-byte value and store it in an int?
  • Why the 0xFFFFFF prefixes?

strchr() only seems to work for your multi-byte character.

The actual string in memory is

... c, o, n, t, a, i, n, s, ' ', 0xC3, 0xA9, ' ', w ...

When you call strchr(), you are really only searching for 0xA9, which is the low 8 bits of the value you passed. That's why pos[-1] holds the first byte of your multi-byte character: it was ignored during the search.

char is signed on your system, which is why your characters are sign-extended (the 0xFFFFFF prefix) when you print them out.

As for the warning, the compiler is trying to tell you that you are doing something odd, which you are. Don't ignore it.

That's the problem: it only seems to work. First, it is entirely up to the compiler what it puts in the string if you put multibyte characters in it, if indeed it compiles it at all. Clearly you are lucky (for some appropriate interpretation of lucky) in that it has filled your string with

.... c3, a9, ' ', 'w', etc

and that you are looking for c3a9, which it can find fairly easily. The man page on strchr says:

The strchr() function returns a pointer to the first occurrence of c (converted to a char) in string s

So you pass c3a9 to this, which is converted to a char with value 'a9'. It finds the a9 character and returns a pointer to it.

The ffffff prefix appears because you are outputting a signed char as a 32-bit hex number, so it is sign-extended for you. This is as expected.

The problem is that behaviour the standard leaves up to the implementation (the value of a multi-character constant is implementation-defined) is just that. It might work almost correctly. And it might not, depending on circumstances.

And again, it is only almost. You are not getting a pointer to the multibyte character; you are getting a pointer to the middle of it (and I'm surprised you're interpreting that as working). If the multibyte character had evaluated to 0xff20, you'd get pointed to somewhere much earlier in the string, at the first space (0x20).
