
C/C++ isspace() skipping multibyte string characters

I have the following functions, written to strip spaces from a string:

char *rtrim(char *l_ptr)
{
    char *lptr = l_ptr + strlen(l_ptr) - 1;
    for (; lptr != l_ptr && isspace((int)*lptr); lptr--)
        ;
    *lptr = '\0';
    return lptr;
}

char *ltrim(char *l_ptr)
{
    char *lptr;
    for (lptr = l_ptr; *lptr != '\0' && isspace((int)*lptr); lptr++)
        ;
    return lptr;
}


char *trim(char *l_ptr)
{
    return rtrim(ltrim(l_ptr));
}

The problem is that it also trims the following character:

removing leading spaces from "

            Ć"

removed leading spaces, resultant ""

The character is 0xc6, with a few spaces before it. I have checked that the code includes setlocale(LC_ALL, "");. LANG is set to pl_PL.iso88592. Any help is much appreciated.

Thanks.

The problem is how you are calling isspace. isspace only has defined results if the input is in the range [0, UCHAR_MAX] (or is EOF). On your system, char is probably signed, which means that (int)*lptr will result in a negative value for the accented characters (those with a code point larger than 127), which is not in the legal range.

When calling the one-parameter forms of is... (those in <cctype> or <ctype.h>), you should always cast anything of char type to unsigned char: isspace( static_cast<unsigned char>( *lptr ) ). (The implicit conversion of unsigned char to int will do the right thing.)
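As a rough illustration of that cast in C (where a C-style cast stands in for static_cast), here is a minimal sketch. The helper name is_space_char is hypothetical, not something from the question, and the ltrim shown is only a variant of the asker's function with the cast applied.

```c
#include <ctype.h>

/* Hypothetical helper: convert to unsigned char first, so a negative
   char value such as (char)0xC6 is never passed to isspace(). */
static int is_space_char(char c)
{
    return isspace((unsigned char)c) != 0;
}

/* Variant of the asker's ltrim using the safe helper. */
char *ltrim(char *s)
{
    while (*s != '\0' && is_space_char(*s))
        s++;
    return s;
}
```

With this change the byte 0xC6 reaches isspace as 198 rather than as a negative value, so the call stays within the range the standard defines.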

Your rtrim function ends as

*lptr = '\0';
return lptr;

This cannot ever return anything other than what will be seen as an empty string. In trim you then directly return that result.

Depending on how you want these functions to work, you should either make rtrim return the original value of l_ptr, which has remained unchanged and points to the start of the string, or make trim ignore the return value of rtrim.

You would have the same problem with all characters, not just 'Ć'.
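One possible shape for the first option (rtrim returning the start of the string) is sketched below. It is an index-based variant written for illustration, not the asker's code. It assumes the caller passes a writable, NUL-terminated buffer, and it also applies the unsigned char cast discussed in the other answer.

```c
#include <ctype.h>
#include <string.h>

/* Skip leading whitespace; returns a pointer into the same buffer. */
char *ltrim(char *s)
{
    while (*s != '\0' && isspace((unsigned char)*s))
        s++;
    return s;
}

/* Cut trailing whitespace in place and return the START of the string,
   so the result is still usable by the caller. */
char *rtrim(char *s)
{
    size_t len = strlen(s);
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;
    s[len] = '\0';
    return s;
}

char *trim(char *s)
{
    return rtrim(ltrim(s));
}
```

Because rtrim works with a length rather than a pointer that starts at the last character, it never forms a pointer before the buffer when the string is empty.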

If you are working with multibyte characters, it will probably be easier to switch to wchar_t, to avoid unnecessary hassle with char pointer manipulations.

And you can use iswspace for checking if the character is a white-space.
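A minimal sketch of what the wide-character route might look like follows. The names ltrim_w and rtrim_w are made up for this example, and it assumes the text has already been converted to wchar_t (for instance with mbstowcs after setlocale), which is not shown here.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical wide-character counterpart of ltrim. */
wchar_t *ltrim_w(wchar_t *s)
{
    while (*s != L'\0' && iswspace((wint_t)*s))
        s++;
    return s;
}

/* Hypothetical wide-character counterpart of rtrim; trims in place
   and returns the start of the string. */
wchar_t *rtrim_w(wchar_t *s)
{
    size_t len = wcslen(s);
    while (len > 0 && iswspace((wint_t)s[len - 1]))
        len--;
    s[len] = L'\0';
    return s;
}
```

Since each wchar_t element holds a whole character here, there is no signed char value to worry about when calling iswspace.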

rtrim() has multiple problems.

  1. isspace() is only defined for int values in the range of unsigned char, plus EOF. A char value outside the range 0 to CHAR_MAX (typically 0 to 127 when char is signed) needs to be converted to unsigned char before the implicit conversion to int. (@James Kanze)

    C11dr §7.4.1 "... the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF . If the argument has any other value, the behavior is undefined."

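  2. char *lptr = l_ptr + strlen(l_ptr) - 1; is bad when the string is empty: strlen("") is 0, so the pointer becomes l_ptr - 1, and that pointer value is not known to be valid. A new approach is needed. In that case it can also kick off a long loop with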

    for (; lptr != l_ptr ... ; lptr--)

  3. *lptr = '\0'; return lptr; always returns "". (@hvd) Likely the beginning of the string is desired.

  4. Suggested re-write:

     #include <ctype.h>

     char *rtrim(char *l_ptr)
     {
         unsigned char *ptr = (unsigned char *) l_ptr;
         unsigned char *end = ptr;  /* one past the last non-space seen */
         while (*ptr) {
             if (!isspace(*ptr++)) {
                 end = ptr;
             }
         }
         *end = '\0';
         return l_ptr;
     }
