
C++ Strip non-ASCII Characters from string

Before you get started: yes, I know this is a duplicate question, and yes, I have looked at the posted solutions. My problem is that I could not get them to work.

bool invalidChar (char c) {
    return !isprint((unsigned)c);
}
void stripUnicode(string & str) {
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}

I tested this method on "Prusæus, Ægyptians," and it did nothing. I also tried substituting isalnum for isprint.

The real problem occurs when, in another section of my program, I convert string -> wstring -> string. The conversion balks if there are Unicode chars in the string during the string -> wstring step.


How can you strip non-ASCII characters from a string? (in C#)

How to strip all non alphanumeric characters from a string in c++?


I would still like to remove all non-ASCII chars regardless, but if it helps, here is where I am crashing:

// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH

Error Dialog

MSVC++ Debug Library

Debug Assertion Failed!

Program: //myproject

File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c

Line: //Above
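
For the string -> wstring step above, a minimal sketch that checks the return value of mbstowcs instead of assuming the conversion succeeded. The helper name toWide is hypothetical and not from the original post. mbstowcs returns (size_t)-1 when it hits a byte sequence that is invalid in the current C locale, so which inputs are rejected depends on the locale set via setlocale.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post).
// Converts a narrow string to a wide string using the current C locale.
// Returns false if mbstowcs reports an invalid multibyte sequence.
bool toWide(const std::string& narrow, std::wstring& wide) {
    // One extra slot so the buffer is never empty; a vector frees itself,
    // unlike the new[] buffer in the snippet above.
    std::vector<wchar_t> buffer(narrow.size() + 1, L'\0');

    std::size_t written =
        std::mbstowcs(buffer.data(), narrow.c_str(), narrow.size());
    if (written == static_cast<std::size_t>(-1)) {
        return false;  // conversion failed; leave 'wide' untouched
    }

    // Copy exactly the characters that were written.
    wide.assign(buffer.data(), written);
    return true;
}
```

Calling toWide(ANSIWord, wWord) and checking the result would at least show whether the conversion itself is failing before wWord is used.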



Further compounding the matter: the .txt file I am reading in from is ANSI encoded. Everything within should be valid.


// True for any char outside the 7-bit ASCII range [0, 127].
bool invalidChar (char c) {
    return !(c>=0 && c <128);
}
void stripUnicode(string & str) {
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}

If someone else would like to copy/paste this, I can check this question off.


For future reference: try using the __isascii and iswascii functions (Microsoft C runtime extensions).



At least one problem is in your invalidChar function. It should be:

return !isprint( static_cast<unsigned char>( c ) );

Casting a char to unsigned (that is, unsigned int) is likely to give a very, very big value if the char is negative (UINT_MAX + 1 + c). Passing such a value to isprint is undefined behavior.
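
To make the difference between the two casts concrete, here is a small sketch. The helper names are made up for illustration, and signed char is used explicitly because the signedness of plain char is implementation-defined. It assumes 8-bit bytes.

```cpp
#include <climits>

// Hypothetical helpers for illustration only.

// Mimics the cast from the question: the (possibly negative) value is
// converted straight to unsigned int, so -26 becomes UINT_MAX + 1 - 26.
unsigned int widenLikeQuestion(signed char c) {
    return (unsigned)c;
}

// The corrected cast: convert to unsigned char first, so the result
// always lies in the range 0..UCHAR_MAX that isprint accepts.
unsigned int widenViaUnsignedChar(signed char c) {
    return static_cast<unsigned char>(c);
}
```

With the byte 0xE6 (æ in Windows-1252, which is -26 as a signed char), the first helper yields UINT_MAX - 25 while the second yields 230.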

Another solution that doesn't require defining two functions but uses a lambda (anonymous function), available since C++11:

void stripUnicode(string & str) {
    str.erase(remove_if(str.begin(),str.end(), [](char c){return !(c>=0 && c <128);}), str.end());
}

I think it looks cleaner.

isprint depends on the locale, so the character in question must be printable in the current locale.

If you want strictly ASCII, check for the range [0..127]. If you want printable ASCII, check the range and isprint.
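
A sketch of both variants, in case it helps. The function names are hypothetical and not from the posts above. The printable check assumes the default "C" locale, and the range test runs first so that isprint only ever sees values in [0..127].

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical names, for illustration only.

// True for any byte in the 7-bit ASCII range [0, 127].
bool isAscii(char c) {
    return static_cast<unsigned char>(c) < 128;
}

// True only for ASCII characters that isprint also accepts.
bool isPrintableAscii(char c) {
    return isAscii(c) && std::isprint(static_cast<unsigned char>(c)) != 0;
}

// Returns a copy of s with every non-ASCII byte removed.
std::string keepAscii(std::string s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](char c) { return !isAscii(c); }),
            s.end());
    return s;
}

// Returns a copy of s with everything but printable ASCII removed,
// so control characters such as '\n' are dropped as well.
std::string keepPrintableAscii(std::string s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](char c) { return !isPrintableAscii(c); }),
            s.end());
    return s;
}
```

The first variant keeps newlines and tabs, the second drops them, so which one fits depends on whether control characters should survive.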

