C++ Strip non-ASCII Characters from string

Before you get started: yes, I know this is a duplicate question, and yes, I have looked at the posted solutions. My problem is that I could not get them to work.

bool invalidChar (char c)
{ 
    return !isprint((unsigned)c); 
}
void stripUnicode(string & str)
{
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end()); 
}

I tested this method on "Prusæus, Ægyptians," and it did nothing. I also tried swapping isprint for isalnum.

The real problem occurs when, in another section of my program, I convert string->wstring->string. The conversion balks if there are Unicode characters in the string->wstring conversion.

Ref:

How can you strip non-ASCII characters from a string? (in C#)

How to strip all non alphanumeric characters from a string in c++?

Edit:

I would still like to remove all non-ASCII characters regardless, but if it helps, here is where I am crashing:

// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH

Error Dialog

MSVC++ Debug Library

Debug Assertion Failed!

Program: //myproject

File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c

Line: //Above

Expression: (unsigned)(c+1)<=256

Edit:

Further compounding the matter: the .txt file I am reading from is ANSI encoded. Everything within it should be valid.

Solution:

bool invalidChar (char c) 
{  
    return !(c>=0 && c <128);   
} 
void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());  
}

If someone else would like to copy/paste this as an answer, I can check this question off.

EDIT:

For future reference: try using the __isascii and iswascii functions.

At least one problem is in your invalidChar function. It should be:

return !isprint( static_cast<unsigned char>( c ) );

Casting a char directly to unsigned is likely to give a very, very big value if the char is negative (UINT_MAX+1 + c). Passing such a value to isprint is undefined behavior.
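To make the difference between the two casts concrete, here is a minimal sketch. The helper names widenToUnsigned and widenViaUnsignedChar are made up for illustration and are not part of the original code; the comments assume a platform where plain char is signed, as with MSVC on x86.

```cpp
// Hypothetical helper: widens a char the way the question's code does,
// straight to unsigned int. With a signed char, a negative value wraps
// around to UINT_MAX + 1 + c, which is far outside what isprint accepts.
unsigned widenToUnsigned(char c)
{
    return (unsigned)c;
}

// Hypothetical helper: goes through unsigned char first, so the result
// always lies in [0, UCHAR_MAX], the range the <cctype> functions are
// defined for (plus EOF).
int widenViaUnsignedChar(char c)
{
    return static_cast<unsigned char>(c);
}
```

For the byte 0xE6 (the letter æ in Latin-1 and Windows-1252), widenViaUnsignedChar yields 230, while widenToUnsigned yields a value above 255 whenever char is signed. That out-of-range value is consistent with the (unsigned)(c+1)<=256 assertion shown in the question.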

Another solution that doesn't require defining two functions, using a lambda expression instead (available since C++11):

void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), [](char c){return !(c>=0 && c <128);}), str.end());  
}

I think it looks cleaner.

isprint depends on the locale, so the character in question must be printable in the current locale.

If you want strictly ASCII, check that the value is in the range [0..127]. If you want printable ASCII, check the range and isprint.
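The two checks described above could be sketched as follows. The function names (isNonAscii, isNonPrintableAscii, stripNonAscii, keepPrintableAscii) are assumptions chosen for this example, and the printable variant relies on the default "C" locale being in effect.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True for any byte outside the 7-bit ASCII range [0, 127].
bool isNonAscii(char c)
{
    return static_cast<unsigned char>(c) > 127;
}

// True for anything that is not printable ASCII: the range check comes
// first, then isprint is called on a value it is defined for.
bool isNonPrintableAscii(char c)
{
    unsigned char uc = static_cast<unsigned char>(c);
    return uc > 127 || !std::isprint(uc);
}

// Returns a copy of s with every non-ASCII byte removed.
std::string stripNonAscii(std::string s)
{
    s.erase(std::remove_if(s.begin(), s.end(), isNonAscii), s.end());
    return s;
}

// Returns a copy of s keeping only printable ASCII characters.
std::string keepPrintableAscii(std::string s)
{
    s.erase(std::remove_if(s.begin(), s.end(), isNonPrintableAscii), s.end());
    return s;
}
```

Applied to the Latin-1 bytes of "Prusæus, Ægyptians,", stripNonAscii drops the two accented letters and leaves the rest untouched; keepPrintableAscii additionally removes control characters such as tabs and newlines.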
