How to remove non-ASCII word from a string in C#

I want to filter some strings that contain unwanted (non-ASCII) letters. The text looks different in Notepad, Visual Studio 2010 and MySQL.

How can I check whether a string contains non-ASCII letters, and how can I remove them?

You could use a regular expression to filter out non-ASCII characters:

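// Requires: using System.Text.RegularExpressions;
string input = "AB £ CD";
// Keep CR, LF, tab and printable ASCII (0x20-0x7E); remove everything else.
string result = Regex.Replace(input, @"[^\x0d\x0a\x20-\x7e\t]", "");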

You could use regular expressions. The following removes everything except ASCII letters and digits, which includes spaces and punctuation:

Regex.Replace(input, "[^a-zA-Z0-9]+", "")

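You could also use \\W+ (or @"\W+" as a verbatim string) as the pattern to remove any non-word character. Note that in .NET, \w matches Unicode letters and digits by default, so \W+ leaves non-ASCII letters in place unless RegexOptions.ECMAScript is specified.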
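To see how \W+ behaves with and without ECMAScript mode, here is a minimal sketch. The class and method names (WordCharDemo, StripNonWordUnicode, StripNonWordAscii) are hypothetical and used only for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class WordCharDemo
{
    // Default .NET behavior: \w includes Unicode letters, so "é" is kept.
    public static string StripNonWordUnicode(string input) =>
        Regex.Replace(input, @"\W+", "");

    // ECMAScript mode: \w is limited to [a-zA-Z_0-9], so "é" is removed.
    public static string StripNonWordAscii(string input) =>
        Regex.Replace(input, @"\W+", "", RegexOptions.ECMAScript);
}
```

For the input "AB £ CD é", the first method should keep the "é" while the second should drop it, so only the ECMAScript variant yields pure ASCII.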

This has been a godsend:

Regex.Replace(input, @"[^\u0000-\u007F]", "");

I think I originally got it elsewhere, but here is a link to the same answer:

How can you strip non-ASCII characters from a string? (in C#)
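The question also asks how to check for non-ASCII characters, which the answers so far do not show. A minimal sketch, reusing the same character class; the names AsciiCheck, ContainsNonAscii and ContainsNonAsciiLinq are hypothetical:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class AsciiCheck
{
    // True if any character falls outside U+0000..U+007F.
    public static bool ContainsNonAscii(string input) =>
        Regex.IsMatch(input, @"[^\u0000-\u007F]");

    // Same check without a regex: compare each char's code unit to 127.
    public static bool ContainsNonAsciiLinq(string input) =>
        input.Any(c => c > 127);
}
```

Both variants should agree on any input; the LINQ version avoids the regex engine, which may matter if the check runs very frequently.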

First, you need to determine what you mean by a "word". If you mean non-ASCII, does that imply non-English?

Personally, I'd ask why you need to do this, and which fundamental assumption in your application conflicts with your data. Depending on the situation, I suggest you either re-encode the text from the source encoding (although this will be a lossy conversion) or address that fundamental assumption so that your application handles the data correctly.
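As an illustration of the lossy re-encoding option, here is a minimal sketch that round-trips the string through an ASCII encoding with a replacement fallback. The class and method names (AsciiReencode, ToAscii) and the replacement parameter are assumptions made for this example; the answer itself does not prescribe a particular API.

```csharp
using System.Text;

// Hypothetical helper class, not part of the original answer.
public static class AsciiReencode
{
    // Encodes to ASCII bytes and back. Characters that ASCII cannot
    // represent are substituted with `replacement` (empty string = dropped).
    public static string ToAscii(string input, string replacement = "")
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderExceptionFallback());
        return ascii.GetString(ascii.GetBytes(input));
    }
}
```

Passing "?" as the replacement keeps a visible marker where data was lost, which can make the lossy conversion easier to audit than silently dropping characters.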

I think something as simple as this would probably work, wouldn't it?

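using System.Linq;

public static class StringExtensions
{
    // Keeps chars with code <= 127 (ASCII), or <= 255 when includeExtendedAscii is true.
    public static string AsciiOnly(this string input, bool includeExtendedAscii)
    {
        int upperLimit = includeExtendedAscii ? 255 : 127;
        char[] asciiChars = input.Where(c => (int)c <= upperLimit).ToArray();
        return new string(asciiChars);
    }
}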

Example usage:

string input = "AB£ȼCD";
string asciiOnly = input.AsciiOnly(false); // returns "ABCD"
string extendedAsciiOnly = input.AsciiOnly(true); // returns "AB£CD"
