
Removing unrecognized ASCII characters from string

I'm parsing HTML using HTML Agility Pack, and from time to time I get weird-looking strings like "–". What is the simplest way to remove them? By the way, I'm using C#.

You probably need to look into why you are getting those characters in the first place; most likely something is wrong with the encoding.
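To illustrate the encoding point, here is a minimal sketch of how the same bytes turn into garbage when decoded with the wrong encoding. The class and member names are made up for the example, and it assumes the page was UTF-8 but got read as a single-byte encoding (ISO-8859-1 here; Windows-1252 gives a similar result):

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of HTML Agility Pack.
public static class MojibakeDemo
{
    // The UTF-8 encoding of U+2013 (en dash) is three bytes.
    public static readonly byte[] EnDashUtf8 = { 0xE2, 0x80, 0x93 };

    // Correct: decoding as UTF-8 yields a single character, U+2013.
    public static string DecodeCorrectly() =>
        Encoding.UTF8.GetString(EnDashUtf8);

    // Wrong: ISO-8859-1 maps each byte to one character, so the result
    // is three characters, starting with 'â' (U+00E2).
    public static string DecodeWrongly() =>
        Encoding.GetEncoding("ISO-8859-1").GetString(EnDashUtf8);
}
```

If this is what is happening, passing the right encoding when loading the document (for example via the `HtmlDocument.Load` overloads that take an `Encoding`) should fix the text at the source instead of stripping it afterwards.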

But if you do need to remove all the non-ASCII characters from a string, the regex [^ -~] does the trick:

        // requires: using System; using System.Text.RegularExpressions;
        var stripped = Regex.Replace("străipped of baâ€d charâ€acters", "[^ -~]", "");
        Console.WriteLine(stripped); // outputs "stripped of bad characters"

See http://www.catonmat.net/blog/my-favorite-regex/ for an explanation of why that regex works. In short, space (0x20) through tilde (0x7E) covers every printable ASCII character, so the negated class matches everything else, including tabs and newlines.
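As a rough sketch of what the character class is doing, the regex can be wrapped in a helper and compared with a plain range check that does the same thing without a regex. The class and method names here are only for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class AsciiFilter
{
    // Regex form: remove every char outside ' ' (0x20) .. '~' (0x7E).
    public static string StripWithRegex(string input) =>
        Regex.Replace(input, "[^ -~]", "");

    // Same filter written as an explicit range check on each char.
    public static string StripWithLinq(string input) =>
        new string(input.Where(c => c >= ' ' && c <= '~').ToArray());
}
```

Both versions also drop tabs and line breaks, since those sit below 0x20. If you want to keep them, something like `[^ -~\t\r\n]` should work instead.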

