
Removing words with special characters in them

I have a long string composed of a number of different words.

I want to go through all of them, and if a word contains a special character or number (except '-'), or starts with a capital letter, I want to delete it (the whole word, not just that character). For all intents and purposes, 'foreign' letters can count as special characters.

The obvious solution is to loop through each word (after splitting the string) and then loop through each character, but I'm hoping there's a faster way of doing it. Perhaps using regex, but I have almost no experience with it.

Thanks

ADDED:

(What I want, for example:)

Input: "this Is an Example of 5 words in an input like-so from example.com"

Output: {this,an,of,words,in,an,input,like-so,from}

(What I've tried so far)

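List<string> response = new List<string>();

string[] splitString = text.Split(' ');

foreach (string s in splitString)
{
    bool add = true;
    foreach (char c in s.ToCharArray())
    {
        // Reject the word at the first character that is neither '-' nor a lowercase letter.
        if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
        {
            add = false;
            break;
        }
    }
    // Add the word once, after all of its characters have been checked.
    if (add)
    {
        response.Add(s);
    }
}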

Edit 2:

For me a word should be a number of characters (a..z) separated by a space. A trailing , . ! or ... shouldn't count toward the 'special character' condition (which is really mostly just there to remove URLs and the like).

So: "I saw a dog. It was black!" should result in {saw,a,dog,was,black}

So you want to find all "words" that contain only the characters a-z or -, where words are separated by spaces?

A regex like this will find such words:

(?<!\S)[a-z-]+(?!\S)

To also allow words that end with a single punctuation mark, you could use:

(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))

Example (ideone):

var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";

var m = Regex.Matches(str, re);

Console.WriteLine("Matched: ");
foreach (Match i in m)
    Console.Write(i + " ");

Notice the punctuation in the string.

Output:

Matched: 
this an of words in an input like-so from foo bar 

How about this?

(?<=^|\\s+)(?<word>[a-z-]+)(?=$|\\s+)

Edit: Meant (?<=^|\\s+)(?<word>[a-z\\-]+)(?=(?:\\.|,|!|\\.\\.\\.)?(?:$|\\s+))

Rules:

  1. A word can only be preceded by the start of the line or some number of whitespace characters
  2. A word can only be followed by the end of the line or some number of whitespace characters (the edit also supports words ending with periods, commas, exclamation points, and ellipses)
  3. A word can only contain lowercase (Latin) letters and dashes

The named group containing each word is "word".
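As a rough sketch of how the edited pattern above might be used, the following reads the "word" group from each match. The class and method names (WordGroupDemo, ExtractWords) are made up for illustration, and the pattern is written as a verbatim string, so the doubled backslashes from the answer become single ones.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not from the original answer.
public static class WordGroupDemo
{
    // The edited pattern from the answer, as a verbatim (@"...") string.
    private const string Pattern =
        @"(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))";

    public static List<string> ExtractWords(string input)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            // Read the named group rather than the whole match.
            words.Add(m.Groups["word"].Value);
        }
        return words;
    }

    public static void Main()
    {
        // Sample sentence taken from the question's second edit.
        Console.WriteLine(string.Join(",", ExtractWords("I saw a dog. It was black!")));
        // prints: saw,a,dog,was,black
    }
}
```

The trailing punctuation sits inside the lookahead, so it is checked but never becomes part of the captured word.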

Take a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide); it covers regular expressions in C#.

List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};

// Iterate backwards so removals do not shift the indices still to be visited.
for (int i = strings.Count - 1; i >= 0; i--)
{
    if (strings[i].Contains("-"))
    {
        strings.RemoveAt(i);
    }
}

This could be a starting point. Right now it checks only for "." as a special character. Apart from the extra spaces left where words were removed, this outputs: "this an of words in an like-so from"

        string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
        string line = "this Is an Example of 5 words in an in3put like-so from example.com";

        System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
        line = r.Replace(line,"");
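Because Replace leaves the surrounding spaces in place, one possible follow-up step is to split the result and drop the empty entries. This is only a sketch; BlacklistDemo and RemainingWords are illustrative names, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not from the original answer.
public static class BlacklistDemo
{
    public static string[] RemainingWords(string line)
    {
        // Same black-list pattern as above: capitalised words, words with digits, words with dots.
        const string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
        string stripped = Regex.Replace(line, pattern, "");

        // Removed words leave runs of spaces behind; skip the empty pieces when splitting.
        return stripped.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string line = "this Is an Example of 5 words in an in3put like-so from example.com";
        Console.WriteLine(string.Join(",", RemainingWords(line)));
        // prints: this,an,of,words,in,an,like-so,from
    }
}
```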

You can do this in two ways: the white-list way and the black-list way. With a white-list you define the set of characters that you consider acceptable; with a black-list it's the opposite.

Let's assume the white-list way, and that you accept only the characters a-z, A-Z and the - character. Additionally, you have the rule that the first character of a word cannot be an uppercase character.

With this you can do something like this:

string target = "This is a white-list example: (Foo, bar1)";

var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");

string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", words));

Outputs:

// is, a, white-list, example

You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:

(?<=\s|^)[a-z-]+(?=\s|$)

The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before them is whitespace (or the start of the string) and what comes after them is whitespace or the end of the string.

All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.

Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
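A minimal sketch of that last step, assuming the sample input from the question; LookaroundDemo and FindWords are illustrative names only:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not from the original answer.
public static class LookaroundDemo
{
    public static string[] FindWords(string input)
    {
        // The look-behind/look-ahead pattern from this answer.
        const string regexString = @"(?<=\s|^)[a-z-]+(?=\s|$)";

        // MatchCollection is enumerated as Match objects; keep only the matched text.
        return Regex.Matches(input, regexString)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "this Is an Example of 5 words in an input like-so from example.com";
        Console.WriteLine(string.Join(",", FindWords(input)));
        // prints: this,an,of,words,in,an,input,like-so,from
    }
}
```

Note that this pattern does not allow trailing punctuation, so a word such as "dog." would be skipped.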
