Multiple regex checks on same string with no overlapping results, while ignoring words

I have a string in which I want to check for the existence of two different types of item. The two types are not mutually exclusive, so I also want to avoid overlap. They can occur in any order. The string also contains words that should be ignored, even though they fit the regex pattern.

  1. There must be one alpha-only item of one or more characters: [a-zA-Z]+
  2. Another item needs to be alphanumeric, also one or more characters: [a-zA-Z0-9]+
  3. The item counted as alphanumeric cannot also be counted as the alpha-only item, and vice versa.
  4. The items in an exclusion list should be ignored.

I tried following the post "Regex: I want this AND that AND that... in any order", but I still can't figure out how to exclude the words I need to, nor how to adapt that answer so that a single word cannot satisfy both the alphanumeric and the alpha-only criteria.

This is what I'm currently doing. It seems to work, but it is not very concise. If possible, I'd like to learn how to fold this into a single regex check. Apart from conciseness, I feel a regex will be safer down the road in case I end up needing to add more conditions.

bool bHasAlpha = false;
bool bHasAlphaNum = false;
string Test = "123 ABC SomeWord A12";   //The string to check against.
string[] RemoveWords = { "ABC", "DEF" };  //I don't want these matches to count, if found.

//Split my string into "tokens" and check each individually, ignoring the RemoveWords.
string[] TestTokens = Test.Split(' ')
            .Where(w => !RemoveWords.Contains(w, StringComparer.OrdinalIgnoreCase))
            .ToArray();

foreach (string s in TestTokens)
{
    //Is this item alpha-only? (Checking this before alphanumeric)
    if (!bHasAlpha && Regex.IsMatch(s, @"^[a-zA-Z]+$"))
        bHasAlpha = true;
    //Is this item alphanumeric?
    else if (!bHasAlphaNum && Regex.IsMatch(s, @"^[a-zA-Z0-9]+$"))
        bHasAlphaNum = true;
}

if (bHasAlpha && bHasAlphaNum)
    Console.WriteLine("String Passes!");

In the test code above, the string passes because "123" is caught by the alphanumeric check and "SomeWord" is caught by the alpha-only check. "ABC" is not counted because I purposely ignore it.

Examples of strings that should fail:

  • "123 abc 456" (abc is ignored, so there is no valid alpha-only item)
  • "X" (X can satisfy either the alpha-only or the alphanumeric criterion, but not both)
  • "ABC DEF 123 456" (ABC and DEF are ignored, so there is no valid alpha-only item)

The following should pass:

  • "ABCDEF 123" (ABCDEF as a whole word is not considered the same as ABC and DEF separately)
  • "X X" (2 "words", neither of which is in the excluded list. One satisfies the alphanumeric criterion, the other the alpha-only criterion.)
  • "ABC XYZ ABC DEF A1B2 ABC" (XYZ is alpha-only, A1B2 is alphanumeric)
  • "123 XYZ" (the order of the 2 items does not matter; the alpha-only item can come second)
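To make the expected results above easy to check, here is a minimal sketch that wraps the token-based logic from the question in a function and runs it against the listed examples. The class name TokenCheck and the method name Passes are hypothetical names chosen for illustration, and the two-word case is written with a space between the two X tokens, as its description implies.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenCheck
{
    // Exclusion list from the question; compared case-insensitively.
    static readonly string[] RemoveWords = { "ABC", "DEF" };

    // Hypothetical helper: true when the input has one alpha-only token
    // and a second, different token that is alphanumeric.
    public static bool Passes(string input)
    {
        bool hasAlpha = false, hasAlphaNum = false;

        // Split on spaces and drop the excluded words.
        var tokens = input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(t => !RemoveWords.Contains(t, StringComparer.OrdinalIgnoreCase));

        foreach (string t in tokens)
        {
            // Each token is counted at most once: alpha-only first, then alphanumeric.
            if (!hasAlpha && Regex.IsMatch(t, "^[a-zA-Z]+$"))
                hasAlpha = true;
            else if (!hasAlphaNum && Regex.IsMatch(t, "^[a-zA-Z0-9]+$"))
                hasAlphaNum = true;
        }
        return hasAlpha && hasAlphaNum;
    }

    public static void Main()
    {
        string[] examples =
        {
            "123 abc 456", "X", "ABC DEF 123 456",                          // expected to fail
            "ABCDEF 123", "X X", "ABC XYZ ABC DEF A1B2 ABC", "123 XYZ"      // expected to pass
        };
        foreach (string s in examples)
            Console.WriteLine("{0}:\t{1}", Passes(s) ? "Pass" : "Fail", s);
    }
}
```

Keeping the check in one function like this also gives any later single-regex version something to be compared against.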

I'm afraid a single regex doing all of this would be overly complicated, if it is possible at all. However, you can simplify the code by using regexes in two steps:

  1. Remove all the words to be ignored with this regex:

     \b(?:ABC|DEF)\b 
  2. Check whether the remaining string matches the "alpha & alphanumeric" condition (pseudocode below for the sake of clarity):

     ALPHA.*ALNUM|ALNUM.*ALPHA 

In C#:

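// RemoveWords is the exclusion list from the question; testStrings is the set of inputs to check.
// Step 1: a regex that matches any excluded word as a whole word.
var removeRegex = new Regex(@"\b(?:" + string.Join("|", RemoveWords) + @")\b", RegexOptions.IgnoreCase);
var alpha = @"\b[a-z]+\b";
var alnum = @"\b[a-z0-9]+\b";
// Step 2: an alpha-only word followed by an alphanumeric word, or the other way round.
var matchRegex = new Regex(string.Format(@"{0}.*{1}|{1}.*{0}", alpha, alnum), RegexOptions.IgnoreCase);
foreach (var s in testStrings)
{
    // Strip the excluded words first, then test what is left.
    var ok = matchRegex.Match(removeRegex.Replace(s, "")).Success;
    Console.WriteLine("{0}:\t{1}", ok ? "OK" : "Failed", s);
}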

Demo: https://dotnetfiddle.net/Ms3uKV
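For comparison, the removal step could in principle be folded into the match itself by putting a negative lookahead in front of each word pattern. The following is only a sketch under assumptions: the class name SingleRegexSketch and the method Build are hypothetical, words are delimited by \b rather than by spaces only, and the exclusion list is assumed to be non-empty. It has been reasoned through against the examples in the question, not tested beyond them.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SingleRegexSketch
{
    // Hypothetical helper: builds one pattern in which excluded words can never
    // be picked as the alpha-only or the alphanumeric item.
    // Assumes removeWords is non-empty; an empty list would reject every word.
    public static Regex Build(string[] removeWords)
    {
        string excluded = string.Join("|", removeWords.Select(w => Regex.Escape(w)));

        // At the start of a word, refuse to continue if the whole word is an excluded one.
        string notExcluded = @"\b(?!(?:" + excluded + @")\b)";
        string alpha = notExcluded + @"[a-z]+\b";
        string alnum = notExcluded + @"[a-z0-9]+\b";

        // Two distinct words, in either order.
        return new Regex(alpha + ".*" + alnum + "|" + alnum + ".*" + alpha,
                         RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex regex = Build(new[] { "ABC", "DEF" });
        string[] examples =
        {
            "123 abc 456", "X", "ABC DEF 123 456",
            "ABCDEF 123", "X X", "ABC XYZ ABC DEF A1B2 ABC", "123 XYZ"
        };
        foreach (string s in examples)
            Console.WriteLine("{0}:\t{1}", regex.IsMatch(s) ? "OK" : "Failed", s);
    }
}
```

The trade-off is readability: the two-step version keeps the exclusion list and the matching rule separate, while this variant repeats the lookahead inside every word pattern.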

Once you have found an alpha-only item, you need to skip that check for the remaining tokens.

foreach (string s in TestTokens)
{
    //Is this item alpha-only? (Checking this before alphanumeric)
    if (!bHasAlpha && Regex.IsMatch(s, @"^[a-zA-Z]+$")) // Add !bHasAlpha to skip checking once you have found one
        bHasAlpha = true;
    //Is this item alphanumeric?
    else if (Regex.IsMatch(s, @"^[a-zA-Z0-9]+$"))
        bHasAlphaNum = true;
}
