Regex pattern confusion in C#

I'm trying to write a basic function that takes an input text, creates a regex for this input, and returns all matches as a collection.
I wrote this:

string pattern =  @"(\wh*al*re)";  // take this pattern from outside 
Regex rg = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matchedAuthors = rg.Matches(authors);
for (int count = 0; count < matchedAuthors.Count; count++)
{
    Console.WriteLine(count);
    Console.WriteLine(matchedAuthors[count].Value);
}

My text --> "asdad healthcare basdasd"
But if I'm given the pattern h*al*re, my regex pattern looks like this --> (\wh*al*re)
and the output is --> "are"

Expected behaviour

Input: h*al*re Output: healthcare

What is the problem in my regex?

The solution is

(\bh\w*al\w*re)

Thanks to @anubhava
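Since the goal was a function that takes the pattern from outside, here is a minimal sketch of one way to build the accepted regex from a wildcard string such as h*al*re. The class and method names (WildcardSearch, ToPattern, FindAll) are hypothetical, and the sketch assumes that * should stand for any run of word characters and that the match should start at a word boundary.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WildcardSearch
{
    // Hypothetical helper: turns "h*al*re" into @"\bh\w*al\w*re".
    public static string ToPattern(string wildcard)
    {
        // Escape each literal piece so characters such as '.' or '+' are matched literally.
        var pieces = wildcard.Split('*').Select(p => Regex.Escape(p));
        // Each '*' becomes "\w*": zero or more word characters.
        return @"\b" + string.Join(@"\w*", pieces);
    }

    // Returns every match of the wildcard in the text, ignoring case.
    public static string[] FindAll(string text, string wildcard)
    {
        var rg = new Regex(ToPattern(wildcard), RegexOptions.IgnoreCase);
        return rg.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        foreach (var hit in FindAll("asdad healthcare basdasd", "h*al*re"))
            Console.WriteLine(hit);  // healthcare
    }
}
```

Using \w* rather than .* keeps each match inside a single word, which is what the accepted pattern does.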

What is the problem in my regex?

Regex is not like DOS filename wildcards

In DOS, h*al*re really would match "healthcare", because * in DOS means "zero or more of any character"

In regex the meaning is subtly different; it means "zero or more of the thing to the left of the asterisk"

  • h* - means zero or more h characters in a row
  • l* - means zero or more l characters in a row

This means that h*al*re will match something like "hhhhhhhhhallllllllre" or "hhalllllllllllllllllllllllllllllllre" or (as you have found) "are", which is zero "h", then "a", then zero "l", then "re". It fully complies with a pattern that asks for zero or more "h", then "a", then zero or more "l", then "re"
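A small sketch to see this for yourself; StarDemo and FirstMatch are names made up for the illustration, and the inputs are the strings discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StarDemo
{
    // Returns the first thing h*al*re matches in the input ("" if nothing matches).
    public static string FirstMatch(string input)
    {
        return Regex.Match(input, "h*al*re").Value;
    }

    public static void Main()
    {
        // Zero "h", then "a", zero "l", then "re": the tail of "healthcare".
        Console.WriteLine(FirstMatch("asdad healthcare basdasd"));  // are
        // Several "h", then "a", several "l", then "re".
        Console.WriteLine(FirstMatch("hhhallllre"));                // hhhallllre
    }
}
```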

What you need to do is combine * with another regex construct such as . which means "any single character" (other than a newline, by default).

When you put .* it means "match zero or more of: any single character"

Thus your regex to match "healthcare" is h.*al.*re

Note that it would also match heealthcare, hzzzzzzalzzzzzzre etc.


The next thing you have to contend with is the concept of greedy vs pessimistic matching (pessimistic matching is more commonly called lazy or non-greedy matching)

.* is greedy; it tries to match as much as possible. This means it consumes the entire input, then spits it back out a char at a time trying to make the match succeed

If you had the sentence "the biggest issue in healthcare is that healthcare providers are overloaded everywhere" and you ran h.*a.*re on it, the match would be "he biggest issue in healthcare is that healthcare providers are overloaded everywhere"

The fixed characters in your regex land as follows: the "h" is the first h in the input (the one in "the"), the "a" is the last a that still has an "re" somewhere after it (the one in "overloaded"), and the "re" is the final "re" of "everywhere". Everything in between is what the two .* are matching - this is what you get when you try to match as much as possible
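To see which part of the sentence each piece of h.*a.*re consumes, a minimal sketch is to wrap every piece in a capture group and print the groups. The grouping and the GreedyDemo name are additions for illustration only; the groups do not change what is matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    public const string Sentence =
        "the biggest issue in healthcare is that healthcare providers are overloaded everywhere";

    // Groups 1..5 correspond to h, first .*, a, second .*, re.
    public static Match Run()
    {
        return Regex.Match(Sentence, "(h)(.*)(a)(.*)(re)");
    }

    public static void Main()
    {
        Match m = Run();
        // Whole match: everything from the "h" of "the" to the end of the sentence.
        Console.WriteLine(m.Value);
        // First .*: runs from after that "h" up to the "a" in "overloaded".
        Console.WriteLine(m.Groups[2].Value);
        // Second .*: "ded everywhe"
        Console.WriteLine(m.Groups[4].Value);
    }
}
```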

You probably want pessimistic matching, where the matcher tries to match as little as possible rather than as much as possible. For that you need another modifier to change the behavior of the *, which is done by putting a ? after the *

.*? modifies the * so that rather than consuming the entire input and then working backwards, it works forwards, taking only as many characters as it needs for the rest of the pattern to match. On your text h.*?a.*?re matches just "healthcare", but it would also match "hare".
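A short sketch comparing the two behaviours; LazyDemo and FirstMatch are hypothetical names. One thing worth noticing in the last call: making the quantifier lazy shortens the match, but it does not change where the match starts, so on the longer sentence the match still begins at the "h" of "the".

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Returns the first match of the pattern in the input ("" if nothing matches).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        const string lazy = "h.*?a.*?re";
        const string greedy = "h.*a.*re";

        Console.WriteLine(FirstMatch("asdad healthcare basdasd", lazy));     // healthcare
        Console.WriteLine(FirstMatch("hare", lazy));                         // hare
        Console.WriteLine(FirstMatch("healthcare and healthcare", greedy));  // healthcare and healthcare
        Console.WriteLine(FirstMatch("healthcare and healthcare", lazy));    // healthcare
        // Still starts at the first "h" in the input, the one in "the".
        Console.WriteLine(FirstMatch(
            "the biggest issue in healthcare is that healthcare providers are overloaded everywhere",
            lazy));                                                          // he biggest issue in healthcare
    }
}
```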

To this end you might want to consider not using * at all but instead using something more specific, like:

h.+?al.+?re    // + means "one or more of the thing to the left"
h.{1}al.{4}re    // {n} means exactly n of the thing to the left ("healthcare" has 1 character between h and al, and 4 between al and re)
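A quick sketch of how these stricter quantifiers behave; QuantifierDemo is a name made up for the illustration. Note that fixed counts have to fit the word exactly: "healthcare" has one character between "h" and "al" and four between "al" and "re", so a count of 2 in the first slot does not match it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // + requires at least one character in each gap.
        Console.WriteLine(Matches("healthcare", "h.+?al.+?re"));    // True
        Console.WriteLine(Matches("halre", "h.+?al.+?re"));         // False: nothing between h and al

        // {n} requires exactly n characters in each gap.
        Console.WriteLine(Matches("healthcare", "h.{1}al.{4}re"));  // True
        Console.WriteLine(Matches("healthcare", "h.{2}al.{4}re"));  // False: only one char between h and al
        Console.WriteLine(Matches("heealthcare", "h.{1}al.{4}re")); // False: two chars between h and al
    }
}
```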

But the main takeaway: ditch everything you know about wildcards from DOS etc. if you're getting into learning regex
