Regex word boundaries and distance between matches

I would like to use a regular expression to find every match for a particular keyphrase within some text.

The keyphrase may contain one or more spaces (it is usually a single word, but in some cases it may be multiple words).

I am currently using the following expression where the keyphrase is a single word (containing no spaces):

var regexPattern = string.Format( "\\b({0})\\b", keyphrase );

When the keyphrase is multiple words (contains one or more spaces), I then update the expression to replace each of those spaces with a wildcard:

regexPattern = regexPattern.Replace( " ", ".*" );

There are a couple of scenarios where this does not behave as I need it to.

1) If the keyphrase within the long text that I'm searching is surrounded by an underscore or a digit, it no longer matches. It's fine with hyphens, commas, full stops etc.; in those scenarios it still detects the keyphrase. I also need it to match when the keyphrase is surrounded by underscores or digits.
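The behaviour in point 1 follows from how `\b` is defined: it marks a transition between a word character (`\w`) and a non-word character, and `\w` includes digits and the underscore. A minimal sketch that reproduces the problem is below; the class and method names are hypothetical, and `Regex.Escape` is added here as an assumption so that the keyphrase is treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Returns true when \b(keyphrase)\b finds the keyphrase in the text.
    public static bool MatchesWithWordBoundary(string text, string keyphrase)
    {
        string pattern = string.Format(@"\b({0})\b", Regex.Escape(keyphrase));
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        // A hyphen and a full stop are non-word characters, so \b holds on both sides.
        Console.WriteLine(MatchesWithWordBoundary("bike for-sale.", "sale"));  // True
        // '_' and '1' are word characters, so there is no boundary next to "sale".
        Console.WriteLine(MatchesWithWordBoundary("bike for _sale_", "sale")); // False
        Console.WriteLine(MatchesWithWordBoundary("bike for 1sale1", "sale")); // False
    }
}
```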

2) Where my keyphrase consists of multiple words (contains one or more spaces), I would like to allow up to a certain maximum distance between each of the words that form the keyphrase.

e.g. If my keyphrase is:

for sale

... and the text that I am matching against is

I have a bike for    sale.

... (where the maximum distance between the keyphrase words is 5 characters), I would like the regex to match:

bike for    sale

However, if there were more than 5 characters between the keyphrase words, I would not want it to match.

Also, this 'distance' shouldn't be confined to the number of spaces between the keyphrase words, as I would also like the following to match, for example:

I have a bike for _.,1sale.

Finally, it's probably worth stating that in some cases the keyphrase I'm searching for may appear more than once, and where the above conditions are met, I'd need each occurrence to be matched:

e.g.

I have a bike for _.,1sale. I've also got a laptop for sale!

So, I essentially have 2 additional requirements on top of what I currently have, but I don't know regular expressions well enough to know how to implement them.

I think you can use the following code to address both issues:

var regexPattern = string.Format( "(?<!\\p{{L}}){0}(?!\\p{{L}})", keyphrase );
// or
// var regexPattern = string.Format( "(?<=\\P{{L}}|^){0}(?=\\P{{L}}|$)", keyphrase );
regexPattern = regexPattern.Replace( " ", ".{0,5}" );

The regex will look like

(?<!\p{L})key.{0,5}word(?!\p{L})

or

(?<=\P{L}|^)key.{0,5}word(?=\P{L}|$)

Here is demo 1 / demo 2
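As a rough end-to-end sketch of the approach above, the helper below builds the first pattern from a keyphrase and collects every match. The names `KeyphraseMatcher`, `BuildPattern`, `FindAll` and the `maxGap` parameter are hypothetical. One deliberate difference from the snippet above: each word is passed through `Regex.Escape` and the words are joined with the gap expression, instead of calling `Replace` on the finished pattern, because `Regex.Escape` also escapes spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyphraseMatcher
{
    // Builds (?<!\p{L})word1.{0,maxGap}word2(?!\p{L}) from a space-separated keyphrase.
    public static string BuildPattern(string keyphrase, int maxGap)
    {
        // Escape each word separately so regex metacharacters in the keyphrase stay literal.
        IEnumerable<string> words = keyphrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Regex.Escape(w));
        string gap = ".{0," + maxGap + "}";
        return @"(?<!\p{L})" + string.Join(gap, words) + @"(?!\p{L})";
    }

    // Returns the text of every non-overlapping match, in order of appearance.
    public static List<string> FindAll(string text, string keyphrase, int maxGap)
    {
        var regex = new Regex(BuildPattern(keyphrase, maxGap));
        return regex.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }

    public static void Main()
    {
        string text = "I have a bike for _.,1sale. I've also got a laptop for sale!";
        // Prints: for _.,1sale | for sale
        Console.WriteLine(string.Join(" | ", FindAll(text, "for sale", 5)));
    }
}
```

With a gap of six spaces (`for      sale`) the same call returns no matches, since `.{0,5}` cannot span more than five characters. Note that `.` does not match a newline unless `RegexOptions.Singleline` is set.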

Note that if you also want to match the inner word boundaries the same way, use

regexPattern = regexPattern.Replace( " ", "(?=\\P{L}).{0,5}(?<=\\P{L})" );

The regex will be

(?<!\p{L})key(?=\P{L}).{0,5}(?<=\P{L})word(?!\p{L})

or

(?<=\P{L}|^)key(?=\P{L}).{0,5}(?<=\P{L})word(?=\P{L}|$)

See the demo: this version does not match when the 2 words are glued together (for example `keyword`), because a non-letter is required right after the first word and right before the second.
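To make the difference between the two variants concrete, here is a small comparison using the literal `key` / `word` patterns shown above. The class name and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InnerBoundaryDemo
{
    // Outer boundaries only: anything (or nothing) may sit between the two words.
    public const string Loose = @"(?<!\p{L})key.{0,5}word(?!\p{L})";

    // Outer and inner boundaries: the gap must start and end with a non-letter.
    public const string Strict = @"(?<!\p{L})key(?=\P{L}).{0,5}(?<=\P{L})word(?!\p{L})";

    public static void Main()
    {
        foreach (string input in new[] { "key word", "key-a-word", "keyword", "keyab word" })
        {
            Console.WriteLine("{0,-12} loose={1,-5} strict={2}",
                input, Regex.IsMatch(input, Loose), Regex.IsMatch(input, Strict));
        }
        // key word     loose=True  strict=True
        // key-a-word   loose=True  strict=True
        // keyword      loose=True  strict=False
        // keyab word   loose=True  strict=False
    }
}
```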
