
Can I declare preference over matching terms in a regular expression?

Is there a way to declare preference in a regular expression?
For example, assume I have the following search terms:

cat eats mouse

And I have the following text:

I saw yesterday a big mouse in our house. Why? We have a cat!A cat eats mouse.Right?

I want a regular expression that specifically matches the section A cat eats mouse.
That is, although the terms also appear in other parts of the text, that sentence is a better match, so it is preferred.

But if this part were missing, it would instead match I saw yesterday a big mouse in our house. or We have a cat.

Can this be expressed in a regular expression?

No, regex isn't the right tool for this.

You can use a regex (though a plain substring search might be more appropriate) to find each of the words you're looking for, and then assign weights to the matches outside the regex, based on the number of occurrences of each term, whether all terms appear, the relative order of the terms, and so on.

But your end goal is too fuzzy, not regular enough - you'll need more than just regular expressions.
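As a rough sketch of what "assign weights outside the regex" could look like in Python: the `score` function, the `TERMS` list, and the point values below are all made up for illustration, so tune them to whatever "better match" means for you.

```python
import re

TERMS = ["cat", "eats", "mouse"]  # the search terms from the question


def score(candidate, terms=TERMS):
    """Weight a candidate by terms present, full coverage, and term order."""
    lowered = candidate.lower()
    # First position of each term as a whole word, or -1 if it is absent
    positions = []
    for term in terms:
        m = re.search(r"\b" + re.escape(term) + r"\b", lowered)
        positions.append(m.start() if m else -1)
    found = [p for p in positions if p >= 0]
    points = len(found)              # one point per distinct term present
    if len(found) == len(terms):
        points += 2                  # arbitrary bonus: every term appears
        if found == sorted(found):
            points += 1              # arbitrary bonus: terms are in query order
    return points


print(score("A cat eats mouse."))    # 6
print(score("We have a cat!"))       # 1
```

The regex only locates the words; the ranking is ordinary code around it.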

I'm not sure what kind of pattern you're looking to apply, but note that when you use the vertical bar to write alternatives, most (backtracking) regex engines try them from left to right and take the first one that matches. This means that if you have something like (<pattern1>|<pattern2>) and both of them match at the same position, preference is given to <pattern1>, since that is the one checked first.
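A small demonstration with Python's `re` module (the patterns are just examples built from the question's terms). It also shows the limit of this trick: the preference only applies among alternatives at the same starting position, and an earlier position in the text still wins.

```python
import re

sentence = "A cat eats mouse."

# At one position, alternatives are tried left to right.
long_first = re.search(r"cat eats mouse|cat", sentence).group(0)
short_first = re.search(r"cat|cat eats mouse", sentence).group(0)
print(long_first)   # cat eats mouse
print(short_first)  # cat

# The leftmost position in the text still beats alternative order:
# the first "cat" is found before the engine ever reaches the full phrase.
text = "We have a cat! A cat eats mouse."
leftmost = re.search(r"cat eats mouse|cat", text)
print(leftmost.group(0), leftmost.start())  # cat 10
```

So ordering alternatives expresses a local preference, not "find the best match anywhere in the text".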

Regular expressions are basically for matching words of regular languages. In most programming contexts, parts of the matched word are then extracted and used in the program. However, your matching pattern is context-sensitive (the matcher needs to remember both what came before and what comes next) and is therefore beyond the expressive power of regular expressions.

One approach to your problem is to use a sentence tokenizer to extract sentences and then score each sentence based on the words within it and, eventually, their arrangement. Your problem seems closely related to automated text summarization, so you could look for information on that.
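A minimal sketch of that idea in Python, under some loud assumptions: `split_sentences` is a naive stand-in for a real sentence tokenizer (it just cuts at `.`, `!` and `?`), and `term_hits` scores a sentence only by how many distinct query terms it contains. All function names here are hypothetical.

```python
import re


def split_sentences(text):
    # Naive stand-in for a real tokenizer: cut after ., ! or ?
    # (also works when no space follows the punctuation)
    parts = re.findall(r"[^.!?]+[.!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def term_hits(sentence, terms):
    # Number of distinct query terms that occur as whole words
    words = set(re.findall(r"\w+", sentence.lower()))
    return sum(1 for t in terms if t in words)


def best_sentence(text, terms):
    sentences = split_sentences(text)
    if not sentences:
        return None
    # max() keeps the earliest sentence when scores tie
    return max(sentences, key=lambda s: term_hits(s, terms))


terms = ["cat", "eats", "mouse"]
full = ("I saw yesterday a big mouse in our house. Why? "
        "We have a cat!A cat eats mouse.Right?")
without = "I saw yesterday a big mouse in our house. Why? We have a cat!Right?"

print(best_sentence(full, terms))     # A cat eats mouse.
print(best_sentence(without, terms))  # I saw yesterday a big mouse in our house.
```

With the preferred sentence removed, the two one-term sentences tie and the earlier one is returned, which is the fallback behaviour described in the question. A real solution would likely want a proper tokenizer and a richer score (term order, distance between terms).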


 