
Finding indices of exact-match words in Python

I'm trying to find the indices of a pattern in a sentence. The pattern can be a word or a combination of words. I've used a regular expression for this, but I have some edge cases to handle.

import re

word = "is"
s = "Is (valid) is (valid), is-not (not valid), is. (valid) is!, (valid), is_1 (not valid) ,is (valid), is? (valid)"

iters = re.finditer(r"\b" + re.escape(word) + r"\b", s, re.I)
indices = [m.start(0) for m in iters]
print(indices)

This outputs

[0, 11, 23, 43, 55, 87, 99]

As you can see, some occurrences of "is" next to certain symbols should match and others should not. Here is the list of symbols that are valid next to a match:

[" ", ",", ".", "!", "?"]

How can I exclude the 3rd match (is-not) from the results?

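If you also search for all occurrences of is-not, you can keep only the indices that appear in the first list but not in the second.

iters_is = re.finditer(r"\b" + re.escape(word) + r"\b", s, re.I)
indices_is = [m.start(0) for m in iters_is]

Then run the same search for is-not to get

iters_isnot = re.finditer(r"\b" + re.escape("is-not") + r"\b", s, re.I)
indices_isnot = [m.start(0) for m in iters_isnot]

The final list of real matches:

indices_is = [i for i in indices_is if i not in indices_isnot]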
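Comparing start indices only works when the unwanted phrase begins with the word itself. A minimal sketch of a slightly more general variant follows, assuming the unwanted phrases can be listed up front. The helper name `find_word_outside` and its parameters are hypothetical, not part of the answer above. It discards any match that falls inside the span of an excluded phrase:

```python
import re

def find_word_outside(word, excluded_phrases, text):
    """Return start indices of `word` that are not inside any excluded phrase."""
    # Collect the (start, end) span of every occurrence of every excluded phrase.
    excluded_spans = [
        m.span()
        for phrase in excluded_phrases
        for m in re.finditer(re.escape(phrase), text, re.I)
    ]
    word_re = r"\b" + re.escape(word) + r"\b"
    # Keep only the word matches whose start lies outside all excluded spans.
    return [
        m.start()
        for m in re.finditer(word_re, text, re.I)
        if not any(start <= m.start() < end for start, end in excluded_spans)
    ]

s = "Is (valid) is (valid), is-not (not valid), is. (valid) is!, (valid), is_1 (not valid) ,is (valid), is? (valid)"
indices = find_word_outside("is", ["is-not"], s)
print(indices)  # [0, 11, 43, 55, 87, 99]
```

Because the check is span-based, a phrase such as "not-is" would be filtered out as well, which the start-index comparison would miss.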

If you can clearly define the word-boundary characters that are not allowed (in your example, only the dash character, -), then a simple, regex-only solution could involve negative lookbehind and negative lookahead:

pattern = r"(?<!-)\b" + re.escape(word) + r"\b(?!-)"

The idea behind this regex is to match every instance of the word surrounded by word boundaries (as you were already doing) unless the word is preceded or followed by a dash. You could also look into positive lookbehind and lookahead, i.e. instead of defining the characters that are not allowed, you would define the characters that are allowed to precede or follow the pattern. I mention this because you listed the allowed characters in your question; however, I'm not aware of a solution using this approach that also handles the word appearing at the very beginning or end of the line, because a positive lookbehind/lookahead requires a character to be present there.
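One possible way around the start/end-of-line limitation mentioned above is a double negative: a negative lookaround on a negated character class. "Not preceded by a character outside the allowed set" is also true when no character is there at all. The sketch below assumes the allowed set is exactly the list from the question; the function name `find_with_allowed_neighbors` is hypothetical:

```python
import re

def find_with_allowed_neighbors(word, text, allowed=" ,.!?"):
    """Return start indices of `word` whose neighbors are allowed characters,
    or the start/end of the string."""
    # Negated class: any single character that is NOT an allowed neighbor.
    not_allowed = "[^" + re.escape(allowed) + "]"
    pattern = (
        "(?<!" + not_allowed + ")"   # no disallowed character directly before
        + re.escape(word)
        + "(?!" + not_allowed + ")"  # no disallowed character directly after
    )
    return [m.start() for m in re.finditer(pattern, text, re.I)]

s = "Is (valid) is (valid), is-not (not valid), is. (valid) is!, (valid), is_1 (not valid) ,is (valid), is? (valid)"
result = find_with_allowed_neighbors("is", s)
print(result)  # [0, 11, 43, 55, 87, 99]
```

Note that this is stricter than `\b`: an occurrence such as "(is)" is rejected because parentheses are not in the allowed set.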

Your question is a little ambiguous in that you are specifying some specific characters as boundary characters (rather than any non-word character being a boundary character) and yet you are using the "\b" word boundary assertion in your code (which uses any non-word character as a boundary character). Thus, I cannot be sure if you simply want to adjust "\b" to not consider "-" as a boundary character or if you want to rewrite your regular expression to use exactly the boundary characters specified in your question.

To adjust "\b" to ignore "-" as a boundary character, you can use a negative lookbehind assertion and a negative lookahead assertion (to say basically, "unless the boundary is caused by the dash character") so only one line of your code would change:

    iters = re.finditer(r"(?<!-)\b" + re.escape(word) + r"\b(?!-)", s, re.I)

This change causes the output to become

    [0, 11, 43, 55, 87, 99]

which seems to be what you wanted. Just keep in mind that there are other non-word characters (in addition to the ones you mentioned) that would cause the regular expression to match (in a generalized string, as opposed to the one you supplied).
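To illustrate that caveat, here is a small check on a made-up string (`s2` is an assumption for illustration, not from the question). With the dash-only pattern, a slash or a parenthesis still counts as a boundary:

```python
import re

word = "is"
# Same pattern as above: word boundaries, but not adjacent to a dash.
pattern = r"(?<!-)\b" + re.escape(word) + r"\b(?!-)"

s2 = "is/was (is) is-not"
extra = [m.start() for m in re.finditer(pattern, s2, re.I)]
print(extra)  # [0, 8]
```

Both "is/was" and "(is)" match here, so if those should be rejected, the excluded characters in the lookarounds would need to be extended.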

I am not going to supply a regular expression for handling only the characters you specified. Your example code uses "\b", which suggests you want to keep using it and just have it not treat "-" as a boundary character. That in turn suggests the boundary characters you listed came mostly from your example and were not meant to be exhaustive.
