简体   繁体   中英

Matching unique groups while maintaining their order

Is there a way to match unique groups of characters (words in the case below) in order of occurrence, purely in regex? If so, how does that expression compare in efficiency to a non-regex solution? I'm working with Python's flavor, but I would be interested in a solution for any other flavor, as well.

Here's a sample case:

string = 'the floodwaters are rising along the coast'
unique = ['the', 'floadwaters', 'are', 'rising', 'along', 'coast']

In a Python-regex hybrid solution I could match the groups I want, and use a list comprehension to remove the duplicates while maintaining the order.

groups = re.findall('[a-zA-Z]+', string)
unique = [g for i, g in enumerate(groups) if g not in groups[:i]]

There are similar questions across the site, such as one that addresses matching unique words . The expression from the accepted answer, however, matches the furthest right occurence of a given group, while I want to match the first occurence. Here's that expression:

(\w+\b)(?!.*\1\b)

A regex-only solution for this kind of task is only possible with an infinite-width lookbehind.

However , a regex solution like this should only be considered for use when the input is relatively short: more than 100 words in an input string will make it very slow due to backtracking that is inevitable in this case. Thus, for a mere learning purpose , I will share the regex that is only supported in .NET and Python PyPi regex library (it is also possible to do in Vim as its lookbehind is also infinite-width, but I guess there are even simpler ways with that powerful tool).

(?s)\b(\w+)\b(?<!^.*\b\1\b.*\b\1\b)

See the regex demo

The (?s) part is an inline modifier that makes . match all line breaks. You may use regex.DOTALL in Python regex .

Details

  • \\b - initial word boundary
  • (\\w+) - Group 1: one or more word chars
  • \\b - trailing word boundary
  • (?<!^.*\\b\\1\\b.*\\b\\1\\b) - an infinite width negative lookbehind that fails the match if the word matched into Group 1 happens to appear at least once before itself, ie if, immediately to the left of the current location (that is right after the word captured), there is a sequence of patterns:
    • ^ - start of string
    • .*\\b\\1\\b - any zero or more chars, as many as possible and then the same value as in Group 1 as a whole word
    • .*\\b\\1\\b - same as above (needed to match the captured word, since the lookbehind is used after the consumed word)

The .* in the lookbehind causes lots of backtracking, and the pattern will work rather slow in general, and very slow with large inputs and eventually might cause time outs.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM