
Regex to match all sequences of words

I need a Python regex that will match all (non-empty) sequences of consecutive words in a string, where a word is any non-empty run of non-whitespace characters.

Something that will work like this:

s = "ab cd efg"
re.findall(..., s)
# ['ab', 'cd', 'efg', 'ab cd', 'cd efg', 'ab cd efg']

The closest I got was with the regex module, but it is still not what I want:

regex.findall(r"\b\S.+\b", s, overlapped=True)
# ['ab cd efg', 'cd efg', 'efg']

Also, just to be clear, I don't want to have 'ab efg' in there.

Something like:

matches = "ab cd efg".split()
matches2 = [" ".join(matches[i:j])
            for i in range(len(matches))
            for j in range(i + 1, len(matches) + 1)]
print(matches2)

Outputs:

['ab', 'ab cd', 'ab cd efg', 'cd', 'cd efg', 'efg']
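
The list above is grouped by starting word, whereas the order shown in the question is grouped by length. If that order matters, a minimal sketch is to loop over the sequence length first. The function name word_sequences is only an illustrative choice, not something from the question.

```python
def word_sequences(s):
    """Return all contiguous word sequences of s, shortest first."""
    words = s.split()
    return [" ".join(words[i:i + n])
            for n in range(1, len(words) + 1)      # sequence length
            for i in range(len(words) - n + 1)]    # index of the first word


print(word_sequences("ab cd efg"))
# ['ab', 'cd', 'efg', 'ab cd', 'cd efg', 'ab cd efg']
```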

You can match each word together with its trailing whitespace and then join contiguous slices of those matches. (This is similar to Maxim's approach, though I developed it independently; unlike that approach, it preserves the original whitespace.)

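import regex

s = "ab cd efg"
# each word together with the whitespace that follows it
subs = regex.findall(r"\S+\s*", s)

def combos(parts):
    out = []
    for i in range(len(parts)):
        for j in range(i + 1, len(parts) + 1):
            # join a contiguous slice, then drop its trailing whitespace
            out.append("".join(parts[i:j]).strip())
    return out

print(combos(subs))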


This first finds all matches of \S+\s*, i.e. each word followed by any amount of whitespace, then takes every contiguous slice of those matches, joins it, and strips the trailing whitespace.

If whitespace is always a single space, just use Maxim's approach; it's simpler and faster but doesn't preserve whitespace.
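
To see the difference concretely, here is a rough sketch that uses only the standard re module (an assumption on my part; the answer above uses regex, but nothing here needs regex-specific features). It records the span of each word and slices the original string, so whatever whitespace sits between two words is kept as-is. The name sequences_preserving is made up for illustration.

```python
import re


def sequences_preserving(s):
    """All contiguous word sequences of s, with inner whitespace untouched."""
    # (start, end) offsets of every run of non-whitespace characters
    spans = [m.span() for m in re.finditer(r"\S+", s)]
    return [s[spans[i][0]:spans[j][1]]
            for i in range(len(spans))       # first word of the sequence
            for j in range(i, len(spans))]   # last word of the sequence


s = "ab  cd\tefg"
print(sequences_preserving(s))
# ['ab', 'ab  cd', 'ab  cd\tefg', 'cd', 'cd\tefg', 'efg']
```

With split() and " ".join(...), the second item would come out as 'ab cd', with the double space collapsed to one.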

Without regex:

import itertools
def n_wise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    iterables = itertools.tee(iterable, n)
    for k, it in enumerate(iterables):
        for _ in range(k):
            next(it, None)
    return zip(*iterables)

def foo(s):
    s = s.split()
    for n in range(1, len(s)+1):
        for thing in n_wise(s, n=n):
            yield ' '.join(thing)

s = "ab cd efg hj"
result = list(foo(s))
print(result)

Outputs:

['ab', 'cd', 'efg', 'hj', 'ab cd', 'cd efg', 'efg hj', 'ab cd efg', 'cd efg hj', 'ab cd efg hj']
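
Since s.split() already returns a list, the same sliding windows can probably be built without itertools.tee by zipping shifted slices. This is only a sketch of an alternative, not the code from the answer; n_wise_list and foo_list are hypothetical names.

```python
def n_wise_list(items, n=2):
    """Sliding windows of length n over a list: (x0..x[n-1]), (x1..x[n]), ..."""
    # zip stops at the shortest slice, so incomplete windows are dropped
    return zip(*(items[k:] for k in range(n)))


def foo_list(s):
    words = s.split()
    for n in range(1, len(words) + 1):
        for window in n_wise_list(words, n):
            yield " ".join(window)


print(list(foo_list("ab cd efg hj")))
# ['ab', 'cd', 'efg', 'hj', 'ab cd', 'cd efg', 'efg hj', 'ab cd efg', 'cd efg hj', 'ab cd efg hj']
```

The trade-off is that slicing copies the list n times for each window length, which is fine for short strings but less frugal than tee on long inputs.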
