Python Regex: Matching a phrase regardless of intermediate spaces

Given a phrase, I need to be able to match it in a line even if the words are separated by a different number of spaces there.

Thus, if the phrase is "the quick brown fox" and the line is "the           quick      brown        fox jumped over the lazy dog", the instance of "the quick brown fox" should still be matched.

The method I already tried was to replace all instances of whitespace in the line with a regex pattern for whitespace, but this doesn't always work if the line contains characters that aren't treated as literal by regex.
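One way to avoid the problem with special characters described above is to escape each word with `re.escape` before joining the words with `\s+`. The following is only a minimal sketch: the helper name `phrase_pattern` and the sample line are made up for illustration and are not part of the original question.

```python
import re

def phrase_pattern(phrase):
    """Build a regex that matches `phrase` with any amount of whitespace between words."""
    # Split on whitespace, escape each word so characters such as "$", "." or "("
    # are treated literally, then allow one or more whitespace characters between words.
    words = phrase.split()
    return re.compile(r'\s+'.join(re.escape(word) for word in words))

# Hypothetical line containing regex metacharacters
line = 'costs   $5.00   (approx.) per unit'
pattern = phrase_pattern('$5.00 (approx.)')
match = pattern.search(line)
print(match.group(0))
```

Because the pattern is used with `search`, the phrase can appear anywhere in the line, not only at its start.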

This should work:

import re

pattern = r'the\s+quick\s+brown\s+fox'
text = 'the           quick      brown        fox jumped over the lazy dog'

# re.search finds the phrase anywhere in the line; re.match would only match at the start
match = re.search(pattern, text)
print(match.group(0))

The output is:

the           quick      brown        fox

You can use this regex:

(the\s+quick\s+brown\s+fox)

You can split the given string by white spaces and join them back by a white space, so that you can then compare it to the phrase you're looking for:

s = "the           quick      brown        fox"
' '.join(s.split()) == "the quick brown fox" # returns True

For the general case:

  1. Replace each run of whitespace characters in the line with a single space.
  2. Check whether the phrase is a substring of the line after the replacement.

     import re

     pattern = "your pattern"

     def phrase_in_line(pattern, line):
         # collapse each run of whitespace into a single space
         line_without_spaces = re.sub(r'\s+', ' ', line)
         return pattern in line_without_spaces
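A caveat of the plain substring check is that it can also match inside longer words, for example "brown fox" inside "brown foxes". The sketch below is one possible way around that without a regex: it normalizes both the phrase and the line, then pads both with a space so the phrase only matches on whole, space-delimited words. The function name `contains_whole_phrase` is hypothetical, and the approach assumes words are separated only by whitespace (punctuation attached to a word, such as "fox,", will not match "fox").

```python
def contains_whole_phrase(phrase, line):
    """Return True if `phrase` occurs in `line` as whole words, ignoring spacing."""
    # Collapse every run of whitespace to a single space on both sides.
    phrase_norm = ' '.join(phrase.split())
    line_norm = ' '.join(line.split())
    # Padding with spaces forces the match to start and end at word boundaries.
    return f' {phrase_norm} ' in f' {line_norm} '

print(contains_whole_phrase('brown fox', 'the quick   brown    fox jumped'))  # True
print(contains_whole_phrase('brown fox', 'the quick brown   foxes'))          # False
```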

As you later clarified, you need to match any line and any series of words. To illustrate this, here are some more examples showing what the two similar regexes proposed here do:

text = """the           quick      brown        fox
another line                    with single and multiple            spaces
some     other       instance     with        six                      words"""

Matching whole lines

The first one matches the whole line, iterating over the lines one at a time:

pattern1 = re.compile(r'((?:\w+)(?:\s+|$))+')
for i, line in enumerate(text.split('\n')):
    match = re.match(pattern1, line)
    print(i, match.group(0))

Its output is:

0 the           quick      brown        fox
1 another line                    with single and multiple            spaces
2 some     other       instance     with        six                      words

Matching single words

The second one matches single words and iterates over them one by one while iterating over the lines:

pattern2 = re.compile(r'(\w+)(?:\s+|$)')
for i, line in enumerate(text.split('\n')):
    for m in re.finditer(pattern2, line):
        print(m.group(1))
    print()

Its output is:

the
quick
brown
fox

another
line
with
single
and
multiple
spaces

some
other
instance
with
six
words

