Python regular expression: how to ignore irrelevant matches?

I have a text in which one sentence contains the word "since". My attempt was to use regular expressions to extract the text around the word "since", bounded by the previous and the next period. For example, the text is:

text = "I like to live in a big city. Today is Monday, since yesterday was Sunday."

My regular expression is

rule = re.compile(r'([a-zA-Z0-9\,\.\s\']*)\bsince\b([a-zA-Z0-9\,\.\s\']*)', re.IGNORECASE)
patterns = rule.match(text)

However, patterns.group(1) returns "I like to live in a big city. Today is Monday, ", which contains a sentence I don't want; I only want "Today is Monday, ". How can I do this with regular expressions?

You may use this regex:

[^.]*? since [^.]*?\.

Code:

import re

text = "I like to live in a big city. Today is Monday, since yesterday was Sunday."
print(re.findall(r'[^.]*? since [^.]*?\.', text))

Output:

[' Today is Monday, since yesterday was Sunday.']

RegEx Details:

  • [^.]*? : Match 0 or more characters that are not a dot
  • since : Match the literal text " since " (with a space on each side)
  • [^.]*? : Match 0 or more characters that are not a dot
  • \. : Match a dot
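
If only the part before "since" is wanted, as in the question, the same idea can be combined with capture groups. The following is a minimal sketch, not part of the original answer. It assumes every sentence ends with a period and contains no other dots (no abbreviations or decimals), and the names SINCE_RE and split_on_since are hypothetical:

```python
import re

# Hypothetical helper (not from the original answer): for every sentence that
# contains the word "since", return the text before it and the text after it.
# Assumption: a sentence is any run of non-dot characters ending in a period.
SINCE_RE = re.compile(r'([^.]*?)\bsince\b([^.]*)\.', re.IGNORECASE)

def split_on_since(text):
    return [(before.strip(), after.strip())
            for before, after in SINCE_RE.findall(text)]

text = "I like to live in a big city. Today is Monday, since yesterday was Sunday."
print(split_on_since(text))
# [('Today is Monday,', 'yesterday was Sunday')]
```

Using \bsince\b instead of " since " also lets the pattern match a sentence that starts with "Since", at the cost of the stricter whitespace check.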

Using re.compile: fixing the OP's attempt here.

import re

text = "I like to live in a big city. Today is Monday, since yesterday was Sunday."
rule = re.compile(r'.*?\.\s+([^,]*),\s+since', re.IGNORECASE)
patterns = rule.match(text)
print(patterns.group(1))  # Today is Monday


With the samples you have shown, please try the following. Here we can use the findall function from Python's re library.

import re
text = "I like to live in a big city. Today is Monday, since yesterday was Sunday."
print(re.findall(r'.*?\.\s+([^,]*),\s+since', text))  # ['Today is Monday']

Explanation of regex:

.*?\.\s+([^,]*),\s+since : The non-greedy .*? matches up to the first literal period, and \s+ then matches one or more whitespace characters. The capturing group ([^,]*) collects everything up to the next comma (here, "Today is Monday"). The pattern then requires that comma, one or more whitespace characters, and the string since.
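
The explanation above implies two requirements: a period somewhere before the wanted clause, and a comma directly before "since". A small sketch of what that means in practice, using two made-up inputs that are not from the original question:

```python
import re

pattern = re.compile(r'.*?\.\s+([^,]*),\s+since')

# Made-up input: two sentences with ", since", each preceded by a period.
two_hits = ("I like to live in a big city. Today is Monday, since yesterday "
            "was Sunday. We stayed home, since it rained.")
print(pattern.findall(two_hits))
# ['Today is Monday', 'We stayed home']

# Made-up input: "since" sits in the very first sentence, so there is no
# earlier period for the pattern to anchor on and nothing is found.
first_sentence = "Today is Monday, since yesterday was Sunday."
print(pattern.findall(first_sentence))
# []
```

So the pattern fits the sample text well, but it would need adjusting if the "since" sentence can come first or if the comma is optional.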

You can use the regex (?<=\.).*(?=\bsince\b)

  • (?<=\.) : Positive lookbehind assertion for .
  • .* : Any character any number of times
  • (?=\bsince\b) : Positive lookahead assertion for the word since

Demo:

import re

text = "I like to live in a big city. Today is Monday, since yesterday was Sunday."

m = re.search(r'(?<=\.).*(?=\bsince\b)', text)
if m:
    print(m.group())

Output:

 Today is Monday, 
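
One thing to keep in mind is that .* is greedy, so on a longer text it can run across sentence boundaries up to the last "since". The sketch below is an assumption-laden variation, not part of the original answer: it swaps .* for [^.]* so that a match cannot cross a period, and it uses a made-up text with two "since" sentences:

```python
import re

# Made-up input with two "since" sentences (not from the original question).
text = ("I like to live in a big city. Today is Monday, since yesterday was Sunday. "
        "We stayed home, since it rained.")

# Original pattern: the greedy .* stretches to the last "since" in the text.
greedy = re.search(r'(?<=\.).*(?=\bsince\b)', text)
print(repr(greedy.group()))
# ' Today is Monday, since yesterday was Sunday. We stayed home, '

# Variation: [^.]* cannot cross a period, so each match stays in one sentence.
per_sentence = re.findall(r'(?<=\.)[^.]*(?=\bsince\b)', text)
print([s.strip() for s in per_sentence])
# ['Today is Monday,', 'We stayed home,']
```

As with the other answers, this treats every dot as a sentence end, which will not hold for abbreviations or decimal numbers.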
