
Python regex re.sub: remove everything around a pattern, in both directions, up to a specific boundary

I'm trying to remove everything between two capital letters if the word 'year' appears between them.

Here is what I have:

import re

string = 'Sep 09 2018*57.10*58.05*Sep 08 2018*56.76*54.91*Sep 07 2018*58.14*55.20*Sep 06 2018*55.07*54.66*Sep 06 2018*0.91 higher than last year, blablabla*Sep 05 2018*54.71*53.70'

string = re.sub(r'([A-Z].*year)(.*?)(?=[A-Z])', '*', string)

And, what I expect to get:

string = 'Sep 09 2018*57.10*58.05*Sep 08 2018*56.76*54.91*Sep 07 2018*58.14*55.20*Sep 06 2018*55.07*54.66*Sep 05 2018*54.71*53.70'

So I want to remove everything from the nearest capital letter before 'year' up to the next capital letter after it, which means '*Sep 06 2018*0.91 higher than last year, blablabla'. However, my pattern starts matching at the beginning of the string instead of starting at 'year' and looking backwards for the closest capital letter. The part after 'year' already works.

I'd appreciate it if anybody could help me fix this.

You may use

[A-Z][^A-Z]*year[^A-Z]*(?=[A-Z])


Details

  • [A-Z] - an uppercase letter
  • [^A-Z]* - 0+ chars other than uppercase letters
  • year - a word
  • [^A-Z]* - 0+ chars other than uppercase letters
  • (?=[A-Z]) - immediately to the right of the current location, there should be an uppercase letter.
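
The breakdown above can also be written out with re.VERBOSE, which lets each part of the pattern carry its own comment. This is only a sketch: the names PATTERN and sample are made up for illustration, and sample is a shortened version of the string from the question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# PATTERN and sample are illustrative names, not part of the question.
PATTERN = re.compile(r"""
    [A-Z]       # an uppercase letter starts the chunk
    [^A-Z]*     # 0+ chars other than uppercase letters
    year        # the literal word year
    [^A-Z]*     # 0+ chars other than uppercase letters
    (?=[A-Z])   # the next char must be an uppercase letter (not consumed)
""", re.VERBOSE)

sample = 'Sep 06 2018*55.07*Sep 06 2018*0.91 higher than last year, bla*Sep 05 2018*54.71'

match = PATTERN.search(sample)
print(match.group(0))           # the chunk that would be removed
print(PATTERN.sub('', sample))  # the string with that chunk removed
```

Because [^A-Z]* cannot cross an uppercase letter, a match attempt that starts at the first "Sep" fails before it reaches "year", so the match can only begin at the capital letter closest to "year".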

In Python, use

string = re.sub(r'[A-Z][^A-Z]*year[^A-Z]*(?=[A-Z])', '', string)
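
As a quick check, the snippet below runs both the pattern from the question and the one suggested here against the sample string and compares the result with the expected output. The variable names expected, greedy_result and fixed_result are introduced only for this illustration.

```python
import re

string = ('Sep 09 2018*57.10*58.05*Sep 08 2018*56.76*54.91*'
          'Sep 07 2018*58.14*55.20*Sep 06 2018*55.07*54.66*'
          'Sep 06 2018*0.91 higher than last year, blablabla*'
          'Sep 05 2018*54.71*53.70')

expected = ('Sep 09 2018*57.10*58.05*Sep 08 2018*56.76*54.91*'
            'Sep 07 2018*58.14*55.20*Sep 06 2018*55.07*54.66*'
            'Sep 05 2018*54.71*53.70')

# Pattern from the question: the greedy .* runs from the very first
# capital letter all the way to 'year', so almost everything is removed.
greedy_result = re.sub(r'([A-Z].*year)(.*?)(?=[A-Z])', '*', string)

# Suggested pattern: [^A-Z]* cannot cross a capital letter, so the match
# is confined to the single chunk that contains 'year'.
fixed_result = re.sub(r'[A-Z][^A-Z]*year[^A-Z]*(?=[A-Z])', '', string)

print(greedy_result)
print(fixed_result == expected)
```

Two limits are worth keeping in mind: the lookahead requires an uppercase letter after the chunk, so a chunk containing 'year' at the very end of the string would not be removed, and an uppercase letter inside the text around 'year' would change where the match starts or ends.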
