re.sub in Python does not always substitute the string

When I try to replace one string with another using re.sub, the substitution does not always happen.

import re

sentence = ('<date>2004/12/01</date>T09:38:27+01:00'
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

time_identifier = r'(?<=[\s\.,T])([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)|'\
                  r'(?<=\A)([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'
time = re.search(time_identifier, sentence, flags=re.U|re.I)
if time:
    try:
        sentence = re.sub(time.groups()[0], '<time>%s</time>'%time.groups()[0], sentence, flags=re.U|re.I)
    except TypeError:
        # the first group is None when the second alternative matched
        sentence = re.sub(time.groups()[4], '<time>%s</time>'%time.groups()[4], sentence, flags=re.U|re.I)

For the example above, I expect the output to be

<date>2004/12/01</date>T<time>09:38:27+01:00</time>
Wed, <date>2012/9/05</date> <time>10:55:17 UTC</time> %3C%3C%3C

But re.sub does not replace "09:38:27+01:00" in the original sentence with

"<time>09:38:27+01:00</time>"

Can anyone please clarify the reason for this?

Your expressions are terribly over-complicated. The following is a simplification that matches the exact same patterns:

time_identifier = u'(?:(?<=[\s\.,T])|\A)(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'

Your first time string is not being matched because of the look-ahead assertion (the (?=[\s\.,T]|\Z) part); it limits matches to text that is followed by whitespace, a full stop, a comma, a letter T or the end of the string. Your first time string is followed immediately by Wed in the sentence; there is no whitespace.

The following sentence value does match:

sentence = ('<date>2004/12/01</date>T09:38:27+01:00 '
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

Note the extra space after the timezone.
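
To make the effect of the look-ahead concrete, here is a minimal sketch (Python 3 syntax; the helper name first_time and the variable names are made up for illustration) that runs the simplified pattern against both versions of the sentence. Keep in mind that re.search only ever returns the first match it finds:

```python
import re

# Simplified pattern from the answer above, split over three raw strings.
time_identifier = (r'(?:(?<=[\s\.,T])|\A)'
                   r'(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)'
                   r'(?=[\s\.,T]|\Z)')

no_space = ('<date>2004/12/01</date>T09:38:27+01:00'
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')
with_space = ('<date>2004/12/01</date>T09:38:27+01:00 '
              'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')


def first_time(sentence):
    """Return the first time string the pattern finds, or None."""
    match = re.search(time_identifier, sentence, flags=re.U | re.I)
    return match.group(1) if match else None


first_no_space = first_time(no_space)
first_with_space = first_time(with_space)
print(repr(first_no_space))    # the search skips ahead to the second time
print(repr(first_with_space))  # the search finds the first time
```

Without the space, the look-ahead rejects 09:38:27+01:00 because the next character is W, so the search moves on and returns 10:55:17 UTC instead.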

You have a couple of problems here. First, your very complicated pattern. Second, you can't do something like:

re.sub('09:38:27+01', "<time>09:38:27+01</time>", s)

because the plus sign is a regex quantifier: as a pattern, 7+01 means "one or more 7s followed by 01", so it never matches the literal text 27+01 in s (I'm assuming that your groups contain the proper times) and that part of the string won't be tagged. That answers your question.
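
A small sketch of that second problem, assuming a shortened sample string and Python 3 syntax (the variable names are only for illustration). It compares using the matched text directly as a pattern with escaping it first through re.escape:

```python
import re

s = '<date>2004/12/01</date>T09:38:27+01:00 Wed'
matched = '09:38:27+01:00'

# Used as a raw pattern, "7+01" means "one or more 7s, then 01",
# which never matches the literal text "27+01".
unescaped = re.sub(matched, '<time>%s</time>' % matched, s)

# re.escape backslash-escapes the regex metacharacters, so "+" is literal.
escaped = re.sub(re.escape(matched), '<time>%s</time>' % matched, s)

print(unescaped == s)  # nothing was replaced
print(escaped)
```

The first call leaves s untouched; the second wraps the time in <time> tags.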

The following works with your sample data (although maybe I've over-simplified the initial pattern):

import re

s = ('<date>2004/12/01</date>T09:38:27+01:00'
     'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')
# Group the timezone names so the alternation cannot match a bare "GMT"
p = r'(\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}|\d{2}:\d{2}:\d{2} (?:UTC|GMT|CEST|EDT|IST|BST))'
result = re.findall(p, s)
print(result)
# ['09:38:27+01:00', '10:55:17 UTC']
for t in result:
    # re.escape makes "+" literal when the match is reused as a pattern
    s = re.sub(re.escape(t), '<time>%s</time>' % t, s)
print(s)
# <date>2004/12/01</date>T<time>09:38:27+01:00</time>Wed, <date>2012/9/05</date> <time>10:55:17 UTC</time> %3C%3C%3C
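
The find-then-substitute steps above can also be collapsed into a single re.sub call, which avoids reusing matched text as a pattern altogether. This is only a sketch in Python 3 syntax: time_pattern is a hypothetical pattern that covers just the two formats in the sample data, not the full range of the original expression.

```python
import re

sentence = ('<date>2004/12/01</date>T09:38:27+01:00'
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

# Hypothetical pattern for the sample data only: HH:MM:SS followed by
# either a +HH:MM offset or a space and a named timezone.
time_pattern = re.compile(
    r'\d{2}:\d{2}:\d{2}(?:\+\d{2}:\d{2}| (?:UTC|GMT|CEST|EDT|IST|BST))')

# \g<0> stands for the whole match, so every time string is wrapped in one pass.
tagged = time_pattern.sub(r'<time>\g<0></time>', sentence)
print(tagged)
```

Because the replacement refers back to the match with \g<0>, no escaping step is needed and every occurrence is tagged, not just the first.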

Hope it helps.
