re.sub in Python does not always substitute the string

Question

When I try to replace one string with another using re.sub, the substitution does not always happen.

import re

sentence = ('<date>2004/12/01</date>T09:38:27+01:00'
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

time_identifier = u'(?<=[\s\.,T])([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)|'\
                  u'(?<=\A)([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'
time = re.search(time_identifier, sentence, flags=re.U|re.I)
if time:
    try:
        # groups()[0] is the time found by the first alternative
        sentence = re.sub(time.groups()[0], '<time>%s</time>'%time.groups()[0], sentence, flags=re.U|re.I)
    except:
        # groups()[4] is the time found by the second alternative
        sentence = re.sub(time.groups()[4], '<time>%s</time>'%time.groups()[4], sentence, flags=re.U|re.I)

For the above provided example, I expect the output of the sentences to be

<date>2004/12/01</date>T<time>09:38:27+01:00</time>
Wed, <date>2012/9/05</date> <time>10:55:17 UTC</time> %3C%3C%3C

But re.sub does not replace "09:38:27+01:00" in the original sentence with

"<time>09:38:27+01:00</time>"

Can anyone please clarify the reason for this?

Answer 1

Your expressions are terribly over-complicated. The following is a simplification that matches the exact same patterns:

time_identifier = u'(?:(?<=[\s\.,T])|\A)(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'

Your time strings are not being matched because of the look-ahead assertion (the (?=[\s\.,T]|\Z) part); it limits matches to text that is followed by whitespace, a full stop, a comma, a letter T or the end of the string. Your first time string is followed immediately by Wed in the sentence; there is no whitespace.
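To see the look-ahead at work in isolation, here is a small sketch that runs the simplified pattern against two short test strings. The strings and the variable names followed_by_letter and followed_by_space are made up for illustration; they are not from the question.

```python
import re

# Simplified pattern from this answer, split over three raw strings for readability.
time_identifier = (r'(?:(?<=[\s\.,T])|\A)'
                   r'(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)'
                   r'(?=[\s\.,T]|\Z)')

# Time followed directly by a letter: the look-ahead rejects every candidate match.
followed_by_letter = re.search(time_identifier, 'T09:38:27+01:00Wed', flags=re.U | re.I)
# Same text with a space after the offset: the look-ahead is satisfied.
followed_by_space = re.search(time_identifier, 'T09:38:27+01:00 Wed', flags=re.U | re.I)

print(followed_by_letter)          # None
print(followed_by_space.group(1))  # 09:38:27+01:00
```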

The following sentence value does match:

sentence = ('<date>2004/12/01</date>T09:38:27+01:00 '
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

Note the extra space after the timezone.
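Note also that re.search only returns the first match, so searching and then substituting tags at most one time per sentence. One possible alternative, which is my own sketch and not something the question's code does, is to pass the pattern itself to re.sub and refer to the captured time with a \1 backreference. The name tagged is just for illustration.

```python
import re

time_identifier = (r'(?:(?<=[\s\.,T])|\A)'
                   r'(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)'
                   r'(?=[\s\.,T]|\Z)')

# Sentence with the extra space after the timezone offset.
sentence = ('<date>2004/12/01</date>T09:38:27+01:00 '
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

# \1 in the replacement is the text captured by the first group, so the matched
# text is never reused as a pattern and every match is wrapped in one pass.
tagged = re.sub(time_identifier, r'<time>\1</time>', sentence, flags=re.U | re.I)
print(tagged)
```

Because the matched text is never compiled as a regular expression here, characters such as + in the time need no special handling.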

Answer 2

You have a couple of problems here. First, your very complicated pattern. Second, you can't do something like:

re.sub('09:38:27+01', "<time>09:38:27+01</time>", s)

because the plus sign is a regex quantifier: used as a pattern, '27+01' means "a 2, one or more 7s, then 01", so it never matches the literal + in the string s (I'm assuming that your groups contain the proper times) and that part of the string won't be tagged. That answers your question.
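A quick way to convince yourself, as a minimal sketch: the shortened sample string and the names matched, unescaped and escaped are assumptions for illustration. re.escape is the standard library helper that backslash-escapes regex metacharacters.

```python
import re

matched = '09:38:27+01:00'  # text as it would come back from time.groups()
s = '<date>2004/12/01</date>T09:38:27+01:00 Wed'  # shortened sample string

# Used as a pattern, "7+" means "one or more 7s", so the literal "+" never matches.
unescaped = re.sub(matched, '<time>%s</time>' % matched, s)
# re.escape() turns "+" into "\+" so it is matched literally.
escaped = re.sub(re.escape(matched), '<time>%s</time>' % matched, s)

print(unescaped == s)  # True: nothing was replaced
print(escaped)
```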

The following works with your sample data (although maybe I've over-simplified the initial pattern):

import re

# s is the sentence from the question.
# The timezone names are grouped so that the time prefix applies to all of them, not just UTC.
p = r'((?:\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})|(?:\d{2}:\d{2}:\d{2} (?:UTC|GMT|CEST|EDT|IST|BST)))'
result = re.findall(p, s)
print(result)
# ['09:38:27+01:00', '10:55:17 UTC']
# re.escape() makes the + literal when a match is reused as a pattern
s = re.sub(re.escape(result[0]), "<time>%s</time>" % result[0], s)
s = re.sub(re.escape(result[1]), "<time>%s</time>" % result[1], s)
print(s)
# <date>2004/12/01</date>T<time>09:38:27+01:00</time>Wed, <date>2012/9/05</date> <time>10:55:17 UTC</time> %3C%3C%3C

Hope it helps.

2 answers

solution1
3 ACCEPTED 2012-10-19 15:14:11

solution2
1 2012-10-19 16:39:08
