Difference between re.sub and re.findall

Question

I have strings that look like "Billboard Bill SpA". I want a regular expression that removes SpA, but only if there is a capitalised word before it. The regular expression I use is "[A-Z][a-z]*\s(SpA)". If I use re.sub, both SpA and the capitalised word before it are removed, which is expected.

re.sub(r"[A-Z][a-z]*\s(SpA)", "", "Billboard Bill SpA")
'Billboard '

However, if I use re.findall I get the functionality I need:

re.findall(r"[A-Z][a-z]*\s(SpA)", "Billboard Bill SpA")
['SpA']

I know I can write a lookbehind with "(?<=...)", which doesn't consume the preceding text, but that only works for fixed-length expressions. Does anybody know how I can remove only "SpA" with re.sub, or make it work like re.findall?
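As a small illustration of that limitation (a sketch, not part of the original question): Python's re module refuses to compile a lookbehind whose width can vary, which is what a capitalised word of unknown length would need. The variable name lookbehind_error is only illustrative.

```python
import re

# A lookbehind containing "*" has no fixed width, so re.compile rejects it.
try:
    re.compile(r"(?<=[A-Z][a-z]*\s)SpA")
    lookbehind_error = None
except re.error as exc:
    # Keep the message so it can be inspected or printed.
    lookbehind_error = str(exc)

print(lookbehind_error)
```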

To be clearer, I want a regular expression that removes SpA, but only if there is a capitalised word before it:

re.sub(regular_expression, "", "Billboard Bill SpA") -> Billboard Bill
re.sub(regular_expression, "", "to SpA") -> to SpA

Answer 1

Your re.sub is replacing the entire match, not just the group (SpA). That's why it also removes Bill. findall, on the other hand, returns only the group.
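A minimal sketch of that difference, using a match object to show which part each function works with. The variable names are only illustrative; the pattern and input are the ones from the question.

```python
import re

pattern = r"[A-Z][a-z]*\s(SpA)"
text = "Billboard Bill SpA"

match = re.search(pattern, text)
whole_match = match.group(0)  # the whole match: this is what re.sub replaces
group_one = match.group(1)    # the single group: this is what re.findall returns

found = re.findall(pattern, text)     # list of group 1 for every match
replaced = re.sub(pattern, "", text)  # every whole match replaced by ""

print(repr(whole_match), repr(group_one), found, repr(replaced))
```

When a pattern has exactly one capturing group, re.findall returns the contents of that group rather than the whole match, while re.sub always replaces the whole match.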

In re.sub you can capture the part of the match that you want to keep and refer to it in the replacement string.

re.sub(r"([A-Z][a-z]*\s)SpA", r"\1", "Billboard Bill SpA")
'Billboard Bill '

If you want to delete the space as well, move \s outside of the parentheses.
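A sketch of that variant, checked against both inputs from the question. The helper name strip_spa is hypothetical and only used for illustration.

```python
import re

def strip_spa(name):
    # Group 1 holds only the capitalised word; the space and "SpA" are
    # matched outside the group, so they are dropped by the replacement.
    return re.sub(r"([A-Z][a-z]*)\sSpA", r"\1", name)

with_capitalised_word = strip_spa("Billboard Bill SpA")
without_capitalised_word = strip_spa("to SpA")

print(repr(with_capitalised_word))     # 'Billboard Bill'
print(repr(without_capitalised_word))  # 'to SpA'
```

The second string is returned unchanged because "to" does not start with an uppercase letter, so the pattern never matches.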

Answer 2

Perform the substitution using groups.

>>> re.sub(r"([A-Z][a-z]*\s)(SpA)", r"\1", "Billboard Bill SpA")
'Billboard Bill '

2 answers

solution1
2 ACCEPTED 2014-08-14 08:11:45

solution2
1 2014-08-14 08:11:16
