Python Regex: replace multiple possibilities of substring

Question

I want to remove a figure indicator such as "Fig 1." from the string caption, where caption may be any of:

# each line is one instance of caption
"Figure 1: Path of Reading Materials from the Web to a Student."
"FIGURE 1 - Travel CP-net"
"Figure 1 Interpretation as abduction, the big picture."
"Fig. 1. The feature vector components"
"Fig 1: IMAGACT Log-in Page"
"FIG 1 ; The effect of descriptive and interpretive information, and Inclination o f Fit"
...

I've tried caption = re.sub(r'figure 1: |fig. 1 |figure 1 -', '', caption, flags=re.IGNORECASE), but it looks messy: do I really need to list all the possibilities manually? Is there a more elegant regex to match them all?

Thanks a bunch!

Answer 1

You can use an optional group to match "ure" and an optional character class to match the separator (:, ., ; or -).

If you want to match digits other than 1, use \d+ in place of the 1.

\bfig\.?(?:ure)? 1[^\S\r\n]*[:.;–-]?

\bfig Match fig preceded by a word boundary
\.? Match an optional dot
(?:ure)? Optionally match ure
 1 Match a literal space followed by 1
[^\S\r\n]* Match 0+ occurrences of a whitespace char except newlines
[:.;–-]? Optionally match one of the characters listed in the character class (colon, dot, semicolon, en dash or hyphen)
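
As a rough sketch, the same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. The name FIG_ONE is made up for this example, and the literal space is written as [ ] because verbose mode ignores bare whitespace in the pattern:

```python
import re

# Same pattern as in the answer, spelled out piece by piece.
# FIG_ONE is a hypothetical name chosen for this sketch.
FIG_ONE = re.compile(
    r"""
    \bfig          # "fig" preceded by a word boundary
    \.?            # optional dot
    (?:ure)?       # optional "ure"
    [ ]1           # a literal space, then 1
    [^\S\r\n]*     # zero or more whitespace chars other than newlines
    [:.;–-]?       # optional separator
    """,
    re.IGNORECASE | re.VERBOSE,
)

print(repr(FIG_ONE.sub("", "Fig. 1. The feature vector components")))
```

Note that this form stops right after the separator, so any space that follows it stays in the result.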


Example code to also match the whitespace after the character class:

caption = re.sub(r'\bfig\.?(?:ure)? 1[^\S\r\n]*[:.;–-]?[^\S\r\n]', '', caption, flags=re.IGNORECASE)
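
To check the idea against all the captions from the question, here is a minimal sketch that combines the pattern with the \d+ suggestion. The helper name strip_figure_label and the constant FIG_LABEL are made up for this example, and the trailing whitespace is matched with * rather than requiring exactly one character, which is an assumption of mine so that a caption ending right after the label is still handled:

```python
import re

# Hypothetical helper for illustration; not part of the re module.
# Uses \d+ instead of a fixed 1, and makes the trailing whitespace optional.
FIG_LABEL = re.compile(
    r'\bfig\.?(?:ure)? \d+[^\S\r\n]*[:.;–-]?[^\S\r\n]*',
    flags=re.IGNORECASE,
)

def strip_figure_label(caption: str) -> str:
    """Remove a leading-style figure label such as 'Fig. 1.' from caption."""
    return FIG_LABEL.sub('', caption)

captions = [
    "Figure 1: Path of Reading Materials from the Web to a Student.",
    "FIGURE 1 - Travel CP-net",
    "Figure 1 Interpretation as abduction, the big picture.",
    "Fig. 1. The feature vector components",
    "Fig 1: IMAGACT Log-in Page",
    "FIG 1 ; The effect of descriptive and interpretive information, and Inclination o f Fit",
]

for caption in captions:
    print(strip_figure_label(caption))
```

Because the pattern is not anchored to the start of the string, it would also remove a matching label that appears later in the caption; adding ^ in front is one way to restrict it if that matters.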
