
Python Regex: replace multiple possibilities of substring

I want to remove a figure indicator such as Fig 1. from the string caption, where caption may be:

# each line is one instance of caption
"Figure 1: Path of Reading Materials from the Web to a Student."
"FIGURE 1 - Travel CP-net"
"Figure 1 Interpretation as abduction, the big picture."
"Fig. 1. The feature vector components"
"Fig 1: IMAGACT Log-in Page"
"FIG 1 ; The effect of descriptive and interpretive information, and Inclination o f Fit"
...

I've tried caption = re.sub(r'figure 1: |fig. 1 |figure 1 -', '', caption, flags=re.IGNORECASE), but it looks messy. Do I really need to list all the possibilities manually? Is there a more elegant regex that matches them all?

Thanks a bunch!

You can make the ure part optional and use an optional character class to match the separator ( : , . , ; , – or - ).

To match figure numbers other than 1, use \d+ in place of the 1.

\bfig\.?(?:ure)? 1[^\S\r\n]*[:.;–-]?
  • \bfig Match fig preceded by a word boundary
  • \.? Match an optional dot
  • (?:ure)? Optionally match ure
  •  1 Match a literal space followed by the digit 1 (the token begins with a space)
  • [^\S\r\n]* Match 0+ occurrences of a whitespace char except newlines
  • [:.;–-]? Optionally match any of the listed in the character class
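
As a rough illustration of the breakdown above, the same pattern can be written in verbose mode so each part carries its own comment. This is only a sketch: the names FIG_PREFIX and matched_prefix are made up for the example, and it assumes you want \d+ instead of the literal 1. Note that in verbose mode a bare space is ignored, so the literal space is written as [ ].

```python
import re

# Verbose form of the pattern above, generalised with \d+
# (assumption: any figure number, not just 1).
FIG_PREFIX = re.compile(
    r"""
    \bfig          # "fig" at a word boundary
    \.?            # optional dot, as in "Fig."
    (?:ure)?       # optional "ure", as in "Figure"
    [ ]\d+         # one literal space, then the figure number
    [^\S\r\n]*     # optional spaces or tabs, but no newlines
    [:.;–-]?       # optional separator character
    """,
    re.IGNORECASE | re.VERBOSE,
)


def matched_prefix(caption):
    """Return the text the pattern matches in caption, or None (hypothetical helper)."""
    m = FIG_PREFIX.search(caption)
    return m.group(0) if m else None
```

Calling matched_prefix on a caption shows exactly which characters a later re.sub would remove, which is handy for checking the pattern against new caption styles.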


Example code to also match the whitespace after the character class:

caption = re.sub(r'\bfig\.?(?:ure)? 1[^\S\r\n]*[:.;–-]?[^\S\r\n]', '', caption, flags=re.IGNORECASE)
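
To check the substitution against the captions from the question, something like the following could be used. It is a sketch under a few assumptions that are not in the answer above: the helper name strip_figure_prefix is invented, \d+ replaces the literal 1, the trailing whitespace is matched with * so it is optional, and count=1 limits the replacement to the first match.

```python
import re

# Variant of the pattern above: \d+ for any figure number and
# optional trailing spaces/tabs (both are assumptions of this sketch).
FIG_PREFIX = re.compile(
    r'\bfig\.?(?:ure)? \d+[^\S\r\n]*[:.;–-]?[^\S\r\n]*',
    re.IGNORECASE,
)


def strip_figure_prefix(caption):
    """Remove the first figure indicator found in caption (hypothetical helper)."""
    return FIG_PREFIX.sub('', caption, count=1)


captions = [
    "Figure 1: Path of Reading Materials from the Web to a Student.",
    "FIGURE 1 - Travel CP-net",
    "Figure 1 Interpretation as abduction, the big picture.",
    "Fig. 1. The feature vector components",
    "Fig 1: IMAGACT Log-in Page",
]

for caption in captions:
    print(strip_figure_prefix(caption))
```

The pattern is not anchored to the start of the string, so a caption that mentions another figure later in its text would lose the first mention it finds; prefixing the pattern with ^ may be preferable if the indicator always comes first.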
