
Regex to replace filepaths in a string when there's more than one in Python

I'm having trouble finding a way to match multiple filepaths in a string while maintaining the rest of the string.

EDIT: I forgot to mention that the filepath might contain a dot, so I changed "username" to "user.name".

# filepath always starts with "file:///" and ends with file extension
text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf 
1 of 4 page and I also continue with more text from 
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""

I've found many answers that work for a single filepath, but they fail when there's more than one: the match also swallows the characters in between.

import re
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4}"
print(re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.MULTILINE))

>>>"this is an example text extracted from *IGOTREPLACED* running out of text to write."

I've also tried a "stop at the first whitespace after the pattern" approach, but I couldn't get it to work:

fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4} ([^\s]+)"
>>> 0 matches

Note that {1,255} is a greedy quantifier and will match as many chars as possible. To make it lazy, add ? after it.

However, just using a lazy {1,255}? quantifier won't solve the problem: on its own it stops at the first dot followed by 3 to 4 word chars, which in your text is user.name. You need to define where the match should end. It seems you only want to match these URLs when the extension is immediately followed by whitespace or the end of the string.
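To illustrate why laziness alone is not enough, here is a minimal sketch. The sample string is a shortened, made-up variant of the text in the question, and the variable names are only illustrative.

```python
import re

# Shortened sample in the spirit of the question's text (assumed for illustration).
text = "see file:///c:/users/user.name/temp/anecdote.pdf and more text"

# Lazy quantifier only: the match ends at the first ".<3-4 word chars>" it meets.
lazy_only = re.search(r"file:///.{1,255}?\.\w{3,4}", text, flags=re.S)
print(lazy_only.group())   # file:///c:/users/user.name

# Lazy quantifier plus an end condition: the extension must be followed by
# whitespace or by the end of the string.
bounded = re.search(r"file:///.{1,255}?\.\w{3,4}(?!\S)", text, flags=re.S)
print(bounded.group())     # file:///c:/users/user.name/temp/anecdote.pdf
```

The first pattern cuts the path off at user.name, while the second keeps expanding until the extension is followed by whitespace.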

Hence, use

fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"


The (?!\S) negative lookahead fails the match if the char immediately to the right of the current location is a non-whitespace char. .{1,255}? matches any 1 to 255 chars, as few as possible.

Use in Python as

re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
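Put together, a self-contained run could look like the sketch below. The text is modeled on the sample from the question (retyped here, so whitespace may differ slightly), and re.subn is used instead of re.sub only to also report how many replacements were made.

```python
import re

# Sample text modeled on the question: two file URLs, one containing spaces.
text = (
    "this is an example text extracted from "
    "file:///c:/users/user.name/download/temp/anecdote.pdf\n"
    "1 of 4 page and I also continue with more text from\n"
    "another path file:///c:/windows/system32/now with space in name/file (1232).html "
    "running out of text to write."
)

fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"

# re.subn returns the new string together with the number of replacements.
result, count = re.subn(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
print(count)   # 2
print(result)
```

Both paths are replaced separately and the text between them is left untouched.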

The re.MULTILINE (re.M) flag only redefines the behavior of the ^ and $ anchors, making them match at the start/end of each line rather than of the whole string. Your pattern has no anchors, so the flag has no effect there. The re.S flag allows . to match any char, including line break chars.
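The difference between the two flags can be checked with a small sketch like this one (the sample string and variable names are made up for illustration):

```python
import re

sample = "first line\nsecond line"

# Without re.S, "." does not match a newline, so the pattern cannot span lines.
no_dotall = re.search(r"first.+second", sample)
# With re.S, "." also matches the newline.
with_dotall = re.search(r"first.+second", sample, flags=re.S)
print(no_dotall)                 # None
print(with_dotall is not None)   # True

# re.M only changes ^ and $: ^ now matches at the start of every line.
no_multiline = re.findall(r"^\w+", sample)
with_multiline = re.findall(r"^\w+", sample, flags=re.M)
print(no_multiline)     # ['first']
print(with_multiline)   # ['first', 'second']
```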

Please never use (\w|\W){1,255}?; use .{1,255}? with the re.S flag to match any char, otherwise performance will decrease.
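If you want to check this on your own machine, a rough sketch with timeit is shown below. The test string and repetition counts are arbitrary assumptions, and the timings depend on the Python version and the input, so no specific numbers are claimed here.

```python
import re
import timeit

# Arbitrary test input (assumed): the same short snippet repeated many times.
text = "start file:///c:/temp/some file (1).html end " * 200

alternation = re.compile(r"file:///(\w|\W){1,255}?\.\w{3,4}(?!\S)")
dotall = re.compile(r"file:///.{1,255}?\.\w{3,4}(?!\S)", flags=re.S)

# Both patterns match any char, so the substitution result should be identical.
same_result = alternation.sub("*", text) == dotall.sub("*", text)
print(same_result)

# Time both variants; only the relative difference is of interest.
t_alt = timeit.timeit(lambda: alternation.sub("*", text), number=50)
t_dot = timeit.timeit(lambda: dotall.sub("*", text), number=50)
print(f"(\\w|\\W): {t_alt:.4f}s   dot + re.S: {t_dot:.4f}s")
```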

You can use re.findall to find out how many times the regex matches in the string. Hope this helps.

import re
len(re.findall(pattern, string_to_search))
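As a runnable sketch of the snippet above: pattern and string_to_search are not defined in the answer, so the values below are assumptions chosen for illustration (the pattern is the lookahead-based one from the other answer).

```python
import re

# Hypothetical input and pattern, filled in only to make the snippet runnable.
string_to_search = "open file:///c:/a/b.txt then file:///c:/my docs/report (2).docx today"
pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"

matches = re.findall(pattern, string_to_search, flags=re.S)
print(len(matches))   # 2
print(matches)        # ['file:///c:/a/b.txt', 'file:///c:/my docs/report (2).docx']
```

Keep in mind that re.findall returns the captured groups rather than the whole match when the pattern contains capturing groups, so a pattern such as the (\w|\W) one from the question would give a list of single chars instead of paths.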
