Regex to replace filepaths in a string when there's more than one in Python

I'm having trouble finding a way to match multiple filepaths in a string while preserving the rest of the string.

EDIT: I forgot to add that the filepath might contain a dot, so I changed "username" to "user.name".

# filepath always starts with "file:///" and ends with file extension
text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf 
1 of 4 page and I also continue with more text from 
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""

I've found many answers that work for a single filepath, but they fail when there's more than one: the match runs from the first path to the last and also replaces the characters in between.

import re
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4}"
print(re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.MULTILINE))

>>>"this is an example text extracted from *IGOTREPLACED* running out of text to write."

I've also tried a "stop at the first whitespace after the pattern" approach, but I couldn't get it to work:

fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4} ([^\s]+)"
>>> 0 matches

Note that {1,255} is a greedy quantifier and will match as many chars as possible; to make it lazy, you need to add ? after it.

However, just using a lazy {1,255}? quantifier won't solve the problem. You need to define where the match should end. It seems you only want to match these URLs when the extension is immediately followed by whitespace or the end of the string.
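As an illustration of why laziness alone is not enough, the sketch below runs a lazy-only variant of the question's pattern against the sample text. The variable names `lazy_only` and `found` are made up for this example; the text is the one from the question, rebuilt line by line.

```python
import re

# Sample text from the question.
text = (
    "this is an example text extracted from "
    "file:///c:/users/user.name/download/temp/anecdote.pdf \n"
    "1 of 4 page and I also continue with more text from \n"
    "another path file:///c:/windows/system32/now with space in name/file (1232).html "
    "running out of text to write."
)

# Lazy quantifier, but nothing that says where the path must end.
lazy_only = r"file:///.{1,255}?\.\w{3,4}"

found = re.findall(lazy_only, text, flags=re.S)
print(found[0])  # file:///c:/users/user.name
```

The first match stops at the earliest dot followed by three or four word characters, which here is the dot inside `user.name`, so the rest of the first path is left behind.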

Hence, use

fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"


The (?!\S) negative lookahead fails any match if there is a non-whitespace char immediately to the right of the current location. .{1,255}? matches any 1 to 255 chars, as few as possible.

Use in Python as

re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
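For reference, here is a self-contained sketch that applies the pattern above to the sample text from the question. The variable name `result` is introduced only for this example.

```python
import re

# Sample text from the question.
text = (
    "this is an example text extracted from "
    "file:///c:/users/user.name/download/temp/anecdote.pdf \n"
    "1 of 4 page and I also continue with more text from \n"
    "another path file:///c:/windows/system32/now with space in name/file (1232).html "
    "running out of text to write."
)

# Lazy quantifier plus a lookahead requiring whitespace or end of string
# right after the extension.
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"

result = re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
print(result)
print(result.count("*IGOTREPLACED*"))  # 2
```

Both paths are replaced separately, and the text between them ("1 of 4 page and I also continue ...") is kept.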

The re.MULTILINE (re.M) flag only redefines the behavior of the ^ and $ anchors, making them match at the start/end of each line rather than of the whole string. The re.S flag allows . to match any char, including line break chars.
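The difference between the two flags can be checked on a small two-line string. This is only a sketch; `two_lines` and the result names are invented for the example.

```python
import re

two_lines = "first line\nsecond line"

# Without re.M, ^ anchors only at the start of the whole string.
starts_default = re.findall(r"^\w+", two_lines)
# With re.M, ^ also anchors right after each line break.
starts_multiline = re.findall(r"^\w+", two_lines, flags=re.M)
print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']

# Without re.S, . does not match the line break, so there is no match.
dot_default = re.search(r"line.second", two_lines)
# With re.S, . matches the line break as well.
dot_all = re.search(r"line.second", two_lines, flags=re.S)
print(dot_default)             # None
print(repr(dot_all.group(0)))  # 'line\nsecond'
```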

Please never use (\w|\W){1,255}? to match any char; use .{1,255}? with the re.S flag instead, otherwise performance will decrease.
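A rough way to check this on your own input is sketched below. The `sample` string and the variable names are assumptions made for the example, and no timing figures are claimed here, since they depend on the machine and Python version.

```python
import re
import timeit

# Hypothetical sample input with two paths.
sample = "see file:///c:/tmp/a b/report.pdf and file:///d:/x.y/notes.html here"

# Alternation form from the question (made lazy) versus . with re.S.
alternation = re.compile(r"file:///(\w|\W){1,255}?\.\w{3,4}(?!\S)")
dot_all = re.compile(r"file:///.{1,255}?\.\w{3,4}(?!\S)", flags=re.S)

# Both patterns should find exactly the same spans of text.
alt_matches = [m.group(0) for m in alternation.finditer(sample)]
dot_matches = [m.group(0) for m in dot_all.finditer(sample)]
print(alt_matches == dot_matches)  # True

# Time repeated substitutions with each pattern and compare the numbers.
print(timeit.timeit(lambda: alternation.sub("*", sample), number=2000))
print(timeit.timeit(lambda: dot_all.sub("*", sample), number=2000))
```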

You can try re.findall to find out how many times the regex matches in the string. Hope this helps.

import re
# pattern and string_to_search are placeholders for your own regex and input text
len(re.findall(pattern, string_to_search))
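A runnable version of that count might look like the sketch below. The `text` here is a shortened, hypothetical input. One caveat: when the pattern contains a capturing group, as in the question's `(\w|\W)` form, `re.findall` returns the group's text rather than the whole match. The count is still right, but the list items are not the paths.

```python
import re

# Hypothetical short input with two paths.
text = "a file:///c:/one.pdf b file:///c:/two three.html c"

# No capturing group: findall returns the full matches.
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"
matches = re.findall(fp_pattern, text, flags=re.S)
print(len(matches))  # 2
print(matches)       # ['file:///c:/one.pdf', 'file:///c:/two three.html']

# With a capturing group, findall returns what the group captured instead.
grouped = re.findall(r"file:///(\w|\W){1,255}?\.\w{3,4}(?!\S)", text)
print(len(grouped))  # 2
```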
