How to get the rightmost match with a regular expression?
I think this is a common problem, but I didn't find a satisfactory answer elsewhere.
Suppose I extract some links from a website. The links look like the following:
http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html
I want to use a regular expression to convert them to their real links:
http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html
However, I can't do that because regular expressions are greedy by default. 'http://.*$' matches the whole string. Then I tried 'http://.*?$', but it didn't work either. Nor did re.findall. So is there any other way to do this?
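To make the failure concrete, here is a minimal sketch of the three attempts. The engine always reports the leftmost position where a match is possible, and laziness only shortens a match from the right; the trailing `$` then forces `.*?` to expand to the end of the string anyway. The variable names are only for illustration.

```python
import re

link = "http://example.com/goto/http://example1.com/123.html"

# Greedy: starts at the first "http://" and runs to the end of the string.
greedy = re.search(r"http://.*$", link).group()

# Lazy: still starts at the first "http://"; "$" forces ".*?" to grow to the end.
lazy = re.search(r"http://.*?$", link).group()

# findall scans left to right without overlapping, so it also yields one match.
found = re.findall(r"http://.*?$", link)

print(greedy == link, lazy == link, found == [link])  # True True True
```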
Yes, I can do it with str.split or str.index. But I'm still curious whether there is a regex solution for this.
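For reference, the non-regex route mentioned above might look like the sketch below. It uses `str.rfind` rather than `str.index`, since `rfind` searches from the right and returns -1 instead of raising when nothing is found. The helper name `real_link` and its `scheme` parameter are hypothetical, chosen for this example.

```python
def real_link(link, scheme="http://"):
    """Return the part of `link` that starts at the last occurrence of `scheme`."""
    pos = link.rfind(scheme)
    # rfind returns -1 when the scheme is absent; leave such input unchanged.
    return link if pos == -1 else link[pos:]


links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]
print([real_link(link) for link in links])
```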
You don't need a regex. You can use str.split() to split each link on //, then pick the last part and prepend http:// to it:
>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://'+link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
With a regex, you just need to replace everything between the first // and the last // with an empty string. Since one // must be kept, use a positive look-behind so that the first one is not consumed:
>>> import re
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
Use this pattern:
^(.*?[^/])(?=\\/[^/]).*?([^/]+)$
and replace with $1/$2
After reading the comment below, use this pattern to capture what you want:
(http://(?:[^h]|h(?!ttp:))*)$
or this pattern:
(http://(?:(?!http:).)*)$
or this pattern, which matches the redirect prefix instead, so that it can be replaced with an empty string:
http://.*?(?=http://)
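A minimal sketch of how these three patterns could be applied from Python, assuming the sample links from the question. The first two are used with `search` and their capture group; the third matches the leading redirect part rather than the real link, so it is used with `sub` to delete that part. The variable names are illustrative only.

```python
import re

links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

# An "http://" after which no further "http:" occurs before the end of the string.
explicit = re.compile(r"(http://(?:[^h]|h(?!ttp:))*)$")
tempered = re.compile(r"(http://(?:(?!http:).)*)$")

# Lazily matches from an "http://" up to the next one; that prefix is removed.
prefix = re.compile(r"http://.*?(?=http://)")

by_explicit = [explicit.search(link).group(1) for link in links]
by_tempered = [tempered.search(link).group(1) for link in links]
by_prefix = [prefix.sub("", link) for link in links]

print(by_explicit == by_tempered == by_prefix)
print(by_tempered)
```

The `search` call works here because the engine tries each start position from left to right, and only the last `http://` lets the rest of the pattern reach `$`.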