How to get the rightmost match with a regular expression?

Question

I think this is a common problem, but I haven't found a satisfactory answer elsewhere.

Suppose I extract some links from a website. The links look like this:

http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html

I want to use a regular expression to convert them to their real links:

http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html

However, I can't do that because of the greedy nature of regular expressions. 'http://.*$' matches the whole string. Then I tried 'http://.*?$', but it didn't work either. Neither did re.findall. Is there any other way to do this?

Yes, I can do it with str.split or str.index, but I'm still curious whether there is a regex solution.
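
For reference, a minimal sketch of why both attempted patterns return the whole string: a regex search reports the leftmost position where a match can start, and greedy versus lazy quantifiers only change where that match ends. Because both patterns are anchored with `$`, the lazy version is forced to extend to the end as well. The variable names below are illustrative only.

```python
import re

link = "http://example.com/goto/http://example1.com/123.html"

# Both searches start at index 0, the leftmost place "http://" occurs.
greedy = re.search(r"http://.*$", link)
lazy = re.search(r"http://.*?$", link)

# Laziness only shortens the end of a match; '$' forces it to the end anyway.
print(greedy.start(), lazy.start())
print(greedy.group() == lazy.group() == link)

# findall scans left to right too, so it yields the same single match.
found = re.findall(r"http://.*?$", link)
print(found == [link])
```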

Answer 1

You don't need a regex. You can use str.split() to split each link on //, take the last part, and prepend http:// to it:

>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://'+link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']

With a regex, you just need to replace everything between the first and the last // with an empty string. Because one // has to be kept in the result, use a positive look-behind for the first one:

>>> import re
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
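
To see what the look-behind substitution actually deletes, here is a small sketch that prints the matched span before removing it. It uses the same pattern as above; the variable names are made up for illustration. Note that this relies on the greedy `.*` running up to the last `//`, and that a link without a second `//` is simply left unchanged because the pattern finds no match.

```python
import re

pattern = re.compile(r"(?<=//)(.*)//")

wrapped = "http://example.com/yyy/goto/http://example2.com/789.html"
plain = "http://example3.com/xxx.html"

# The match begins right after the first "//" and, because '.*' is greedy,
# runs through the LAST "//" in the string.
m = pattern.search(wrapped)
deleted = m.group()
print(deleted)

# Removing that span leaves the first "http://" joined to the real host.
cleaned = pattern.sub("", wrapped)
print(cleaned)

# A link with only one "//" has nothing to delete.
no_match = pattern.search(plain)
print(no_match is None, pattern.sub("", plain) == plain)
```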

Answer 2

~~Use this pattern:~~

 ^(.*?[^/])(?=\\/[^/]).*?([^/]+)$

~~and replace with $1/$2.~~

After reading the comment below, use this pattern to capture what you want:

(http://(?:[^h]|h(?!ttp:))*)$

Or this pattern:

(http://(?:(?!http:).)*)$

Or this pattern:

http://.*?(?=http://)

and replace the match with an empty string.
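
As a rough sketch, the last two patterns could be applied in Python like this. The helper names `by_capture` and `by_strip` are hypothetical, and the sample links are the ones from the question. The first helper captures the final URL: each character after `http://` is consumed only if `http:` does not start at that position, so the match cannot run across a nested URL and must be the one that ends the string. The second helper instead deletes every wrapper prefix that is followed by another `http://`.

```python
import re

links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

# Capture approach: a URL that contains no further "http:" and ends the string.
LAST_URL = re.compile(r"(http://(?:(?!http:).)*)$")

# Strip approach: a lazy prefix that is immediately followed by "http://".
WRAPPER = re.compile(r"http://.*?(?=http://)")


def by_capture(link):
    """Return the last URL in link, or link unchanged if nothing matches."""
    m = LAST_URL.search(link)
    return m.group(1) if m else link


def by_strip(link):
    """Remove every wrapper prefix that precedes another 'http://'."""
    return WRAPPER.sub("", link)


captured = [by_capture(link) for link in links]
stripped = [by_strip(link) for link in links]
print(captured)
print(captured == stripped)
```

Both helpers assume plain `http://` links, one per string; links using `https://` or other schemes would need an adjusted pattern.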
