
How to get the rightmost match with a regular expression?

I think this is a common problem, but I didn't find a satisfactory answer elsewhere.

Suppose I extract some links from a website. The links look like the following:

http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html

I want to use a regular expression to convert them to their real links:

http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html

However, I can't do that because of the greedy nature of regular expressions. 'http://.*$' just matches the whole line. Then I tried 'http://.*?$', but it didn't work either. Nor did re.findall. So is there any other way to do this?


Yes, I can do it with str.split or str.index. But I'm still curious whether there is a regex solution for this.

You don't need a regex: you can use str.split() to split each link on //, then pick the last part and prepend http:// to it:

>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://'+link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
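The question also mentions str.index. As a rough sketch of that non-regex route, str.rpartition splits on the last occurrence of a separator, so it does not depend on where other // sequences appear. The helper name real_link and its scheme parameter are made up for illustration and are not part of the original answer.

```python
def real_link(link, scheme="http://"):
    """Return the part of `link` that starts at the last `scheme` (hypothetical helper)."""
    # rpartition splits on the LAST occurrence of `scheme`.
    # If `scheme` is absent it returns ('', '', link), so sep + tail is the link itself.
    _head, sep, tail = link.rpartition(scheme)
    return sep + tail


links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

real_links = [real_link(link) for link in links]
print(real_links)
```

Like the split version, this assumes every wrapped link uses the plain http:// scheme.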

With a regex, you just need to replace everything from the first // up to and including the last // with an empty string. Since the first // has to stay, match it with a positive look-behind:

>>> import re
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
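To see why the attempts in the question fail, it may help to compare them with the look-behind substitution on one wrapped link. A regex search returns the leftmost match, and a lazy quantifier only changes where the match ends. Because $ still forces the match to run to the end of the string, 'http://.*?$' matches the same text as the greedy version. The variable names below are illustrative only.

```python
import re

link = "http://example.com/goto/http://example1.com/123.html"

# Lazy quantifier: the match still begins at the FIRST "http://",
# and "$" forces it to extend to the end of the string.
lazy_match = re.search(r"http://.*?$", link).group()

# findall scans left to right for non-overlapping matches,
# so the first match already consumes the whole link.
all_matches = re.findall(r"http://.*?$", link)

# Look-behind substitution from the answer: drop everything between
# the first "//" and the last "//" (the last "//" included).
fixed = re.sub(r"(?<=//)(.*)//", "", link)

print(lazy_match == link)
print(all_matches == [link])
print(fixed)
```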

Use this pattern:

 ^(.*?[^/])(?=\\/[^/]).*?([^/]+)$ 

and replace with $1/$2.


After reading the comment below, use this pattern to capture what you want:

(http://(?:[^h]|h(?!ttp:))*)$



Or this pattern:

(http://(?:(?!http:).)*)$  
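As a hedged sketch, the two capture patterns above can be tried in Python with re.search. Both only allow a match that starts at an "http://" with no later "http:" before the end of the string. The helper last_url and the constant names are assumptions for illustration.

```python
import re

# Pattern 1: any character except "h", or an "h" that does not start "http:".
H_PATTERN = re.compile(r"(http://(?:[^h]|h(?!ttp:))*)$")
# Pattern 2: any character, as long as "http:" does not start at that position.
TEMPERED = re.compile(r"(http://(?:(?!http:).)*)$")

links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]


def last_url(pattern, link):
    """Return the captured trailing URL, or the link unchanged if nothing matches."""
    match = pattern.search(link)
    return match.group(1) if match else link


by_h = [last_url(H_PATTERN, link) for link in links]
by_tempered = [last_url(TEMPERED, link) for link in links]
print(by_h)
print(by_tempered)
```

The second form is usually easier to adapt, since the excluded text appears in one place only.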



Or this pattern:

http://.*?(?=http://)  

and replace with nothing.
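In Python, "replace with nothing" would presumably be a re.sub call with an empty replacement string. The sketch below assumes that reading; the function name strip_wrappers is hypothetical.

```python
import re

# Lazily match from an "http://" up to (but not including) the next "http://".
WRAPPER = re.compile(r"http://.*?(?=http://)")


def strip_wrappers(link):
    """Delete every redirect prefix that is followed by another "http://"."""
    return WRAPPER.sub("", link)


links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

cleaned = [strip_wrappers(link) for link in links]
print(cleaned)
```

Because re.sub replaces every non-overlapping match, a link wrapped more than once is reduced to its innermost URL as well. A link with a single http:// is left unchanged.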
