Regex to extract all URLs from a string

Question

I have a string like this:

http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/

I would like to extract all URLs / web addresses into an array. For example:

urls = ['http://example.com/path/topage.html','http://twitter.com/p/xyan',.....]

Here is my approach, which didn't work:

import re
strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strings)

print(links)
# the result is always a single match identical to strings

Answer 1

The problem is that your regex pattern is too inclusive: a single match runs across all the URLs. You can stop each match where the next URL begins by using a lookahead, (?=...).

Try this:

re.findall(r"((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))", strings)
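Because the pattern above contains capturing groups, re.findall returns one tuple per match rather than the URL itself. Below is a minimal sketch of the same lookahead idea with non-capturing groups, assuming Python 3 and the sample string from the question. The name URL_START is illustrative and not part of the original answer.

```python
import re

# Sample string from the question, split across lines for readability.
strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# Hypothetical helper name: the prefixes that mark the start of a URL.
URL_START = r"www\.|https?://"

# Non-capturing groups make findall return whole matches, not tuples.
# The lazy .*? stops as soon as the lookahead sees the next URL start
# or the end of the string.
pattern = rf"(?:{URL_START})(?:www\.)*.*?(?=(?:{URL_START})|$)"
urls = re.findall(pattern, strings)

for url in urls:
    print(url)
```

On this input the sketch yields five separate addresses, including the bare www.google.com one. It relies on the URLs containing no line breaks, since . does not match a newline by default.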

Answer 2

Your problem is that http:// is being accepted as a valid part of a URL. This is because of this token right here:

[$-_@.&+]

or, more specifically:

$-_

Inside a character class this is a range: it matches every character from $ to _, which includes many more characters than you probably intended (among them : and /).
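To see what the range actually covers, here is a small sketch, assuming Python 3 and ASCII ordering. The variable name covered is illustrative.

```python
import re

# Every character in the range $-_ as it is read inside a character class.
covered = [chr(code) for code in range(ord("$"), ord("_") + 1)]
print("".join(covered))

# ':' and '/' both fall inside the range, so "://" is matched.
print(re.fullmatch(r"[$-_]+", "://") is not None)

# With the hyphen escaped, the class holds only '$', '-' and '_'.
print(re.fullmatch(r"[$\-_]+", "://") is not None)
```

The range also contains all digits and uppercase letters, which is why the original pattern never stops at the start of the next URL.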

You can change this to [$\-_@.&+], but then / characters no longer match, so add it back explicitly: [$\-_@.&+/]. However, this still causes a problem, because http://example.com/path/topage.htmlhttp would be considered a valid match.

The final step is to add a negative lookahead to ensure that you are not matching http:// or https://, which just so happens to be the first part of your regex!

http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
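For reference, a short sketch of running this pattern from Python, assuming Python 3 and the sample string from the question. Note that the lookahead only guards against http:// and https://, so a bare www. address that directly follows another URL stays attached to it.

```python
import re

# Sample string from the question, split across lines for readability.
strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# The pattern from this answer, written as a raw string.
pattern = (r"http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&+/]"
           r"|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")

links = re.findall(pattern, strings)
for link in links:
    print(link)
```

On this input the sketch returns four matches; the third is http://httpget.org/get.zipwww.google.com/privacy.html because nothing in the pattern marks www. as the start of a new URL.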

Tested here.

Answer 3

A simple answer that avoids most of the complication:

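import re
url_list = []

# l is the input string from the question
for x in re.split("http://", l):
    # split each piece again on the https:// scheme
    url_list.append(re.split("https://", x))

# flatten the list of lists and drop empty pieces
url_list = [item for sublist in url_list for item in sublist if item]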

If you want to put the http:// and https:// prefixes back onto the URLs, change the code accordingly. I hope this conveys the idea.
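One way to keep the prefixes, sketched here under the assumption of Python 3.7 or later (where re.split accepts patterns that match an empty string), is to split on a zero-width lookahead instead of on the scheme itself. The function name split_urls is hypothetical.

```python
import re

def split_urls(text):
    """Split text before each http:// or https://, keeping the scheme."""
    # The lookahead is zero-width, so nothing is consumed by the split
    # and each piece still begins with its scheme.
    pieces = re.split(r"(?=https?://)", text)
    # Drop the empty piece produced when the text begins with a scheme.
    return [piece for piece in pieces if piece]

sample = ("http://example.com/path/topage.html"
          "http://twitter.com/p/xyanhs"
          "https://goodurl.net/")

for url in split_urls(sample):
    print(url)
```

Like the loop above, this only separates on the two schemes, so addresses that start with www. are not split off.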

Answer 4

Here is mine:

(r'http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+')
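A brief sketch of how this pattern behaves, assuming Python 3. The sample text is made up for illustration. The pattern expects exactly three letters before the first dot (typically www) and does not include the path.

```python
import re

pattern = r"http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+"

# Hypothetical sample text, not taken from the question.
text = "see http://www.example.com/page and https://goodurl.net/"

found = re.findall(pattern, text)
print(found)
```

Here only http://www.example.com is found: the path is cut off, and https://goodurl.net/ is skipped because its host does not start with a three-letter label followed by a dot.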

Answer 1
2, accepted, 2016-08-02 21:39:14

Answer 2
1, 2016-08-02 21:46:35

Answer 3
0, 2016-08-02 22:02:54

Answer 4
0, 2017-05-04 05:00:23