Regex to extract all URLs from a string

Question

I have a string like this:

http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/

I want to extract all the URLs / web addresses into an array. For example:

urls = ['http://example.com/path/topage.html','http://twitter.com/p/xyan',.....]

Here is my attempt, which does not work.

import re
strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strings)

print(links)
# result is always the same as strings

Answer 1

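The problem is that your regex pattern is too inclusive. It swallows all the URLs in a single match. You can use a lookahead, (?=...), to stop each match where the next URL begins.

Try this:

re.findall(r"((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))", strings)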
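Because the pattern above contains capturing groups, re.findall returns a tuple of groups for each match rather than a plain string. A minimal sketch of the same lookahead idea with non-capturing groups, assuming Python 3; the names URL_START, pattern and urls are illustrative and not part of the original answer:

```python
import re

strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# Anything that can start a new URL in this input (illustrative name).
URL_START = r"(?:www\.|https?://)"

# Match lazily from one URL start up to, but not including,
# the next URL start or the end of the string.
pattern = URL_START + r"(?:www\.)*.*?(?=" + URL_START + r"|$)"

urls = re.findall(pattern, strings)
for url in urls:
    print(url)
```

On the string from the question this should yield five separate entries, including the www.google.com one that has no scheme.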

Answer 2

Your problem is that http:// is accepted as a valid part of a URL. This is because of this token:

[$-_@.&+]

Or, more specifically:

$-_

This matches every character in the range from $ to _, which includes far more characters than you probably intended.
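To see how wide that range is, here is a small sketch (assuming Python 3; the name in_range is illustrative) that lists the printable ASCII characters matched by [$-_]:

```python
import re

# Every printable ASCII character matched by the class [$-_].
in_range = [chr(c) for c in range(32, 127) if re.fullmatch(r"[$-_]", chr(c))]
print("".join(in_range))

# ':' and '/' both fall inside the range, so "http://" can be
# consumed in the middle of a match.
print(":" in in_range, "/" in in_range)

# Lowercase letters fall outside it.
print("a" in in_range)
```

The range runs from code point 36 ($) to 95 (_), so it also covers the digits, the uppercase letters and most punctuation.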

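You could change it to [$\-_@.&+], but this causes a problem because the / character then no longer matches. So you can add it back with [$\-_@.&+/]. However, that causes a problem again, because http://example.com/path/topage.htmlhttp would be treated as a valid match.

The final addition is a negative lookahead to make sure you are not matching http:// or https://, which happens to be exactly the first part of your regex!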

http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
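A short sketch of running this pattern on the string from the question, assuming Python 3; the variable names are illustrative:

```python
import re

strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# The pattern from this answer, split over two lines for readability.
pattern = (r"http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&+/]"
           r"|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")

links = re.findall(pattern, strings)
for link in links:
    print(link)
```

Note that this pattern only starts a match at http:// or https://, so the scheme-less www.google.com/privacy.html stays attached to the preceding http://httpget.org/get.zip entry.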


Answer 3

A simple answer that avoids too much complexity:

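import re

# Same input string as in the question.
strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
url_list = []

# Split on "http://" first, then split each piece on "https://".
for x in re.split("http://", strings):
    url_list.append(re.split("https://", x))

# Flatten the nested lists and drop the empty pieces left by the split.
url_list = [item for sublist in url_list for item in sublist if item]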

If you want to put the http:// and https:// prefixes back onto the URLs, adjust the code accordingly. I hope this conveys the idea.
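One possible way to keep the prefixes, sketched here under the assumption of Python 3, is to split on a capturing group so that re.split returns the delimiters as well. This is a variation on the answer's idea rather than its original code, and the names parts and urls_with_scheme are illustrative:

```python
import re

strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# A capturing group makes re.split keep the delimiters in its result:
# ['', 'http://', 'example.com/path/topage.html', 'http://', ...]
parts = re.split(r"(https?://)", strings)

# Pair each scheme (odd index) with the text that follows it (even index).
urls_with_scheme = [scheme + rest
                    for scheme, rest in zip(parts[1::2], parts[2::2])]
print(urls_with_scheme)
```

As with the plain split, the www.google.com address is not separated, because only http:// and https:// are used as delimiters.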

Answer 4

Here is mine:

(r'http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+')
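A quick sketch of what this pattern accepts, assuming Python 3; the sample text and variable names are made up for illustration. It requires exactly three letters followed by a dot right after the scheme (for example www.), and it stops at the end of the host name:

```python
import re

pattern = r"http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+"

# Hypothetical sample text, not taken from the question.
sample = "see http://www.google.com/privacy.html and http://example.com/path"

hosts = re.findall(pattern, sample)
print(hosts)
```

Only the www address matches here, and without its path. None of the URLs in the question's string has a three-letter label after the scheme, so on that input the pattern would find nothing.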
