How do I extract img src URLs of a specific format with a regular expression?

Question

My string:

Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|

I want to put these 3 links into a list:

http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw
http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0
http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8

They follow this pattern:

src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"

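I know I should use re.findall(pattern, string) for this.

But the big question is: how do I build a pattern that works here?

I'm not very good at writing regex patterns and always get confused. The one that comes closest to doing the job is: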

pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

But all I get is this list:

[u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/']

It looks like the problem is the ~r part and everything after it.

Answer 1

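Where does this data come from? I would suggest using an HTML parser rather than trying to extract the URLs with a regular expression. If the data comes from HTML, you can pull the full attribute value straight out of the tags.

Below I put your string into test.html (the example uses Python with BeautifulSoup):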

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open(r'A:\test.html'), 'html.parser')
>>> [x['src'] for x in soup.find_all('img')]
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']
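If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not part of the original answer: the class name ImgSrcCollector and the shortened sample string are made up for the example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Shortened sample in the same shape as the string from the question.
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

collector = ImgSrcCollector()
collector.feed(text)
collector.close()
print(collector.sources)
```

Because the parser reads the attribute value itself, the ~ characters and the surrounding quotes need no special handling.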

Answer 2

Your regular expression is missing the ~ character:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

By the way: this is a great way to test regular expressions in Python!
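To see why the missing ~ matters, it may help to run both patterns side by side. In the class [$-_@.&+], the range $-_ covers code points 0x24 through 0x5F, which includes / and - but not ~ (0x7E), so the original pattern stops at the first ~. The sketch below uses a shortened sample string (an assumption for brevity) rather than the full text from the question.

```python
import re

# Shortened sample in the same shape as the string from the question.
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# Pattern from the question: no ~ in any character class.
original = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Pattern from this answer: ~ added to the third character class.
fixed = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

original_matches = re.findall(original, text)
fixed_matches = re.findall(fixed, text)

print(original_matches)  # each match stops just before the first ~
print(fixed_matches)     # each match runs up to the closing double quote
```

The closing double quote (0x22) is not in any of the character classes, so the fixed pattern ends each match right before it.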

Answer 3

Try this script:

text1="""Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|"""
import re
# print() with parentheses works on both Python 2 and Python 3
print(re.findall(r'(https?://\S+)', text1))

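The result is below. Note that each match still ends with the closing double quote, because \S+ also matches the " character:

['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8"']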
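One possible adjustment, not part of the original answer, is to stop the match at whitespace or a double quote, so the trailing " is left out. The sketch below compares the two patterns on a shortened sample string (an assumption for brevity).

```python
import re

# Shortened sample in the same shape as the string from the question.
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# Pattern from the answer: \S+ runs until whitespace, so it swallows the quote.
with_quote = re.findall(r'https?://\S+', text)

# Adjusted pattern: [^\s"]+ stops at whitespace or a double quote.
without_quote = re.findall(r'https?://[^\s"]+', text)

print(with_quote)
print(without_quote)
```

An alternative with the same effect would be to keep the answer's pattern and strip the quote from each match afterwards.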

Answer 4

Try this:

(?:src=)(".*?")

and take group 1 (\1).
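As a rough sketch of how this pattern behaves with re.findall: when a pattern has exactly one capturing group, findall returns that group, which here still includes the surrounding quotes. The second pattern below is a small variation (my assumption, not the answer's pattern) that moves the quotes outside the group. The sample string is shortened for brevity.

```python
import re

# Shortened sample in the same shape as the string from the question.
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# The answer's pattern: group 1 captures the value including its quotes.
quoted = re.findall(r'(?:src=)(".*?")', text)

# Variation: quotes sit outside the group, so only the URL is captured.
urls = re.findall(r'src="(.*?)"', text)

print(quoted)
print(urls)
```

Both patterns rely on the lazy .*? to stop at the first closing quote after src=.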


Answer 1: 2, 2016-06-08 12:48:41
Answer 2: 0, accepted, 2016-06-08 12:46:41
Answer 3: 0, 2016-06-08 12:47:10
Answer 4: 0, 2016-06-08 12:48:24
