How can I extract a specific img src url format using regex?

Question

My string:

Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|

I want to get these 3 links into a list:

http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw
http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0
http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8

They obey this pattern:

src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"

I know that I should use re.findall(pattern, string) to achieve that.

But the big question is: How can I build a pattern that works here?

I'm not very good at writing regex patterns and I always get confused. The one that almost got the job done was this:

pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

But all I got was this list:

[u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/']

It looks like the problem is with the ~r part and the stuff after that.

Answer 1

Where is this data coming from? I'd suggest using an HTML parser instead of trying to extract the URLs with a regex. If the data comes from HTML, a parser lets you pull the full attribute values out of the tags.

Below, I put your string in test.html (Python with BeautifulSoup as an example):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open(r'A:\test.html'), 'html.parser')
>>> [x['src'] for x in soup.find_all('img')]
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']
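If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch for Python 3: the class name ImgSrcCollector and the shortened sample string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Shortened stand-in for the feed text in the question (assumed sample).
sample = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
img_sources = collector.sources
print(img_sources)
```

The parser reads the attribute value directly, so there is no need to guess which characters may appear inside the URL.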

Answer 2

You are missing the ~ character in your regex:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

btw: this is a super way to test regex in Python!
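To see the difference, here is a small side-by-side sketch of the pattern from the question and the version with ~ added. The sample string is a shortened stand-in for the one in the question, and the variable names are only for illustration.

```python
import re

# Shortened stand-in for the string in the question (assumed sample text).
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# Pattern from the question: '~' is not in any character class, so matching stops there.
original = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Same pattern with '~' added to the third character class.
with_tilde = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

truncated = re.findall(original, text)
full_urls = re.findall(with_tilde, text)

print(truncated)
print(full_urls)
```

With the original pattern each match ends right before the first ~, which is the behaviour described in the question. With ~ allowed, the match runs up to the closing double quote, because " is not in any of the character classes.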

Answer 3

Try this script:

text1="""Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|"""
import re
print(re.findall(r'(https?://\S+)', text1))

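The result is shown below. Note that each match still ends with the closing double quote, because \S+ matches any non-whitespace character, including ":

['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8"']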
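If the trailing quote is unwanted, one possible variation (not part of the original answer) is to exclude the double quote from the match as well as whitespace. A minimal sketch, using a shortened stand-in for text1:

```python
import re

# Shortened stand-in for text1 from the script above (assumed sample text).
sample_text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# [^\s"]+ stops at whitespace or at a double quote, whichever comes first.
clean_urls = re.findall(r'https?://[^\s"]+', sample_text)
print(clean_urls)
```

This assumes the URLs always sit inside double-quoted attributes; single-quoted attributes would need ' added to the excluded set.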

Answer 4

Try this:

(?:src=)(".*?")

and take capture group 1. Note that the captured value includes the surrounding double quotes.
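In Python, re.findall returns the contents of the capture group directly when the pattern has exactly one group. The sketch below is a slight variation on the pattern above: the quotes are moved outside the group so only the URL itself is captured. The sample string is a shortened stand-in for the one in the question.

```python
import re

# Shortened stand-in for the string in the question (assumed sample text).
feed_text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# The non-greedy .*? stops at the first closing quote after src=".
src_values = re.findall(r'src="(.*?)"', feed_text)
print(src_values)
```

This picks up every src attribute in the text, not only those on img tags, so it is best suited to input as uniform as the feed in the question.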



4 answers

solution1
2 2016-06-08 12:48:41

solution2
0 ACCEPTED 2016-06-08 12:46:41

solution3
0 2016-06-08 12:47:10

solution4
0 2016-06-08 12:48:24
