
How can I extract a specific img src url format using regex?

My string:

Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|

I want to get these three links into a list:

http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw
http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0
http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8

They obey this pattern:

src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"

I know that I should use re.findall(pattern, string) to achieve that.

But the big question is: How can I build a pattern that works here?

I'm not very good at writing regex patterns and I always get confused. The one that came closest was this:

pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

But all I got was this list:

[u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/']

It looks like the problem is with the ~r part and the stuff after that.

Where is this data coming from? If it comes from HTML, I'd suggest using an HTML parser instead of trying to extract the URLs with a regex. A parser lets you pull the full attribute values straight out of the tags.

Below I put your string in test.html (using Python with BeautifulSoup as an example):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open(r'A:\test.html'), 'html.parser')
>>> [x['src'] for x in soup.find_all('img')]
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']
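If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library alone. This is a minimal sketch assuming Python 3; the class name ImgSrcCollector and the shortened sample string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well,
        # via the default handle_startendtag implementation.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)


# Shortened sample in the same shape as the string from the question.
html = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

collector = ImgSrcCollector()
collector.feed(html)
collector.close()
print(collector.sources)
```

Because the parser reads the attribute value itself, it does not matter which characters the URL contains.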

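You are missing the ~ character in your regex. The range $-_ in your character class happens to cover / and -, but ~ lies outside it, so the match stops at the first ~. Add it to the class:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+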
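To see the effect of the missing ~, here is a small comparison of the pattern from the question and the same pattern with ~ added. The sample string is a shortened stand-in for the one in the question, and the variable names are only for illustration.

```python
import re

# Shortened sample in the same shape as the string from the question.
html = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# Pattern from the question: ~ is not in any of the character classes.
original = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Same pattern with ~ added to the third character class.
with_tilde = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

truncated = re.findall(original, html)   # each match stops just before the first ~
full = re.findall(with_tilde, html)      # each match runs up to the closing quote

print(truncated)
print(full)
```

The closing double quote is not in any of the character classes, so the second pattern stops right at the end of each URL.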

btw: this is a super way to test regex in Python!

Try this script:

text1="""Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|"""
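import re
# [^\s"]+ stops at whitespace or at the closing double quote of the src attribute,
# so the quote is not captured as part of the URL (a plain \S+ would include it)
print(re.findall(r'https?://[^\s"]+', text1))

and the result is

['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']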

Try this:

src="(.*?)"

and take group 1. The quotes sit outside the capturing group, so only the URL itself is captured.
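A short sketch of the capture-group approach with re.findall, written with the quotes outside the group so that only the URL is returned. When a pattern contains exactly one capturing group, re.findall returns a list of that group's matches. The sample string is a shortened stand-in for the one in the question.

```python
import re

# Shortened sample in the same shape as the string from the question.
html = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# Lazy .*? stops at the first closing quote after src=",
# and findall returns only the text captured by the single group.
src_urls = re.findall(r'src="(.*?)"', html)
print(src_urls)
```

This anchors on the src attribute rather than on the URL's character set, so it keeps working if the URLs contain characters the URL pattern did not anticipate.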

