How to extract image URLs from HTML?

Question

I need to extract all image links from a local HTML file. Unfortunately, I can't install bs4 or cssutils to process the HTML.

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""

I tried to extract data using a regex:

import re

images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)

[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
 ['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]

I suppose my regular expression is greedy because I used .* in it. How can I get the following result?

images = ['https://s2.example.com/path/image0.jpg',
          'https://s2.example.com/path/image1.jpg',
          'https://s2.example.com/path/image2.jpg',
          'https://s2.example.com/path/image3.jpg']

If it helps, all links are enclosed in src="..." or url(...).

Thanks for your help.

Answer 1

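import re

# html is the string from the question.
# Start indices: first character after src=" (5 chars) or after url( (4 chars).
indices_start = sorted(
    [m.start() + 5 for m in re.finditer(r'src="', html)]
    + [m.start() + 4 for m in re.finditer(r"url\(", html)])
# End indices: position just past each ".jpg" (the dot is escaped so it is literal).
indices_end = [m.end() for m in re.finditer(r"\.jpg", html)]

# Pair each start with the corresponding end and slice out the path.
image_list = [html[start:end] for start, end in zip(indices_start, indices_end)]

print(image_list)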

That's the solution that comes to mind. It finds the start and end indices of the image path strings and slices the text between them, which assumes every start marker is followed by exactly one .jpg ending. It has to be adjusted if there are other image types.

Edit: Changed the start criteria, in case there are other URLs in the document
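
As a rough sketch of the adjustment for other image types, the start and end searches can be folded into a single pattern that accepts several extensions. The extension list and the helper name find_image_paths are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Hypothetical list of extensions; adjust it to whatever the document contains.
IMAGE_EXTENSIONS = ("jpg", "jpeg", "png", "gif")

def find_image_paths(html, extensions=IMAGE_EXTENSIONS):
    """Return URLs that follow src=" or url( and end in a known extension."""
    ext_group = "|".join(re.escape(ext) for ext in extensions)
    pattern = re.compile(
        r'(?:src="|url\()'                       # start marker, not captured
        r'([^"\s)?]+\.(?:' + ext_group + r'))'   # captured path up to the extension
        r'(?=[?")\s]|$)'                         # extension must end the path
    )
    return pattern.findall(html)

sample = ('<img src="https://s2.example.com/path/image0.jpg?lastmod=1">'
          '<a style="background-image:url(https://s2.example.com/path/image1.png?lastmod=2)"></a>')
print(find_image_paths(sample))
```

Because the start marker and the extension are matched together, a stray src= or url( that does not lead to an image path is skipped instead of shifting the start/end pairing.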

Answer 2

You can use

import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)

Output:

['https://s2.example.com/path/image0.jpg',
 'https://s2.example.com/path/image1.jpg',
 'https://s2.example.com/path/image2.jpg', 
 'https://s2.example.com/path/image3.jpg']

The pattern means:

https://s2 - some literal text
[^\s?]* - zero or more characters other than whitespace and ?
(?=\?lastmod=\d) - immediately to the right, there must be ?lastmod= and a digit (this text is not added to the match because it sits inside a positive lookahead, a non-consuming pattern).
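
To tie this back to the question's suspicion about .*, here is a small comparison of the greedy pattern, a lazy .*? variant, and the negated character class. The sample string and variable names are made up for illustration.

```python
import re

# Shortened, hypothetical sample with two URLs on one line.
sample = ('<a style="background-image:url(https://s2.example.com/a.jpg?lastmod=1)"></a>'
          '<a style="background-image:url(https://s2.example.com/b.jpg?lastmod=2)"></a>')

# Greedy: .* runs to the last "?lastmod=" on the line, so both URLs merge into one match.
greedy = re.findall(r'https://s2.*\?lastmod=\d+', sample)

# Lazy: .*? stops at the first "?lastmod=", giving one match per URL.
lazy = re.findall(r'https://s2.*?(?=\?lastmod=\d)', sample)

# Negated class: [^\s?]* can never cross a "?", so it cannot overshoot.
negated = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', sample)

print(len(greedy))  # 1
print(lazy)
print(negated)
```

Both the lazy and the negated-class versions return the two separate URLs here. The negated class is usually the tighter choice because it states directly which characters may appear in the match.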

Answer 3

import re

xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
# Grab every complete <img ...> tag first.
r1 = re.findall(r"<img(?=\s|>)[^>]*>", xx)
url = []
for x in r1:
    # Find the src attribute; the character class excludes "?", so the query string is dropped.
    x = re.findall(r"src\s*=\s*['\"][\w:/.=]*", x)
    if not x:
        continue
    # Keep only the URL part, starting at http or https.
    x = re.findall(r"https?[\w:/.=]*", x[0])
    if not x:
        continue
    url.append(x[0])
print(url)
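
Since bs4 is not available, another option is the standard library's html.parser. This is a different technique from the regex-only code above, sketched under the assumption that the markup is reasonably well formed; the class name ImageURLCollector and the sample string are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

class ImageURLCollector(HTMLParser):
    """Collect URLs from <img src="..."> and from url(...) in style attributes."""

    # Matches url(...) with optional quotes around the address.
    STYLE_URL = re.compile(r"""url\(\s*['"]?([^'")\s]+)""")

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # attribute without a value
                continue
            if tag == "img" and name == "src":
                self.urls.append(self._strip_query(value))
            elif name == "style":
                for found in self.STYLE_URL.findall(value):
                    self.urls.append(self._strip_query(found))

    @staticmethod
    def _strip_query(url):
        # Drop "?lastmod=..." and any fragment; keep scheme, host and path.
        return urlsplit(url)._replace(query="", fragment="").geturl()

sample = ('<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
          '<a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"></a>')
collector = ImageURLCollector()
collector.feed(sample)
print(collector.urls)
```

The parser takes care of locating tags and attributes, so the only regex left is the small one for url(...) inside style values. Malformed tags may be parsed differently than expected, so it is worth checking the result against the real file.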

3 answers

solution1
1 2021-10-25 14:58:28

solution2
0 ACCEPTED 2021-10-25 16:32:46

solution3
0 2021-10-25 17:08:21
