How can I extract some URLs from HTML?

Question

I need to extract all image links from a local HTML file. Unfortunately, I can't install bs4 or cssutils to process the HTML.

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""

I tried to extract the data with a regular expression:

import re

images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)

[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
 ['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]

I suspect my regular expression is too greedy because I used .* in it. How can I get the following result?

images = ['https://s2.example.com/path/image0.jpg',
          'https://s2.example.com/path/image1.jpg',
          'https://s2.example.com/path/image2.jpg',
          'https://s2.example.com/path/image3.jpg']

If it helps, all of the links are enclosed in either src="..." or url(...).

Thanks for your help.

Answer 1

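import re

# Start offsets: the position just past each 'src="' or 'url(' opener.
indices_start = sorted(
    [m.end() for m in re.finditer(r'src="', html)]
    + [m.end() for m in re.finditer(r'url\(', html)])
# End offsets: the position just past each ".jpg" (dot escaped so it is literal).
indices_end = [m.end() for m in re.finditer(r'\.jpg', html)]

image_list = []

# Pair starts with ends in order; this assumes exactly one ".jpg" per link.
for start, end in zip(indices_start, indices_end):
    image_list.append(html[start:end])

print(image_list)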

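This is one solution I came up with. It finds the start and end indices of each image path string. If the document contains other image types, it obviously has to be adjusted.

Edit: changed the start condition in case the document contains other URLs.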
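
As a possible variation on the same delimiter idea, the extension does not have to be named at all: capture everything between the opener (src=" or url() and the first ?, quote, or closing parenthesis. This is only a sketch. The sample string and the name LINK are made up for illustration, and it assumes the URL path itself contains no ?, " or ) characters.

```python
import re

# Sample input modeled on the question, with one .png link added on purpose.
html = (
    '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>'
    '<div><a style="background-image:url(https://s2.example.com/path/image1.png?lastmod=1625296911)"></a></div>'
)

# Match the opening delimiter, then capture up to the first "?", quote, or ")"
# so the file extension never has to be listed.
LINK = re.compile(r'(?:src="|url\()([^?")]+)')

image_list = LINK.findall(html)
print(image_list)
```

Because the capture stops at the first ?, the lastmod query string is dropped without any slicing arithmetic.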

Answer 2

You can use

import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)

Output:

['https://s2.example.com/path/image0.jpg',
 'https://s2.example.com/path/image1.jpg',
 'https://s2.example.com/path/image2.jpg', 
 'https://s2.example.com/path/image3.jpg']

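Pattern details:

https://s2 - a literal string
[^\s?]* - zero or more characters other than whitespace and ?
(?=\?lastmod=\d) - immediately to the right there must be ?lastmod= followed by a digit (this text is not added to the match because it sits inside a positive lookahead, which is a non-consuming pattern).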
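
To make the greediness point concrete, here is a small comparison of the question's greedy pattern, a lazy variant, and the negated-class-plus-lookahead pattern from this answer. The variable names and the shortened input line are chosen for illustration only.

```python
import re

# Two links on a single line, as in the second line of the question's HTML.
line = (
    '<a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"></a>'
    '<a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a>'
)

# Greedy: ".*" runs to the LAST "?lastmod=" on the line, so both links merge into one match.
greedy = re.findall(r'https://s2.*\?lastmod=\d+', line)

# Lazy: ".*?" stops at the FIRST "?lastmod=", giving one match per link (query string included).
lazy = re.findall(r'https://s2.*?\?lastmod=\d+', line)

# Negated class + lookahead: cannot cross a "?" and keeps the query string out of the match.
trimmed = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', line)

print(len(greedy), len(lazy), trimmed)
```

The lazy version fixes the merging but still returns the query string, which is why the lookahead form is closer to the requested output.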

Answer 3

import re

xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
# Collect every <img ...> tag first.
img_tags = re.findall(r"<img(?=\s|>)[^>]*>", xx)
urls = []
for tag in img_tags:
    # Take the src attribute up to the first character outside the allowed set (such as "?").
    src = re.findall(r"src\s*=\s*['\"][\w:/.=]*", tag)
    if not src:
        continue
    # Keep only the URL part, dropping the leading src=" prefix.
    found = re.findall(r"https?[\w:/.=]*", src[0])
    if not found:
        continue
    urls.append(found[0])
print(urls)
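
Since the question rules out bs4, another option that needs no installation might be the standard library's html.parser together with urllib.parse. The sketch below is not from any of the answers. The names ImageLinkParser, strip_query, and sample are hypothetical, and it assumes well-formed markup; a broken tag such as the "</a> in the question's HTML may be parsed differently.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ImageLinkParser(HTMLParser):
    """Collects URLs from src attributes and from url(...) inside style attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "src":
                self.links.append(value)
            elif name == "style":
                # Pull each url(...) target out of the inline CSS, dropping optional quotes.
                for target in re.findall(r"url\(([^)]*)\)", value):
                    self.links.append(target.strip("'\" "))


def strip_query(url):
    # Drop the "?lastmod=..." part (and any fragment), keeping scheme, host and path.
    return urlsplit(url)._replace(query="", fragment="").geturl()


# Well-formed sample modeled on the question's input.
sample = (
    '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>'
    '<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"></a></div>'
)

parser = ImageLinkParser()
parser.feed(sample)
parser.close()
images = [strip_query(u) for u in parser.links]
print(images)
```

Unlike the regex in this answer, this also picks up links that appear in style attributes on tags other than img.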

3 answers

Answer 1: score 1, 2021-10-25 14:58:28
Answer 2: score 0, accepted, 2021-10-25 16:32:46
Answer 3: score 0, 2021-10-25 17:08:21
