How do I extract some URLs from HTML?

Question

I need to extract all image links from a local HTML file. Unfortunately, I can't install bs4 or cssutils to process the HTML.

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""

I tried to extract the data with a regular expression:

import re

images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)

[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
 ['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]

I think my regular expression is greedy because I used .*, is that right? How can I get the following result?

images = ['https://s2.example.com/path/image0.jpg',
          'https://s2.example.com/path/image1.jpg',
          'https://s2.example.com/path/image2.jpg',
          'https://s2.example.com/path/image3.jpg']

If it helps, all of the links are enclosed in src="..." or url(...).

Thanks for your help.

Answer 1

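import re

# Start offsets: the character right after each src=" or url( prefix.
indices_start = sorted(
    [m.end() for m in re.finditer(r'src="', html)]
    + [m.end() for m in re.finditer(r'url\(', html)])
# End offsets: just past each ".jpg" (dot escaped so it matches literally).
indices_end = [m.end() for m in re.finditer(r'\.jpg', html)]

image_list = []

# Pair starts with ends in document order; assumes one .jpg per src/url.
for start, end in zip(indices_start, indices_end):
    image_list.append(html[start:end])

print(image_list)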

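This is one solution I came up with. It finds the start and end indices of each image path string. It obviously has to be adjusted if there are other image types.

Edit: changed the start condition in case the document contains other URLs.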
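As a rough sketch of the adjustment mentioned above, the same index-pairing idea can be extended to several image types. The function name `extract_by_indices`, the extension list, and the sample string are assumptions made for illustration, and the pairing still relies on every src="/url( prefix being followed by exactly one matching extension.

```python
import re

def extract_by_indices(html, extensions=("jpg", "jpeg", "png", "gif")):
    """Pair each src=" / url( start offset with the next extension end offset."""
    # Start offsets: right after each src=" or url( prefix, in document order.
    starts = [m.end() for m in re.finditer(r'src="|url\(', html)]
    # End offsets: just past any of the listed extensions (dot matched literally).
    ext_pattern = r'\.(?:' + '|'.join(map(re.escape, extensions)) + r')'
    ends = [m.end() for m in re.finditer(ext_pattern, html)]
    # zip stops at the shorter list, so unmatched leftovers are silently dropped.
    return [html[s:e] for s, e in zip(starts, ends)]

# Hypothetical sample mixing two image types.
sample = ('<img src="https://s2.example.com/a.png?x=1">'
          '<a style="background-image:url(https://s2.example.com/b.jpg?x=2)"></a>')
result = extract_by_indices(sample)
print(result)
```

Because starts and ends are matched purely by position, an extension that appears outside an image URL would shift every later pair, so this is only suitable for markup as regular as the example.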

Answer 2

You can use

import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)

See the Python demo. Output:

['https://s2.example.com/path/image0.jpg',
 'https://s2.example.com/path/image1.jpg',
 'https://s2.example.com/path/image2.jpg', 
 'https://s2.example.com/path/image3.jpg']

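See also the regex demo. The pattern means:

https://s2 - literal text
[^\s?]* - zero or more characters other than whitespace and ?
(?=\?lastmod=\d) - immediately to the right, there must be ?lastmod= and a digit (this text is not added to the match because it sits inside a positive lookahead, which is a non-consuming pattern).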
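To make the difference concrete, here is a small sketch comparing the greedy pattern from the question with a lazy variant and with the negated character class described above. The shortened `line` string and the variable names are assumptions made for illustration.

```python
import re

# Two url(...) links on one line, as in the second line of the question's HTML.
line = ('<a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1)"></a>'
        '<a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=2)"></a>')

# Greedy: .* runs to the end of the line and backtracks to the LAST ?lastmod=,
# so both links are swallowed into a single match.
greedy = re.findall(r'https://s2.*\?lastmod=\d+', line)

# Lazy: .*? stops at the first position where the lookahead succeeds.
lazy = re.findall(r'https://s2.*?(?=\?lastmod=\d)', line)

# Negated class: the match can never cross whitespace or a "?".
negated = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', line)

print(len(greedy), lazy, negated)
```

The lazy and negated-class versions agree on this input. The negated class is the stricter of the two: if a link had no ?lastmod= part, the lazy pattern could still run on into a later link, whereas [^\s?]* stops at the first whitespace or ?.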

Answer 3

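import re

xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
# Collect every <img ...> tag first.
img_tags = re.findall(r"<img(?=\s|>)[^>]*>", xx)
url = []
for tag in img_tags:
    # Grab the src attribute up to the first character outside the class
    # (the class has no "?", so the query string is cut off).
    src = re.findall(r"src\s*=\s*['\"][\w:/.=]*", tag)
    if not src:
        continue
    # Drop the leading src=" part, keeping only the URL.
    found = re.findall(r"https?[\w:/.=]*", src[0])
    if not found:
        continue
    url.append(found[0])
print(url)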
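Since bs4 is not available, another option is the standard library's html.parser, which none of the answers above use. The following is only a sketch under assumptions: the class name `ImageLinkParser`, the helper `strip_query`, and the well-formed sample string are made up for illustration, and the url(...) regex covers only simple, unescaped URLs.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

class ImageLinkParser(HTMLParser):
    """Collect image URLs from <img src="..."> and from url(...) in style attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <img a>
                continue
            if tag == "img" and name == "src":
                self.links.append(strip_query(value))
            elif name == "style":
                # Pull every url(...) out of the inline CSS, with or without quotes.
                for css_url in re.findall(r'url\(\s*[\'"]?([^\'")\s]+)', value):
                    self.links.append(strip_query(css_url))

# Hypothetical well-formed sample modelled on the question's HTML.
sample = ('<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>'
          '<div><a style="background-image:url(https://s2.example.com/path/image1.jpg'
          '?lastmod=1625296911)"></a></div>')
parser = ImageLinkParser()
parser.feed(sample)
print(parser.links)
```

Parsing the tags first means the result does not depend on the https://s2 prefix or on a ?lastmod= suffix being present, at the cost of a little more code than a single regular expression.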

Answer 1
1 2021-10-25 14:58:28

Answer 2
0 Accepted 2021-10-25 16:32:46

Answer 3
0 2021-10-25 17:08:21
