
How to get an image tag from a dynamic web page using BeautifulSoup?

Hi, I'm trying to get the images on a web page using requests and BeautifulSoup.

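import requests
from bs4 import BeautifulSoup as BS

# url and headers are defined elsewhere (the target page and the request headers)
data = requests.get(url, headers=headers).content
soup = BS(data, "html.parser")
# Look for the slider images in the downloaded HTML
for imgtag in soup.find_all("img", class_="slider-img"):
    print(imgtag["src"])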

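The problem is that when I fetch the web page into data, it does not contain the image tags. However, when I open the page in a web browser, the div tag is populated with multiple <img class="slider-img"> tags.

I'm new to this, so I don't know what is happening on the web page. Thanks in advance for your help.

PS: The web page uses Fotorama Slider, and the src attributes contain CDN links, in case that matters.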

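The image tags are created dynamically by JavaScript, so they are not present in the HTML that requests downloads. You only need the uuids to construct the image URLs, and they are stored in the page: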

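import re
import requests
from ast import literal_eval


url = "https://fotorama.io/"
# CDN URL template; {uuid} is filled in for each image
img_url = "https://ucarecdn.com/{uuid}/-/stretch/off/-/resize/760x/"

html_doc = requests.get(url).text
# The uuids are stored in the page source as an array: uuids: [...]
uuids = re.search(r"uuids: (\[.*?\])", html_doc, flags=re.S).group(1)
# Parse the array literal into a Python list of strings
uuids = literal_eval(uuids)

for uuid in uuids:
    print(img_url.format(uuid=uuid))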

Prints:

https://ucarecdn.com/05e7ff61-c1d5-4d96-ae79-c381956cca2e/-/stretch/off/-/resize/760x/
https://ucarecdn.com/cd8dfa25-2bc5-4546-995a-f3fd23809e1d/-/stretch/off/-/resize/760x/
https://ucarecdn.com/382a5139-6712-4418-b25e-cc8ba69ab07f/-/stretch/off/-/resize/760x/
https://ucarecdn.com/3ed25902-4a51-4628-a057-1e55fbca7856/-/stretch/off/-/resize/760x/
https://ucarecdn.com/5b0b329d-050e-4143-bc92-7f40cdde46f5/-/stretch/off/-/resize/760x/
https://ucarecdn.com/464f96db-6ae3-4875-ac6a-cbede40c4a51/-/stretch/off/-/resize/760x/
https://ucarecdn.com/4facbe78-b4e8-4b7d-8fb0-d3659f46f1b4/-/stretch/off/-/resize/760x/
https://ucarecdn.com/379c6c28-f726-48a3-b59e-1248e1e30443/-/stretch/off/-/resize/760x/
https://ucarecdn.com/631479df-27a8-4047-ae59-63f9167001f2/-/stretch/off/-/resize/760x/
https://ucarecdn.com/8e1e4402-84f0-4d78-b7d8-c48ec437b5af/-/stretch/off/-/resize/760x/
https://ucarecdn.com/f55e6755-198a-408d-8e82-a50370527aed/-/stretch/off/-/resize/760x/
https://ucarecdn.com/5264c896-cf01-4ad9-9216-114c20a388cc/-/stretch/off/-/resize/760x/
https://ucarecdn.com/c6284eae-9be4-4811-b45b-17a5b6e99ad2/-/stretch/off/-/resize/760x/
https://ucarecdn.com/40ff508f-01e5-4417-bee0-20633efc6147/-/stretch/off/-/resize/760x/
https://ucarecdn.com/eaaee377-f1b5-49d7-a7db-d7a1f86b2805/-/stretch/off/-/resize/760x/
https://ucarecdn.com/584c29c8-b521-48ee-8104-6656d4faac97/-/stretch/off/-/resize/760x/
https://ucarecdn.com/798aa641-01fe-4ed2-886b-bac818c5fdfc/-/stretch/off/-/resize/760x/
https://ucarecdn.com/f82be8f5-d517-4642-8fe1-8987b4e530d0/-/stretch/off/-/resize/760x/
https://ucarecdn.com/23b818d0-07c3-40de-a070-c999c1323ff3/-/stretch/off/-/resize/760x/
https://ucarecdn.com/7ca0e7f6-90eb-4254-82ea-58c77e74f6a0/-/stretch/off/-/resize/760x/
https://ucarecdn.com/42dc8c54-2315-453f-9b40-07e332b8ee39/-/stretch/off/-/resize/760x/
https://ucarecdn.com/8e62227c-5acb-4603-abb9-ac0643b7b478/-/stretch/off/-/resize/760x/
https://ucarecdn.com/80713821-5d54-4819-810a-19991502ca56/-/stretch/off/-/resize/760x/
https://ucarecdn.com/35ce83fa-eac1-4326-83e9-e445450b35ce/-/stretch/off/-/resize/760x/
https://ucarecdn.com/3df9ac37-4e86-49e5-9095-28679ab37718/-/stretch/off/-/resize/760x/
https://ucarecdn.com/9e7211c0-b73b-4b1d-8b47-4b1700f9a80f/-/stretch/off/-/resize/760x/
https://ucarecdn.com/1cc3c44b-e4a9-4e37-96cf-afafeb3eb748/-/stretch/off/-/resize/760x/
https://ucarecdn.com/ab52465c-b3d8-4bf6-986a-a4bf815dfaed/-/stretch/off/-/resize/760x/
https://ucarecdn.com/69e43c1d-9fac-4278-bec5-52291c1b1c2b/-/stretch/off/-/resize/760x/
https://ucarecdn.com/0627c11f-522d-48b9-9f17-9ea05b769aaa/-/stretch/off/-/resize/760x/
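
If the page layout changes, re.search returns None and the snippet above fails with an AttributeError. Below is a minimal sketch of the same extraction wrapped in a function that returns an empty list in that case. The function name extract_image_urls and the SAMPLE_HTML string are hypothetical, made up for illustration; the sample only imitates what the downloaded page is assumed to look like (an empty slider div plus a script holding the uuids) so the logic can be tried without a network request.

```python
import re
from ast import literal_eval

# Same CDN URL template as in the answer above.
IMG_URL = "https://ucarecdn.com/{uuid}/-/stretch/off/-/resize/760x/"


def extract_image_urls(html_doc, template=IMG_URL):
    """Build image URLs from the `uuids: [...]` array found in html_doc.

    Returns an empty list when no such array is present.
    """
    match = re.search(r"uuids:\s*(\[.*?\])", html_doc, flags=re.S)
    if match is None:
        return []
    # The captured text is a list of quoted strings, which literal_eval
    # can parse safely without executing any code.
    uuids = literal_eval(match.group(1))
    return [template.format(uuid=uuid) for uuid in uuids]


# Hypothetical stand-in for the downloaded HTML (an assumption, not the real
# page source): the slider div is empty, the uuids sit inside a script block.
SAMPLE_HTML = """
<div class="fotorama"></div>
<script>
  var config = {
    uuids: [
      '05e7ff61-c1d5-4d96-ae79-c381956cca2e',
      'cd8dfa25-2bc5-4546-995a-f3fd23809e1d'
    ]
  };
</script>
"""

urls = extract_image_urls(SAMPLE_HTML)
for image_url in urls:
    print(image_url)
```

With a live page, the text returned by requests.get(url).text can be passed to the function in place of SAMPLE_HTML.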

