
Unable to extract some links in a webpage using BeautifulSoup

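I am trying to scrape images from a webpage. The page shows many images, each linking to a new page, and I want to follow each link and extract the image from the sub-page. To do that, I first need a list of all the "links" on the original page. I have the following code: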

# import necessary libraries
from bs4 import BeautifulSoup
import requests
import re


# function to extract html document from given url
def getHTMLdocument(url):
    # request for HTML document of given url
    response = requests.get(url)

    # response.text holds the page's HTML source as a string
    return response.text


# URL of the page to scrape
url_to_scrape = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured#!#filterName:featured,viewType:masonry"

# create document
html_document = getHTMLdocument(url_to_scrape)

# create soup object
soup = BeautifulSoup(html_document, 'html.parser')

# find all the anchor tags with "href"
# attribute starting with "https://"
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    # display the actual urls
    print(link.get('href'))

However, this only gives me the following list:

https://www.globalcitizen.org/en/content/ways-to-help-ukraine-conflict/
https://www.wikiart.org/en/giovanni-bellini/leonardo-loredan-1501-1
https://www.1st-art-gallery.com/
https://www.wikiart.org/en/giovanni-bellini/leonardo-loredan-1501-1
https://www.facebook.com/wikiart.org
https://twitter.com/wikipaintings
https://www.1st-art-gallery.com/
https://www.1st-art-gallery.com/
https://wikiart.uservoice.com
https://itunes.apple.com/us/app/wikiart/id1235995167
https://play.google.com/store/apps/details?id=com.ilit.wikipaintings
https://www.facebook.com/wikiart.org
https://twitter.com/wikipaintings

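These are all ordinary links on the page, but the list is missing the links you would follow by clicking one of the images. Clicking an image takes you to another regular https:// page, so I am not sure what I am missing here.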

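I can see that the images are not "normal links": if I Ctrl-click one, the new page opens in the same tab instead of a new one. I guess that is related to why they do not show up in BeautifulSoup? But I do not know what this kind of link is called, so I do not know what to search for.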

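The links to the images are embedded in the HTML source, but you need to get them out first. Once you have the image source URLs, you can download them if you like.

Here's how: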

import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured#!#filterName:featured,viewType:masonry"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

soup = (
    BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
    .find("div", class_="artworks-by-dictionary")["ng-init"]
)

images = [
    i["image"] for i in json.loads(re.search(r":\s(\[.*\])", soup).group(1))
]

print("\n".join(images))

Output:

https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg
https://uploads4.wikiart.org/images/felix-vallotton/the-pont-neuf-1901.jpg
https://uploads7.wikiart.org/images/pierre-roy/les-mauvaises-graines-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-way-to-locquirec-1902.jpg
https://uploads8.wikiart.org/images/felix-vallotton/the-five-painters-1902.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-toilet-1905.jpg

and more...
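
To make the parsing step above easier to follow, here is a minimal offline sketch of the same regex plus json.loads idea. The sample ng-init string is invented for illustration (the real attribute on the page is much longer and may be shaped differently), and extract_image_urls is a hypothetical helper name, not something from the answer.

```python
import json
import re

# Made-up stand-in for the "ng-init" attribute value. In the answer it is
# what .find("div", class_="artworks-by-dictionary")["ng-init"] returns.
SAMPLE_NG_INIT = (
    'paintings: [{"title": "A", "image": "https://example.com/a.jpg"}, '
    '{"title": "B", "image": "https://example.com/b.jpg"}]'
)


def extract_image_urls(ng_init):
    """Return the "image" value of every entry in the embedded JSON array."""
    # Same pattern as in the answer: a colon, one whitespace character,
    # then a "[" and everything up to the last "]" on the same line.
    match = re.search(r":\s(\[.*\])", ng_init)
    if match is None:
        return []  # no JSON array found in the attribute
    return [item["image"] for item in json.loads(match.group(1))]


print("\n".join(extract_image_urls(SAMPLE_NG_INIT)))
```

Returning an empty list when the pattern does not match avoids the AttributeError that calling .group(1) on None would raise if the page layout ever changes.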

EDIT:

Actually, there is an even simpler way to get this data:

import json

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured"
}

url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured&json=2&layout=new&page=1&resultType=masonry"
paintings = requests.get(url, headers=headers).json()["Paintings"]

for painting in paintings:
    print(f"{painting['artistName']} - {painting['title']}\n{painting['image']}")

Output:

Felix Vallotton - Portrait of Thadee Nathanson
https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
Felix Vallotton - The Source
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
Telemaco Signorini - The morning toilet
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
Felix Vallotton - Laid down woman, sleeping
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
Felix Vallotton - Sunset
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
Felix Vallotton - Red Sand and Snow
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
Felix Vallotton - The pier of Honfleur
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg

and more ...
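
As mentioned earlier, once you have the image URLs you can download them. A rough sketch of that step follows, assuming you are happy to save each file under the last segment of its URL. The names filename_from_url and download_image and the "images" folder are my own choices for illustration.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return PurePosixPath(urlparse(url).path).name


def download_image(url, target_dir="images", headers=None):
    """Save one image into target_dir and return the path that was written."""
    # Imported here so filename_from_url works even without requests installed.
    import requests

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    folder = Path(target_dir)
    folder.mkdir(parents=True, exist_ok=True)
    destination = folder / filename_from_url(url)
    destination.write_bytes(response.content)
    return destination


print(filename_from_url(
    "https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg"
))
```

Calling download_image(painting["image"], headers=headers) inside the loop above would save each file. Two paintings whose URLs end in the same file name would overwrite each other, so you may want to prefix the artist name.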

Bonus:

By incrementing the page value in the URL, you can paginate through the results.
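
The pagination idea could be sketched like this. It is only a sketch: I am assuming the endpoint returns an empty or missing "Paintings" list once the pages run out, which I have not verified, so max_pages acts as a safety cap. fetch_page and iter_paintings are hypothetical helper names.

```python
BASE_URL = (
    "https://www.wikiart.org/en/paintings-by-style/magic-realism"
    "?select=featured&json=2&layout=new&page={page}&resultType=masonry"
)
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured",
}


def fetch_page(page):
    """Fetch one page of results and return its list of paintings."""
    # Imported lazily so the loop below can be exercised without network access.
    import requests

    response = requests.get(BASE_URL.format(page=page), headers=HEADERS, timeout=30)
    response.raise_for_status()
    # Assumption: "Paintings" is missing or null once the pages run out.
    return response.json().get("Paintings") or []


def iter_paintings(fetch=fetch_page, max_pages=5):
    """Yield paintings page by page until a page is empty or max_pages is hit."""
    for page in range(1, max_pages + 1):
        paintings = fetch(page)
        if not paintings:
            break
        yield from paintings
```

For example, looping over iter_paintings(max_pages=3) and printing painting["image"] would list the images from the first three pages. Adding a short time.sleep between requests is a polite extra if you fetch many pages.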

