
I am trying to scrape a website for links and also scrape the links inside the already scraped links

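I am trying to scrape a website for links. After scraping, I also want to check whether each link I scraped is just an article or contains more links, and if it does, I want to scrape those links too. I am trying to implement this with BeautifulSoup 4, and this is the code I have so far: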

import requests
from bs4 import BeautifulSoup
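url = 'https://www.lbbusinessjournal.com/'
# Placeholder value: the original post used user_agent without showing its definition.
user_agent = 'Mozilla/5.0'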
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)

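I want the links from the page https://www.lbbusinessjournal.com/, and I also want to search the links I get from that page for further links. For example, for https://www.lbbusinessjournal.com/news/ I want the links inside that page as well. So far I am only getting the links from the main page.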

Try raise e in your except clause and you will see the error

AttributeError: 'NoneType' object has no attribute 'get'

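from the line link1 = post1.find('a').get('href'), where post1.find('a') returns None. This happens because at least one of the h3 elements you retrieve has no a element inside it. In fact, it looks like the link is commented out in the HTML.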

Instead, split the post1.find('a').get('href') call into two steps, and check that the element returned by post1.find('a') is not None before trying to read its 'href' attribute, i.e.:

for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)
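
To see the None case without fetching the live site, here is a minimal sketch that runs the same guard over a small made-up HTML string. The sample markup, the example.com URLs and the helper name extract_links are assumptions for illustration only; they are not taken from the site or from the original answer.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the second heading has its link commented out,
# so it contains no <a> tag and find('a') returns None for it.
SAMPLE_HTML = """
<h3 class="entry-title td-module-title"><a href="https://example.com/a">A</a></h3>
<h3 class="entry-title td-module-title"><!-- <a href="https://example.com/b">B</a> --></h3>
<h3 class="entry-title td-module-title"><a href="https://example.com/c">C</a></h3>
"""


def extract_links(html):
    """Return the href of the first <a> in each matching <h3>, skipping headings without one."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for heading in soup.find_all('h3', class_='entry-title td-module-title'):
        anchor = heading.find('a')
        if anchor is None:
            # No link in this heading (for example, it was commented out).
            continue
        href = anchor.get('href')
        if href:
            links.append(href)
    return links


print(extract_links(SAMPLE_HTML))
```

The commented-out link is parsed as a comment node rather than a tag, so only the first and third headings contribute a URL.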

Output from running your code with this change:

https://www.lbbusinessjournal.com/
https://www.lbbusinessjournal.com/this-virus-doesnt-have-borders-port-official-warns-of-pandemics-future-economic-impact/
https://www.lbbusinessjournal.com/pharmacy-and-grocery-store-workers-call-for-increased-protections-against-covid-19/
https://www.lbbusinessjournal.com/up-close-and-personal-grooming-businesses-struggle-in-times-of-social-distancing/
https://www.lbbusinessjournal.com/light-at-the-end-of-the-tunnel-long-beach-secures-contract-for-new-major-convention/
https://www.lbbusinessjournal.com/hospitals-prepare-for-influx-of-coronavirus-patients-officials-worry-it-wont-be-enough/
https://www.lbbusinessjournal.com/portside-keeping-up-with-the-port-of-long-beach-18/
https://www.lbbusinessjournal.com/news/
...
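
The nested loops above follow links exactly two levels deep. If you want to keep following links further, one possible approach (not part of the original answer) is a breadth-first traversal with a depth limit and a set of already-seen URLs, so that pages linking back to each other are not fetched twice. In this sketch, crawl, get_links and FAKE_SITE are hypothetical names; get_links stands in for whatever function fetches a page and returns its links, such as the requests and BeautifulSoup code above.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=2):
    """Return every URL found within max_depth hops of start_url, in discovery order."""
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not fetched
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Stand-in for real network access: a made-up site as a dict of page -> links.
FAKE_SITE = {
    'https://example.com/': ['https://example.com/news/', 'https://example.com/post-1/'],
    'https://example.com/news/': ['https://example.com/post-1/', 'https://example.com/post-2/'],
    'https://example.com/post-1/': [],
    'https://example.com/post-2/': ['https://example.com/'],
}

for found in crawl('https://example.com/', lambda url: FAKE_SITE.get(url, [])):
    print(found)
```

Passing the link extractor in as a parameter keeps the traversal logic separate from the fetching and parsing, which also makes it easy to try out on fake data before pointing it at a real site.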


Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original source address.

 