Getting the next page URL

Question

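I am trying to scrape all the URLs from a website. It has 5 categories in total, and each category has a different number of pages (each page contains 10 articles).

For example: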

Categories   Pages
Banana          5
Apple          14
Cherry          7
Melon           6
Berry           2

Code:

import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin


res = requests.get('http://www.abcde.com/SearchParts')
soup = BeautifulSoup(res.text, "lxml")
# Category links are the <a> tags whose id matches "parts_img"
href = [a["href"] for a in soup.find_all("a", {"id": re.compile("parts_img.*")})]
# b1: first-page URL of each category
b1 = []
for url in href:
    b1.append("http://www.abcde.com" + url)
print(b1)

From the main page "http://www.abcde.com/SearchParts" I can scrape the first-page URL of each category. b1 is the list of first-page URLs.

Like this:
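Since `urljoin` is already imported, one option is to resolve each `href` against the page URL rather than concatenating strings. This is only a sketch: the `hrefs` values below are hypothetical stand-ins for whatever the main page actually returns.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.abcde.com/SearchParts"

# Hypothetical hrefs, standing in for the values scraped from the main page.
hrefs = ["/A", "/B", "http://www.abcde.com/C"]

# urljoin resolves relative hrefs against BASE_URL and
# leaves hrefs that are already absolute unchanged.
b1 = [urljoin(BASE_URL, href) for href in hrefs]
print(b1)
```

This avoids producing a malformed URL if the site ever emits an absolute link.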

Categories   Pages                       url
Banana          1     http://www.abcde.com/A
Apple           1     http://www.abcde.com/B
Cherry          1     http://www.abcde.com/C
Melon           1     http://www.abcde.com/E
Berry           1     http://www.abcde.com/F

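Then I use the page source of each URL in b1 to scrape the next-page URL, so b2 is the list of second-page URLs.

Code:

b2 = []
for url in b1:
    res2 = requests.get(url).text
    soup2 = BeautifulSoup(res2, "lxml")
    # The link to the following page is the element with rel="next"
    url_n = soup2.find('', rel='next')['href']
    b2.append("http://www.abcde.com" + url_n)
print(b2)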

Like this:

Categories   Pages                       url
Banana          2     http://www.abcde.com/A/s=1&page=2
Apple           2     http://www.abcde.com/B/s=9&page=2
Cherry          2     http://www.abcde.com/C/s=11&page=2
Melon           2     http://www.abcde.com/E/s=7&page=2
Berry           2     http://www.abcde.com/F/s=5&page=2

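Now, when I try to do the same for the third page, I get an error, because Berry's second page is its last page, so there is no "next page" link in its source. What should I do, especially since each category has a different number of pages/URLs?

The whole code (up to the point where the error occurs):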

import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin


res = requests.get('http://www.ca2-health.com/frontend/SearchParts')
soup = BeautifulSoup(res.text, "lxml")
href = [a["href"] for a in soup.find_all("a", {"id": re.compile("parts_img.*")})]
# b1: first-page URL of each category
b1 = []
for url in href:
    b1.append("http://www.ca2-health.com" + url)
print(b1)
print("===================================================")
# b2: second-page URL of each category
b2 = []
for url in b1:
    res2 = requests.get(url).text
    soup2 = BeautifulSoup(res2, "lxml")
    url_n = soup2.find('', rel='next')['href']
    b2.append("http://www.ca2-health.com" + url_n)
print(b2)
print("===================================================")
# b3: third-page URL of each category; the error occurs here
# for any category whose second page is its last
b3 = []
for url in b2:
    res3 = requests.get(url).text
    soup3 = BeautifulSoup(res3, "lxml")
    url_n = soup3.find('', rel='next')['href']
    b3.append("http://www.ca2-health.com" + url_n)
print(b3)

Then I would build b1, b2, b3, ... as lists, and after that I would have all the URLs on this site.
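For illustration, combining the per-page lists into one could look like the sketch below. The list contents are hypothetical, shaped like the b1, b2, b3 described above, and `all_urls` is a made-up name.

```python
from itertools import chain

# Hypothetical per-page lists, shaped like b1, b2, b3 above.
b1 = ["http://www.abcde.com/A", "http://www.abcde.com/F"]
b2 = ["http://www.abcde.com/A/s=1&page=2", "http://www.abcde.com/F/s=5&page=2"]
b3 = ["http://www.abcde.com/A/s=1&page=3"]  # Berry (F) has no third page

# Flatten the lists into a single list of page URLs.
all_urls = list(chain(b1, b2, b3))
print(len(all_urls))
```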

Answer 1

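I guess you are getting a KeyError. Handle the exception and continue the loop. Note that when no element with rel="next" exists, find returns None, and subscripting None raises a TypeError rather than a KeyError. Do the following: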

try:
    url_n = soup3.find(rel='next')['href']
except (KeyError, TypeError):
    # TypeError: find returned None; KeyError: the tag has no href
    continue

Or:

try:
    url_n = soup3.find(rel='next').get('href')
except AttributeError:
    continue
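Going one step further, a loop that keeps following the rel="next" link until none is left handles categories with different page counts, without separate b1, b2, b3 lists. The following is a sketch that has not been run against the real site. The names `fetch_html`, `next_page_url`, `collect_pages` and the `fetch` parameter are hypothetical, and `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def next_page_url(html, current_url):
    """Return the absolute URL of the rel="next" link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find(rel="next")
    if link is None or not link.get("href"):
        return None
    return urljoin(current_url, link["href"])


def collect_pages(first_url, fetch=fetch_html):
    """Follow rel="next" links from first_url and return every page URL in order."""
    pages = []
    seen = set()  # guards against a pagination loop
    url = first_url
    while url is not None and url not in seen:
        seen.add(url)
        pages.append(url)
        url = next_page_url(fetch(url), url)
    return pages
```

With the first-page list from the question, something like `{url: collect_pages(url) for url in b1}` would give every page URL per category. The `fetch` parameter exists only so the loop can be exercised with canned HTML instead of live requests.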

Solution 1
0 2018-01-03 09:06:09
