Python getting incomplete next page URL (BeautifulSoup, Requests)
I'm very new to Python and web scraping; my project uses http://books.toscrape.com/index.html, and I'm stuck on the pagination logic. So far I've managed to get every category, the book links, and the information I need, but I'm struggling to scrape the next-page URL for each category. The first problem is that the next-page URL is incomplete (that I can manage); the second is that the base URL has to change for each category. Here's my code:
import requests
from bs4 import BeautifulSoup

project = []
url = 'http://books.toscrape.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

links = []
categories = soup.findAll("ul", class_="nav nav-list")
for category in categories:
    hrefs = category.find_all('a', href=True)
    for href in hrefs:
        links.append(href['href'])

new_links = [element.replace("catalogue", "http://books.toscrape.com/catalogue") for element in links]
del new_links[0]

page = 0
books = []
for link in new_links:
    r2 = requests.get(link).text
    book_soup = BeautifulSoup(r2, "html.parser")
    print("category: " + link)
    nextpage = True
    while nextpage:
        book_link = book_soup.find_all(class_="product_pod")
        for product in book_link:
            a = product.find('a')
            full_link = a['href'].replace("../../..", "")
            print("book: " + full_link)
            books.append("http://books.toscrape.com/catalogue" + full_link)
        if book_soup.find('li', class_='next') is None:
            nextpage = False
            page += 1
            print("end of pagination")
        else:
            next_page = book_soup.select_one('li.next>a')
            print(next_page)
The part I'm struggling with is the WHILE loop inside "for link in new_links". I'm mostly looking for any example that could help me. Thanks!
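Since the question asks for an example: one common way to handle an incomplete next-page URL, without rebuilding the base for every category, is `urllib.parse.urljoin`, which resolves a relative href against the URL of the page it was scraped from. A minimal sketch (the category URL below is just an illustrative value):

```python
from urllib.parse import urljoin

# URL of the category page that was just fetched (example value)
current = 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'

# Relative hrefs as they appear in the page's HTML
next_href = 'page-2.html'                                    # from <li class="next"><a href=...>
book_href = '../../../a-light-in-the-attic_1000/index.html'  # from <h3><a href=...>

# urljoin resolves each href against the current page, handling "../" segments
print(urljoin(current, next_href))
# http://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html
print(urljoin(current, book_href))
# http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

With this, the while loop only needs to fetch `urljoin(url, next_a['href'])` whenever a `li.next a` element exists, and stop otherwise, regardless of which category it is paginating.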
If you don't want to scrape the links directly via http://books.toscrape.com/index.html while paginating all results, you could achieve your goal like this:
from bs4 import BeautifulSoup
import requests

base_url = 'http://books.toscrape.com/'
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

books = []
for cat in soup.select('.nav-list ul a'):
    # category landing page, e.g. .../catalogue/category/books/travel_2
    cat_url = base_url + cat.get('href').rsplit('/', 1)[0]
    url = cat_url
    while True:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        books.extend(['http://books.toscrape.com/catalogue/' + a.get('href').strip('../../../') for a in soup.select('article h3 a')])
        if soup.select_one('li.next a'):
            # the next-page href is relative, so prepend the category URL
            url = f"{cat_url}/{soup.select_one('li.next a').get('href')}"
        else:
            break
books
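One caveat about the `.strip('../../../')` call above: `str.strip` takes a *set of characters*, not a prefix, so it removes any leading or trailing `.` and `/` characters. That happens to be safe for these hrefs, but `str.removeprefix` (Python 3.9+) states the intent more precisely:

```python
href = '../../../a-light-in-the-attic_1000/index.html'

# strip() removes characters from the set {'.', '/'} at both ends; it works
# here only because the slug starts and ends with characters outside that set
print(href.strip('../'))               # a-light-in-the-attic_1000/index.html

# removeprefix() (Python 3.9+) deletes the exact leading string instead
print(href.removeprefix('../../../'))  # a-light-in-the-attic_1000/index.html
```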
Because the results are the same, I would recommend skipping the categories entirely:
from bs4 import BeautifulSoup
import requests

base_url = 'http://books.toscrape.com/'
url = 'https://books.toscrape.com/catalogue/page-1.html'

books = []
while True:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for a in soup.select('article h3 a'):
        bsoup = BeautifulSoup(requests.get(base_url + 'catalogue/' + a.get('href')).content, 'html.parser')
        print(base_url + 'catalogue/' + a.get('href'))
        data = {
            'title': bsoup.h1.text.strip(),
            'category': bsoup.select('.breadcrumb li')[-2].text.strip(),
            'url': base_url + 'catalogue/' + a.get('href')
            ### add whatever is needed
        }
        data.update(dict(row.stripped_strings for row in bsoup.select('table tr')))
        books.append(data)
    if soup.select_one('li.next a'):
        url = f"{url.rsplit('/', 1)[0]}/{soup.select_one('li.next a').get('href')}"
    else:
        break
books
[{'title': 'A Light in the Attic',
'category': 'Poetry',
'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'UPC': 'a897fe39b1053632',
'Product Type': 'Books',
'Price (excl. tax)': '£51.77',
'Price (incl. tax)': '£51.77',
'Tax': '£0.00',
'Availability': 'In stock (22 available)',
'Number of reviews': '0'},
{'title': 'Tipping the Velvet',
'category': 'Historical Fiction',
'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
'UPC': '90fa61229261140a',
'Product Type': 'Books',
'Price (excl. tax)': '£53.74',
'Price (incl. tax)': '£53.74',
'Tax': '£0.00',
'Availability': 'In stock (20 available)',
'Number of reviews': '0'},
{'title': 'Soumission',
'category': 'Fiction',
'url': 'http://books.toscrape.com/catalogue/soumission_998/index.html',
'UPC': '6957f44c3847a760',
'Product Type': 'Books',
'Price (excl. tax)': '£50.10',
'Price (incl. tax)': '£50.10',
'Tax': '£0.00',
'Availability': 'In stock (20 available)',
'Number of reviews': '0'},...]
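The `data.update(dict(row.stripped_strings ...))` line works because each row of the product-information table contains exactly one header cell and one value cell, so every `<tr>` yields a (key, value) pair. A minimal offline sketch, with trimmed-down markup assumed to mimic the site's table:

```python
from bs4 import BeautifulSoup

# Cut-down product table in the same shape as books.toscrape.com's
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
# Each <tr> yields two stripped strings (header, value), which dict() pairs up
info = dict(row.stripped_strings for row in soup.select('table tr'))
print(info)
# {'UPC': 'a897fe39b1053632', 'Tax': '£0.00', 'Number of reviews': '0'}
```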