Python getting incomplete next page URL (BeautifulSoup, Requests)
I'm very new to Python and web scraping; my project uses http://books.toscrape.com/index.html, and I'm stuck on the pagination logic. So far I've managed to get every category, the book links, and the information I need, but I'm struggling to scrape the next-page URL for each category. The first problem is that the next-page URL is incomplete (that I can manage); the second is that the base URL has to change for each category. Here's my code:
import requests
from bs4 import BeautifulSoup

project = []
url = 'http://books.toscrape.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

links = []
categories = soup.findAll("ul", class_="nav nav-list")
for category in categories:
    hrefs = category.find_all('a', href=True)
    for href in hrefs:
        links.append(href['href'])

new_links = [element.replace("catalogue", "http://books.toscrape.com/catalogue") for element in links]
del new_links[0]

page = 0
books = []
for link in new_links:
    r2 = requests.get(link).text
    book_soup = BeautifulSoup(r2, "html.parser")
    print("category: " + link)
    nextpage = True
    while nextpage:
        book_link = book_soup.find_all(class_="product_pod")
        for product in book_link:
            a = product.find('a')
            full_link = a['href'].replace("../../..", "")
            print("book: " + full_link)
            books.append("http://books.toscrape.com/catalogue" + full_link)
        if book_soup.find('li', class_='next') is None:
            nextpage = False
            page += 1
            print("end of pagination")
        else:
            next_page = book_soup.select_one('li.next>a')
            print(next_page)
The part I'm struggling with is the WHILE loop inside "for link in new_links". I'm mostly looking for any example that could help me. Thanks!
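Since the question asks for an example: one common way to handle an incomplete next-page URL, without rebuilding the base for every category, is `urllib.parse.urljoin`, which resolves a relative href against the URL of the page it was scraped from. A minimal sketch (the category URL below is just an illustrative value):

```python
from urllib.parse import urljoin

# URL of the category page that was just fetched (example value)
current = 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'

# Relative hrefs as they appear in the page's HTML
next_href = 'page-2.html'                                    # from <li class="next"><a href=...>
book_href = '../../../a-light-in-the-attic_1000/index.html'  # from <h3><a href=...>

# urljoin resolves each href against the current page, handling "../" segments
print(urljoin(current, next_href))
# http://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html
print(urljoin(current, book_href))
# http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

With this, the while loop only needs to fetch `urljoin(url, next_a['href'])` whenever a `li.next a` element exists, and stop otherwise, regardless of which category it is paginating.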
If you don't want to scrape the links directly via http://books.toscrape.com/index.html while paginating all results, you could achieve your goal like this:
from bs4 import BeautifulSoup
import requests

base_url = 'http://books.toscrape.com/'
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

books = []
for cat in soup.select('.nav-list ul a'):
    # category landing page, e.g. .../catalogue/category/books/travel_2
    cat_url = base_url + cat.get('href').rsplit('/', 1)[0]
    url = cat_url
    while True:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        books.extend(['http://books.toscrape.com/catalogue/' + a.get('href').strip('../../../') for a in soup.select('article h3 a')])
        if soup.select_one('li.next a'):
            # the next-page href is relative, so prepend the category URL
            url = f"{cat_url}/{soup.select_one('li.next a').get('href')}"
        else:
            break
books
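One caveat about the `.strip('../../../')` call above: `str.strip` takes a *set of characters*, not a prefix, so it removes any leading or trailing `.` and `/` characters. That happens to be safe for these hrefs, but `str.removeprefix` (Python 3.9+) states the intent more precisely:

```python
href = '../../../a-light-in-the-attic_1000/index.html'

# strip() removes characters from the set {'.', '/'} at both ends; it works
# here only because the slug starts and ends with characters outside that set
print(href.strip('../'))               # a-light-in-the-attic_1000/index.html

# removeprefix() (Python 3.9+) deletes the exact leading string instead
print(href.removeprefix('../../../'))  # a-light-in-the-attic_1000/index.html
```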
Because the results are the same, I would recommend skipping the categories entirely:
from bs4 import BeautifulSoup
import requests

base_url = 'http://books.toscrape.com/'
url = 'https://books.toscrape.com/catalogue/page-1.html'

books = []
while True:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for a in soup.select('article h3 a'):
        bsoup = BeautifulSoup(requests.get(base_url + 'catalogue/' + a.get('href')).content, 'html.parser')
        print(base_url + 'catalogue/' + a.get('href'))
        data = {
            'title': bsoup.h1.text.strip(),
            'category': bsoup.select('.breadcrumb li')[-2].text.strip(),
            'url': base_url + 'catalogue/' + a.get('href')
            ### add whatever is needed
        }
        data.update(dict(row.stripped_strings for row in bsoup.select('table tr')))
        books.append(data)
    if soup.select_one('li.next a'):
        url = f"{url.rsplit('/', 1)[0]}/{soup.select_one('li.next a').get('href')}"
    else:
        break
books
[{'title': 'A Light in the Attic',
'category': 'Poetry',
'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'UPC': 'a897fe39b1053632',
'Product Type': 'Books',
'Price (excl. tax)': '£51.77',
'Price (incl. tax)': '£51.77',
'Tax': '£0.00',
'Availability': 'In stock (22 available)',
'Number of reviews': '0'},
{'title': 'Tipping the Velvet',
'category': 'Historical Fiction',
'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
'UPC': '90fa61229261140a',
'Product Type': 'Books',
'Price (excl. tax)': '£53.74',
'Price (incl. tax)': '£53.74',
'Tax': '£0.00',
'Availability': 'In stock (20 available)',
'Number of reviews': '0'},
{'title': 'Soumission',
'category': 'Fiction',
'url': 'http://books.toscrape.com/catalogue/soumission_998/index.html',
'UPC': '6957f44c3847a760',
'Product Type': 'Books',
'Price (excl. tax)': '£50.10',
'Price (incl. tax)': '£50.10',
'Tax': '£0.00',
'Availability': 'In stock (20 available)',
'Number of reviews': '0'},...]
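The `data.update(dict(row.stripped_strings ...))` line works because each row of the product-information table contains exactly one header cell and one value cell, so every `<tr>` yields a (key, value) pair. A minimal offline sketch, with trimmed-down markup assumed to mimic the site's table:

```python
from bs4 import BeautifulSoup

# Cut-down product table in the same shape as books.toscrape.com's
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
# Each <tr> yields two stripped strings (header, value), which dict() pairs up
info = dict(row.stripped_strings for row in soup.select('table tr'))
print(info)
# {'UPC': 'a897fe39b1053632', 'Tax': '£0.00', 'Number of reviews': '0'}
```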