[英]Scraping multiple pages in Python
我正在嘗試抓取包含 12 個鏈接的頁面。 我需要打開每個鏈接並刮掉它們的所有標題。 當我打開每個頁面時,我會在每個鏈接中面對多個頁面。 但是,我的代碼只能抓取所有這 12 個鏈接的第一頁
通過下面的代碼,我可以打印主頁上存在的所有 12 個鏈接 URL。
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html'
res = requests.get (url)
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all("a")
all_urls = []
for link in links[1:]:
link_address ='http://mlg.ucd.ie/modules/COMP41680/assignment2/' + link.get("href")
all_urls.append(link_address)
然后,我把它們都循環了。
for i in range(0,12):
url = all_urls[i]
res = requests.get (url)
soup = BeautifulSoup(res.text, 'html.parser')
標題可以通過以下幾行提取:
title_news = []
news_div = soup.find_all('div', class_ = 'article')
for container in news_div:
title = container.h5.a.text
title_news.append(title)
此代碼的 output 僅包含這 12 個頁面中每一頁的第一頁的標題,而我需要通過這 12 個 URL 中的多個頁面將我的代碼轉換為 go。 如果在適當的循環中定義,則以下內容為我提供了這 12 個鏈接中每個鏈接中存在的所有頁面的鏈接。 (它讀取分頁部分並查找下一頁 URL 鏈接)
page = soup.find('ul', {'class' : 'pagination'}).select('li', {'class': "page-link"})[2].find('a')['href']
我應該如何在我的代碼中使用頁面變量來提取所有這 12 個鏈接中的多個頁面並讀取所有標題,而不僅僅是首頁標題。
您可以使用此代碼從所有頁面獲取所有標題:
import requests
from bs4 import BeautifulSoup
base_url = "http://mlg.ucd.ie/modules/COMP41680/assignment2/"
soup = BeautifulSoup(
requests.get(base_url + "index.html").content, "html.parser"
)
title_news = []
for a in soup.select("#all a"):
next_link = a["href"]
print("Getting", base_url + next_link)
while True:
soup = BeautifulSoup(
requests.get(base_url + next_link).content, "html.parser"
)
for title in soup.select("h5 a"):
title_news.append(title.text)
next_link = soup.select_one('a[aria-label="Next"]')["href"]
if next_link == "#":
break
print("Length of title_news:", len(title_news))
印刷:
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-feb-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-mar-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-apr-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-may-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jun-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jul-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-aug-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-sep-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-oct-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-nov-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-dec-001.html
Length of title_news: 16226
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.