How to scrape a website while iterating over multiple pages
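I am trying to scrape this website with Python and BeautifulSoup: https://www.leandjaya.com/katalog
I have run into some difficulty navigating the site's multiple pages while scraping it. The site has 11 pages, and I would like to know the best way to handle this, for example a for loop that breaks when a page does not exist.
This is my initial code. I set a large upper bound of 50, but that does not seem like a good option.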
import requests
from bs4 import BeautifulSoup

page = 1
while page != 50:
    url = f"https://www.leandjaya.com/katalog/ss/1/{page}/"
    main = requests.get(url)
    pmain = BeautifulSoup(main.text, 'lxml')
    page = page + 1
Sample output:
https://www.leandjaya.com/katalog/ss/1/1/
https://www.leandjaya.com/katalog/ss/1/2/
https://www.leandjaya.com/katalog/ss/1/3/
https://www.leandjaya.com/katalog/ss/1/<49>/
Here is one way to extract that information and display it in a dataframe when the number of pages is not known in advance:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

cars_list = []
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
counter = 1
while True:
    try:
        print('page:', counter)
        url = f'https://www.leandjaya.com/katalog/ss/1/{counter}/'
        r = s.get(url)
        soup = bs(r.text, 'html.parser')
        cars_cards = soup.select('div.item')
        if len(cars_cards) < 1:
            print('all done, no cars left')
            break
        for car in cars_cards:
            car_name = car.select_one('div.item-title').get_text(strip=True)
            car_price = car.select_one('div.item-price').get_text(strip=True)
            cars_list.append((car_name, car_price))
        counter = counter + 1
    except Exception as e:
        print('all done')
        break
df = pd.DataFrame(cars_list, columns = ['Car', 'Price'])
print(df)
Result:
page: 1
page: 2
page: 3
page: 4
page: 5
page: 6
page: 7
page: 8
page: 9
page: 10
page: 11
page: 12
all done, no cars left
     Car                                          Price
0    HONDA CRV 4X2 2.0 AT 2001                    DP20jt
1    DUJUAL XPANDER 1.5 GLS 2018 MANUAL           DP53jt
2    NISSAN JUKE 1.5 CVT 2011 MATIC               DP33jt
3    Mitsubishi Xpander 1.5 Exceed Manual 2018    DP50jt
4    BMW X1 2.0 AT SDRIVE 2011                    DP55jt
...  ...                                          ...
146  Daihatsu Sigra 1.2 R AT                      DP130jt
147  Daihatsu Xenia Xi 2010                       DP85jt
148  Suzuki Mega Carry Pick Up 1.5                DP90jt
149  Honda Mobilio Tipe E Prestige                DP150jt
150  Honda Freed Tipe S                           Rp. 170jtRp. 165jt
151 rows × 2 columns
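The stop-on-empty-page idea used above can also be isolated from the network code, which makes it easier to test. The following is only a minimal sketch: `iter_pages`, `fetch_rows`, `max_pages`, and `fake_fetch` are hypothetical names introduced for illustration, and `fake_fetch` is a stand-in that pretends the site has three pages instead of making real requests.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[str, str]  # (car name, price text)


def iter_pages(
    fetch_rows: Callable[[int], List[Row]],
    start: int = 1,
    max_pages: int = 1000,
) -> Iterator[Row]:
    """Yield rows page by page until a page comes back empty.

    max_pages is only a safety cap so the loop cannot run forever
    if the site never returns an empty page.
    """
    for page in range(start, start + max_pages):
        rows = fetch_rows(page)
        if not rows:  # an empty page is treated as the end of the catalog
            return
        yield from rows


# Hypothetical stand-in for the real request: 3 pages with 2 cars each.
def fake_fetch(page: int) -> List[Row]:
    if page > 3:
        return []
    return [(f"car {page}-{i}", f"DP{page}{i}jt") for i in range(2)]


rows = list(iter_pages(fake_fetch))
print(len(rows))
```

In real use, `fetch_rows` would wrap the `s.get(url)` and `soup.select('div.item')` steps from the answer and return the `(car_name, car_price)` tuples for one page.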
Documentation for the packages used above can be found at:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html