
How to scrape a website while iterating over multiple pages

I am trying to scrape this website with Python and BeautifulSoup: https://www.leandjaya.com/katalog

I am having some trouble navigating the site's multiple pages and scraping them with Python. The site has 11 pages, and I would like to know the best way to handle this, for example a loop that breaks once a page no longer exists.

This is my initial code. I set a big number, 50, as the upper bound, but that does not seem like a good option.
import requests
from bs4 import BeautifulSoup

page = 1
while page != 50:  # arbitrary upper bound; the site only has 11 pages
    url = f"https://www.leandjaya.com/katalog/ss/1/{page}/"
    main = requests.get(url)
    pmain = BeautifulSoup(main.text, 'lxml')
    page = page + 1
Sample output:
https://www.leandjaya.com/katalog/ss/1/1/
https://www.leandjaya.com/katalog/ss/1/2/
https://www.leandjaya.com/katalog/ss/1/3/
https://www.leandjaya.com/katalog/ss/1/<49>/
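
One way to avoid a hard-coded upper bound is to read the highest page number from the site's pager before looping. The sketch below is only an illustration of that idea: the 'ul.pagination a' selector is an assumption about the site's markup and would need to be adjusted to whatever element actually holds the page links.

import requests
from bs4 import BeautifulSoup

# Fetch the first catalogue page and look at its pagination links.
first = requests.get("https://www.leandjaya.com/katalog/ss/1/1/")
soup = BeautifulSoup(first.text, "lxml")

# Hypothetical selector: adjust to the element that actually holds the pager.
page_links = soup.select("ul.pagination a")
page_numbers = [int(a.get_text(strip=True)) for a in page_links
                if a.get_text(strip=True).isdigit()]
last_page = max(page_numbers) if page_numbers else 1

for page in range(1, last_page + 1):
    url = f"https://www.leandjaya.com/katalog/ss/1/{page}/"
    pmain = BeautifulSoup(requests.get(url).text, "lxml")
    # ... extract the listings from pmain here ...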

Here is one way to extract that information and display it in a DataFrame, given an unknown number of pages of data:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

cars_list = []
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)

counter = 1
while True:
    try:
        print('page:', counter)
        url = f'https://www.leandjaya.com/katalog/ss/1/{counter}/'
        r = s.get(url)
        soup = bs(r.text, 'html.parser')
        cars_cards = soup.select('div.item')
        # an empty page means we have run past the last page of the catalogue
        if len(cars_cards) < 1:
            print('all done, no cars left')
            break
        for car in cars_cards:
            car_name = car.select_one('div.item-title').get_text(strip=True)
            car_price = car.select_one('div.item-price').get_text(strip=True)
            cars_list.append((car_name, car_price))
        counter = counter + 1
    except Exception as e:
        print('all done')
        break

df = pd.DataFrame(cars_list, columns=['Car', 'Price'])
print(df)

Result:

page: 1
page: 2
page: 3
page: 4
page: 5
page: 6
page: 7
page: 8
page: 9
page: 10
page: 11
page: 12
all done, no cars left
                                           Car               Price
  0                  HONDA CRV 4X2 2.0 AT 2001              DP20jt
  1         DUJUAL XPANDER 1.5 GLS 2018 MANUAL              DP53jt
  2             NISSAN JUKE 1.5 CVT 2011 MATIC              DP33jt
  3  Mitsubishi Xpander 1.5 Exceed Manual 2018              DP50jt
  4                  BMW X1 2.0 AT SDRIVE 2011              DP55jt
 ..                                        ...                 ...
146                    Daihatsu Sigra 1.2 R AT             DP130jt
147                     Daihatsu Xenia Xi 2010              DP85jt
148              Suzuki Mega Carry Pick Up 1.5              DP90jt
149              Honda Mobilio Tipe E Prestige             DP150jt
150                         Honda Freed Tipe S  Rp. 170jtRp. 165jt

[151 rows x 2 columns]
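
If the scraped data needs to be kept around, the resulting DataFrame can be written to disk afterwards; a minimal example (the file name here is arbitrary):

df.to_csv('leandjaya_cars.csv', index=False)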

The relevant documentation for the packages used above can be found at:

https://beautiful-soup-4.readthedocs.io/en/latest/index.html

https://requests.readthedocs.io/en/latest/

https://pandas.pydata.org/pandas-docs/stable/index.html
