
How to scrape all pages without knowing how many pages there are

I have the following function to collect all the prices, but I'm having trouble getting the total number of pages. How can I loop through all the pages without knowing how many there are?

import requests
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page='+page
    page = requests.get(url)
    soup = BeautifulSoup(page.text,'html.parser')
    price = soup.find_all('h3', {'class' : 'price'})
    price = pd.DataFrame([(p.text) for p in price]).rename(columns = {0:'Price'})
    return price

I tried the following, but it doesn't seem to work:

for pages in itertools.count(start=1):
    try:
        table = get_data('1').append(table)
    except Exception:
        break

This is a good opportunity for recursion, provided you don't expect more than 1000 pages, since I believe Python only allows a maximum stack depth of 1000:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices(page=1, prices=None, depth=0, max_depth=100):

    # use None as the default to avoid Python's mutable-default-argument pitfall
    if prices is None:
        prices = []

    if depth >= max_depth:
        return prices

    url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
    
    r = requests.get(url)
    if not r:
        return prices
    if r.status_code != 200:
        return prices

    soup = BeautifulSoup(r.text, 'html.parser')
    price = soup.find_all('h3', {'class' : 'price'})
    price = pd.DataFrame([(p.text) for p in price]).rename(columns = {0:'Price'})

    prices.append(price)
    
    return get_prices(page=page+1, prices=prices, depth=depth+1)

prices = get_prices()

So the get_prices function first calls itself with its default arguments. It then keeps calling itself, appending each page's prices on every call, until it reaches a page that does not return a 200 status code, or until it hits the maximum recursion depth you specified.
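Note that get_prices() returns a list of per-page DataFrames rather than a single table. A minimal sketch of combining them afterwards with pandas (the price strings here are made up for illustration; only the 'Price' column name comes from the code above):

```python
import pandas as pd

# Stand-ins for the per-page DataFrames that get_prices() collects
page1 = pd.DataFrame({'Price': ['$1,200,000', '$980,000']})
page2 = pd.DataFrame({'Price': ['$2,450,000']})
prices = [page1, page2]

# Concatenate into one DataFrame, renumbering the row index
all_prices = pd.concat(prices, ignore_index=True)
print(all_prices.shape)  # (3, 1)
```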

Alternatively, if you don't like recursion, or you need to query more than 1000 pages at a time, you can use a simpler, if less interesting, while loop:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices():

    prices=[]
    page = 1

    while True:

        url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
        
        r = requests.get(url)
        if not r:
            break
        if r.status_code != 200:
            break

        soup = BeautifulSoup(r.text, 'html.parser')
        price = soup.find_all('h3', {'class' : 'price'})
        price = pd.DataFrame([(p.text) for p in price]).rename(columns = {0:'Price'})

        prices.append(price)

        page += 1
    
    return prices

prices = get_prices()

Try this:

from urllib.request import urlopen
import itertools

from bs4 import BeautifulSoup
import pandas as pd

def get_data(prices, page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    html = urlopen(url)  # raises HTTPError on a bad page, which ends the loop below
    soup = BeautifulSoup(html, 'html.parser')
    price = soup.find_all('h3', {'class' : 'price'})
    # store each page's prices in the dict, keyed by page number
    prices[page] = pd.DataFrame([p.text for p in price]).rename(columns = {0:'Price'})

prices = dict()
for page in itertools.count(start=1):
    try:
        get_data(prices, str(page))
    except Exception:
        break

Maybe you should change `get_data('1')` to `get_data(str(page))`?
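In other words, the loop in the question always fetched page 1, so it never advanced. A minimal sketch of the corrected pattern, with a stub in place of the real network call so the control flow is visible (the 3-page limit is made up for illustration):

```python
import itertools

def get_data(page):
    # Stub standing in for the real request; pretend the site has
    # exactly 3 pages and anything beyond that raises.
    if int(page) > 3:
        raise ValueError('no such page')
    return 'prices-for-page-' + page

tables = []
for page in itertools.count(start=1):
    try:
        # str(page), not the hard-coded '1' from the question
        tables.append(get_data(str(page)))
    except Exception:
        break

print(len(tables))  # 3
```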

