How to scrape all pages without knowing how many pages there are
I have the following function to collect all the prices, but I'm having trouble getting the total number of pages. How can I walk through all the pages without knowing how many there are?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    return price
I tried the following, but it doesn't seem to work:
for pages in itertools.count(start=1):
    try:
        table = get_data('1').append(table)
    except Exception:
        break
This is a good opportunity for recursion, provided you don't expect more than 1000 pages, since CPython's default maximum recursion depth is 1000:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices(page=1, prices=None, depth=0, max_depth=100):
    if depth >= max_depth:
        return prices
    if prices is None:  # avoid a shared mutable default argument
        prices = []
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
    r = requests.get(url)
    if r.status_code != 200:  # stop once the next page does not exist
        return prices
    soup = BeautifulSoup(r.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    prices.append(price)
    return get_prices(page=page + 1, prices=prices, depth=depth + 1)

prices = get_prices()
So the get_prices function first calls itself with its default arguments. It then keeps calling itself, adding the prices collected on each call, until it reaches a page that does not return status code 200, or until it hits the maximum recursion depth you specified.
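As an aside, the depth limit mentioned above is CPython's default recursion limit, which can be inspected and, if genuinely needed, raised via the standard sys module:

```python
import sys

# CPython's default recursion limit is 1000 stack frames.
limit = sys.getrecursionlimit()
print(limit)

# The limit can be raised for deeper recursion, at the cost of
# risking a hard crash if the underlying C stack overflows.
sys.setrecursionlimit(2000)
```

Raising the limit is rarely the right fix, though; for unbounded page counts the iterative version below is safer.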
Alternatively, if you don't like recursion, or you need to query more than 1000 pages in one go, you can use a simpler but less interesting while loop:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices():
    prices = []
    page = 1
    while True:
        url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
        r = requests.get(url)
        if r.status_code != 200:  # stop once the next page does not exist
            break
        soup = BeautifulSoup(r.text, 'html.parser')
        price = soup.find_all('h3', {'class': 'price'})
        price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
        prices.append(price)
        page += 1
    return prices

prices = get_prices()
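Both versions return a list of one-column DataFrames, one per page. If a single combined table is wanted, pandas can concatenate them; the frames below are hypothetical stand-ins for real scraped pages:

```python
import pandas as pd

# Hypothetical per-page results, shaped like the frames get_prices() collects.
frames = [
    pd.DataFrame({'Price': ['$1,200,000', '$950,000']}),
    pd.DataFrame({'Price': ['$780,000']}),
]

# Stack them into one table with a fresh index.
all_prices = pd.concat(frames, ignore_index=True)
print(len(all_prices))  # 3 rows
```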
Try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(prices, page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.find_all('h3', {'class': 'price'})
    prices[page] = pd.DataFrame([p.text for p in tags]).rename(columns={0: 'Price'})

prices = dict()
for page in itertools.count(start=1):
    try:
        get_data(prices, str(page))
    except Exception:
        break
Perhaps you should change get_data('1') to get_data(str(page))?
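Building on that hint: besides always fetching page 1, the original loop also appends to a `table` variable that does not exist on the first iteration. A minimal sketch of a corrected loop, using a hypothetical stub in place of the real get_data so it runs offline:

```python
import itertools
import pandas as pd

# Hypothetical stub standing in for the asker's get_data: it returns a
# one-row DataFrame for pages 1-3 and raises for any later page,
# simulating a request that fails once the pages run out.
def get_data(page):
    if int(page) > 3:
        raise ValueError('no such page')
    return pd.DataFrame({'Price': ['$%s00,000' % page]})

tables = []  # collect per-page frames instead of appending to an undefined name
for page in itertools.count(start=1):
    try:
        tables.append(get_data(str(page)))
    except Exception:
        break

table = pd.concat(tables, ignore_index=True)
print(len(table))  # 3 rows, one per scraped page
```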