Beautifulsoup/Selenium: how to scrape a website until the "next" page is disabled?
import requests
from bs4 import BeautifulSoup

url = 'https://www.tripadvisor.ie/Attraction_Review-g295424-d2038312-Reviews-Global_Village-Dubai_Emirate_of_Dubai.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

def get_links():
    # Collect the href of every review-title anchor on the page
    review_links = []
    for review_link in soup.find_all('a', {'class': 'title'}, href=True):
        review_links.append(review_link['href'])
    return review_links

link = 'https://www.tripadvisor.ie'
review_urls = []
for i in get_links():
    review_url = link + i
    print(review_url)
    review_urls.append(review_url)
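This code saves all the review hyperlinks found on this one page, but I want to scrape the hyperlinks from every page, up to page 319. I can't work out how to keep going until the pagination "next" button is disabled.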
You can change a single parameter in the URL to loop over the pages and get all the reviews. So I just added a loop and requested every URL:
def get_page(index):
    # "or{}" in the URL is the review offset: 0, 10, 20, ...
    url = "https://www.tripadvisor.ie/Attraction_Review-g295424-d2038312-Reviews-or{}-Global_Village-Dubai_Emirate_of_Dubai.html".format(index)
    html = requests.get(url)
    # soup is the BeautifulSoup class, imported under that alias in the full code
    page = soup(html.text, 'html.parser')
    return page

nb_review = 3187
for i in range(0, nb_review, 10):
    page = get_page(i)
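To make the offset arithmetic concrete, here is a minimal sketch that only builds the page URLs and does no network requests. The helper names `page_offsets` and `page_urls` are made up for illustration; the page size of 10 is taken from the loop step above.

```python
# URL template from the answer; "{}" is the review offset.
BASE = ("https://www.tripadvisor.ie/Attraction_Review-g295424-d2038312-"
        "Reviews-or{}-Global_Village-Dubai_Emirate_of_Dubai.html")

def page_offsets(total_reviews, page_size=10):
    """Return the offsets 0, 10, 20, ... (hypothetical helper)."""
    return list(range(0, total_reviews, page_size))

def page_urls(total_reviews, page_size=10):
    """Return one listing URL per offset (hypothetical helper)."""
    return [BASE.format(offset) for offset in page_offsets(total_reviews, page_size)]

urls = page_urls(3187)
print(len(urls))   # number of listing pages to request
print(urls[1])     # second page, offset 10
```

With 3187 reviews and 10 per page this yields 319 URLs, which matches the 319 pages mentioned in the question.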
The full code, using your snippet, is:
from bs4 import BeautifulSoup as soup
import requests

def get_page(index):
    # Fetch and parse the listing page that starts at review offset `index`
    url = "https://www.tripadvisor.ie/Attraction_Review-g295424-d2038312-Reviews-or{}-Global_Village-Dubai_Emirate_of_Dubai.html".format(index)
    html = requests.get(url)
    page = soup(html.text, 'html.parser')
    return page

def get_links(page):
    # Collect the href of every review-title anchor on the page
    review_links = []
    for review_link in page.find_all('a', {'class': 'title'}, href=True):
        review_links.append(review_link['href'])
    return review_links

link = 'https://www.tripadvisor.ie'
review_urls = []
nb_review = 3187
for i in range(0, nb_review, 10):
    page = get_page(i)
    # Use a separate name here so the outer loop variable is not shadowed
    for href in get_links(page):
        review_url = link + href
        review_urls.append(review_url)

print(len(review_urls))
Output:
3187
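If you would rather not hard-code the review count, one possible alternative (not part of the code above) is to keep requesting offsets until a page contributes no new links. The sketch below assumes that an out-of-range offset returns either no review links or a repeat of links already seen; that behaviour is an assumption and should be checked against the live site. `collect_links`, `fetch_links` and `max_pages` are hypothetical names.

```python
def collect_links(fetch_links, page_size=10, max_pages=1000):
    """Call fetch_links(offset) for offsets 0, 10, 20, ... and stop at the
    first page that yields no link we have not already seen."""
    seen = set()
    ordered = []
    for page_number in range(max_pages):  # hard cap as a safety net
        offset = page_number * page_size
        found_new = False
        for href in fetch_links(offset):
            if href not in seen:
                seen.add(href)
                ordered.append(href)
                found_new = True
        if not found_new:
            break
    return ordered

# Stand-in fetcher with 25 fake links, used instead of real requests.
def fake_fetch(offset):
    data = ["/ShowUserReviews-{}".format(n) for n in range(25)]
    return data[offset:offset + 10]

print(len(collect_links(fake_fetch)))
```

With the functions from the full code, the fetcher could be `lambda offset: get_links(get_page(offset))`.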
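Edit:
You can of course scrape the first page to read the number of reviews, so the count is no longer hard-coded and the code becomes more flexible.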
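A rough sketch of that idea, under a loud assumption: that the first page's HTML contains text of the form "3,187 reviews". The answer does not say where the count appears, so the pattern and the helper name `parse_review_count` are hypothetical and would need adjusting to the real markup.

```python
import re

def parse_review_count(text):
    """Return the first number followed by the word 'reviews', or None.
    The text pattern is an assumption, not confirmed against the site."""
    match = re.search(r"(\d[\d,\.]*)\s+reviews", text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop thousands separators such as "," or "."
    return int(re.sub(r"\D", "", match.group(1)))

sample = '<span class="count">3,187 reviews</span>'  # made-up markup
print(parse_review_count(sample))
```

The result could then replace the hard-coded value, for example `nb_review = parse_review_count(requests.get(url).text)`, with a fallback for the case where it returns `None`.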