
Can't get all results from TripAdvisor using Python and BeautifulSoup due to pagination

I am trying to get the links of restaurants, but I can only get the first 30 and none of the others. There are hundreds of restaurants in the Madrid area, but the pagination shows only 30 per page, and the following code only gets those 30:

import re
import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup as b

city_name = 'Madrid'
geo_code = '187514'

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

data = requests.get(
    "https://www.tripadvisor.com/Restaurants-g{}-{}.html".format(geo_code, city_name), headers=headers
).text

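# Write each restaurant link found in the page's embedded JSON to a file.
# "restaurant_links.txt" is a placeholder name; the original snippet never defined `f`.
with open("restaurant_links.txt", "w") as f:
    for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
        next_link = "https://www.tripadvisor.com.sg/" + link
        print(next_link)
        f.write('%s\n' % next_link)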

Found the solution: I had to add "ao" followed by the number of the first result on the page to the URL, like this:

data = requests.get("https://www.tripadvisor.com/Restaurants-g{}-{}-{}.html".format(geo_code, city_name, n_review), headers=headers).text
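
To walk through every page rather than a single one, the offset can be stepped by 30 until a page yields no new links. The following is a minimal sketch under several assumptions: the "ao<number>" spelling and its position in the URL are taken from the description above and have not been verified against the site (compare with the URL of the site's own "next page" link), and the helper names page_url, extract_links and collect_links, as well as the max_pages cap, are hypothetical. The page download is passed in as a function so the loop itself does not depend on any particular HTTP library.

```python
import re

PAGE_SIZE = 30  # the question states that each page lists 30 restaurants


def page_url(geo_code, city_name, offset):
    """Build the listing URL for a given result offset (hypothetical helper)."""
    if offset == 0:
        # First page: the URL from the question, without an offset segment.
        return "https://www.tripadvisor.com/Restaurants-g{}-{}.html".format(
            geo_code, city_name
        )
    # Assumption: offset segment written as "ao<number>", as described above.
    n_review = "ao{}".format(offset)
    return "https://www.tripadvisor.com/Restaurants-g{}-{}-{}.html".format(
        geo_code, city_name, n_review
    )


def extract_links(html):
    """Return the detailPageUrl values embedded in the page source."""
    return re.findall(r'"detailPageUrl":"(.*?)"', html)


def collect_links(fetch, geo_code, city_name, max_pages=50):
    """Request page after page until one contributes no new links.

    `fetch` is any callable that takes a URL and returns the page text.
    `max_pages` is an arbitrary safety cap, not a site limit.
    """
    links = []
    seen = set()
    for page in range(max_pages):
        html = fetch(page_url(geo_code, city_name, page * PAGE_SIZE))
        new_links = [l for l in extract_links(html) if l not in seen]
        if not new_links:
            break  # empty or repeated page: assume the listing has ended
        links.extend(new_links)
        seen.update(new_links)
    return links


# Usage with requests (not executed here):
#   import requests
#   fetch = lambda url: requests.get(url, headers=headers).text
#   all_links = collect_links(fetch, "187514", "Madrid")
```

Stopping when a page adds nothing new guards against a site that serves the first page again for an out-of-range offset, which would otherwise make the loop run until the cap.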


 