
BeautifulSoup web scraping multiple pages URL doesn't change

When using Beautiful Soup to scrape reviews, I run into an issue with the "All Audience" reviews: the URL doesn't update when I move between review list pages.

Here is an example: https://www.rottentomatoes.com/m/midsommar/reviews?type=user

No change in the URL is made when clicking next.

Based on answers in another thread, I tried tracking the XHR requests (I might be describing this wrong). I believe the exact request being made is the one I have highlighted in the picture here (I don't have 10 reputation, so I can't post the image inline).

Network Method Post

When I look at the headers of that GET request, I see a Request URL, and when I open it, it has all of the info I need. The problem is that I don't know their naming convention for going to the next page. Below is how the Request URLs change between pages.

Request URL page 1->2

Request URL page 2->3

How can I get Beautiful Soup to iterate over these?

Thanks!

Below should be enough code to attempt this; ignore some of the naming.

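import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup as soup

# Turn the movie title into the slug used in the URL, e.g. "midsommar"
x = input('What Movie?').replace(" ", "_").lower()

# Send a browser-like User-Agent with the request
req_rot = Request('https://www.rottentomatoes.com/m/' + x + '/reviews?type=user', headers={'User-Agent': 'Mozilla/5.0'})

webpage_rot = urlopen(req_rot).read()

page_soup_rot = soup(webpage_rot, "html.parser")

# Only the first page of reviews is present in this HTML
reviews_rot = page_soup_rot.find_all("div", {"class": "audience-reviews__review-wrap"})

# Pull the review text out of the serialized markup
z_rot = re.findall(r'js-clamp"(.+)</p>', str(reviews_rot))

# Replace non-word characters with spaces and split into words
Movie_Adj_rot = re.sub(r"[^\w]", " ", str(z_rot)).split()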

A better description of this issue is windowed pagination. The simplest solution I found was to learn Selenium and call a scrape function inside a ranged loop that clicks the Next button element on each page.
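A minimal sketch of that loop, assuming a Selenium 4 style driver object that is already open on the reviews page. The function name scrape_pages, the pause length, and both CSS selectors are placeholders of mine, not something taken from the site; the Next button selector in particular has to be looked up in the browser's inspector.

```python
import time


def scrape_pages(driver, max_pages, review_css, next_css, pause=1.0):
    """Collect review text from up to max_pages pages by clicking Next.

    driver is expected to behave like a Selenium WebDriver that has already
    loaded the reviews page. review_css and next_css are hypothetical CSS
    selectors supplied by the caller.
    """
    texts = []
    for page in range(max_pages):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        for element in driver.find_elements("css selector", review_css):
            texts.append(element.text)

        if page == max_pages - 1:
            break  # page limit reached, no need to click again

        buttons = driver.find_elements("css selector", next_css)
        if not buttons or not buttons[0].is_enabled():
            break  # no usable Next button, so this was the last page

        buttons[0].click()
        time.sleep(pause)  # crude wait for the next batch of reviews to render
    return texts
```

With a real browser this would be called after something like driver = webdriver.Chrome() and driver.get(url), passing "div.audience-reviews__review-wrap" (the class from the question) as review_css. A fixed sleep is the simplest way to wait; an explicit wait on the review elements would likely be more robust.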


 