When using Beautiful Soup to scrape reviews, I run into a problem with the "All Audience" reviews: the URL doesn't update when I move between pages of the review list.
Here is an example: https://www.rottentomatoes.com/m/midsommar/reviews?type=user
The URL does not change when I click next.
Based on answers in another thread, I tried tracking the XHR requests (I might be describing this wrong). I believe the exact script being run is the one I highlighted in a picture (I don't have 10 reputation, so I can't post the image).
When I look at the headers of that GET request I see a Request URL, and when I open it, it has all of the info I need. The problem is that I don't know their naming convention for going to the next page. Below is how the Request URLs change between pages.
How can I get Beautiful Soup to iterate over these?
Thanks!
The code below should be enough to reproduce this; ignore some of the naming.
import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup as soup

# Build the movie slug used in the URL, e.g. "The Lion King" -> "the_lion_king"
x = input('What Movie?').replace(" ", "_").lower()
req_rot = Request('https://www.rottentomatoes.com/m/' + x + '/reviews?type=user', headers={'User-Agent': 'Mozilla/5.0'})
webpage_rot = urlopen(req_rot).read()
page_soup_rot = soup(webpage_rot, "html.parser")
# Only the first page of audience reviews is present in this HTML
reviews_rot = page_soup_rot.find_all("div", {"class": "audience-reviews__review-wrap"})
z_rot = re.findall(r'js-clamp"(.+)</p>', str(reviews_rot))
# Replace every non-word character with a space, then split into words
Movie_Adj_rot = re.sub(r"[^\w]", " ", str(z_rot)).split()
A better name for this issue is windowed pagination. The simplest solution I found was to learn Selenium and call a scrape function inside a ranged loop that clicks the next-button element on each page.
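The loop described here could look roughly like the sketch below. It is only an illustration, not tested against the live site: the function names, `NEXT_BUTTON_SELECTOR`, and the fixed `time.sleep` wait are all assumptions, and the real selector for the next button has to be read from the page with the browser's inspector. The pagination loop takes the driver and two callables, so the scraping logic stays separate from the clicking logic.

```python
import time

# Placeholder only: inspect the page to find the real selector for the next button.
NEXT_BUTTON_SELECTOR = "button.next-page-placeholder"


def scrape_all_pages(driver, scrape_page, click_next, max_pages=10):
    """Scrape the current page, click next, and repeat for at most max_pages pages.

    driver      -- any object with a `page_source` attribute (e.g. a Selenium WebDriver)
    scrape_page -- callable taking an HTML string and returning a list of results
    click_next  -- callable taking the driver; returns False when there is no next page
    """
    results = []
    for _ in range(max_pages):
        results.extend(scrape_page(driver.page_source))
        if not click_next(driver):
            break
    return results


def click_next_selenium(driver, wait_seconds=2.0):
    """Click the next button with Selenium; return False if it is missing or disabled."""
    # Imported lazily so the loop above can be used without Selenium installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    try:
        button = driver.find_element(By.CSS_SELECTOR, NEXT_BUTTON_SELECTOR)
    except NoSuchElementException:
        return False
    if not button.is_enabled():
        return False
    button.click()
    # Crude fixed wait for the next batch of reviews to load (an assumption;
    # an explicit wait on a page element would be more robust).
    time.sleep(wait_seconds)
    return True
```

To use it, open the reviews URL from the question with a Selenium driver, pass `click_next_selenium` as `click_next`, and pass a `scrape_page` function that runs the Beautiful Soup parsing from the question on each `page_source` string.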