
Web scrape multiple web pages from a website using BeautifulSoup and requests in Python

I want to scrape multiple web pages on a website. Right now my code can scrape reviews from the first page, and I would like it to scrape reviews from the following pages as well, in this example up to page 8. This is the link to the website: https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
reviews = []  # a list to store reviews

# Use a CSS selector to extract all the review containers
review_divs = soup.select('div.col-10.review')
for element in review_divs:
    review = {'Review_Title': element.a.text,
              'URL': element.a['href'],
              'Review': element.find('div', {'class': ['more', 'reviewdata']}).text.strip()}
    reviews.append(review)

df = pd.DataFrame(reviews)
print(df)

I want to store all the reviews from the 8 pages in one dataframe df. I would appreciate the help. Thank you.

Switch to the next page after scraping all reviews from the first page, and repeat until you have all the reviews. Just make your program click on the "next page" arrow at the bottom to proceed, as in the sketch below.
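
If you want to do it by actually clicking, a browser automation tool such as Selenium can drive the page for you. Below is a minimal sketch under the assumption that the next-page arrow can be located with a CSS selector; the selector 'li.next a' is only a placeholder, so inspect the page and substitute the real one.

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218"

driver = webdriver.Chrome()  # Selenium 4.6+ manages the driver itself; older versions need chromedriver on PATH
driver.get(URL)

reviews = []
for _ in range(8):  # this product has 8 pages of reviews
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    for element in soup.select('div.col-10.review'):
        reviews.append({'Review_Title': element.a.text,
                        'URL': element.a['href'],
                        'Review': element.find('div', {'class': ['more', 'reviewdata']}).text.strip()})
    try:
        # 'li.next a' is a hypothetical selector for the "next page" arrow -- check the actual markup
        driver.find_element(By.CSS_SELECTOR, 'li.next a').click()
        time.sleep(2)  # give the next page a moment to load
    except Exception:
        break  # no clickable next-page arrow, so stop

driver.quit()
df = pd.DataFrame(reviews)
print(df)

That said, since the page URLs follow a predictable pattern here, the plain requests loop in the answer below is simpler and avoids running a browser.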

So this is the first page: https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218 . The rest of the pages have -page-x appended to the end of the URL, so you can just use a for loop in your script, like this.

import requests
from bs4 import BeautifulSoup
import pandas as pd

reviews = []  # collect the reviews from every page in a single list

for x in range(1, 9):
    # Page 1 has no suffix; pages 2 to 8 end in -page-x
    if x == 1:
        URL = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218"
    else:
        URL = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218-page-{}".format(x)

    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')

    # Use a CSS selector to extract all the review containers on this page
    review_divs = soup.select('div.col-10.review')
    for element in review_divs:
        review = {'Review_Title': element.a.text,
                  'URL': element.a['href'],
                  'Review': element.find('div', {'class': ['more', 'reviewdata']}).text.strip()}
        reviews.append(review)

# Build a single dataframe from the reviews of all 8 pages
df = pd.DataFrame(reviews)
print(df)
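
If you would rather keep one dataframe per page, a possible variant (same selector and URL pattern as above, plus a short pause so you don't hammer the server) is to collect the per-page frames in a list and join them at the end with pd.concat:

import time

import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218"
frames = []  # one dataframe per page

for x in range(1, 9):
    URL = BASE if x == 1 else "{}-page-{}".format(BASE, x)
    soup = BeautifulSoup(requests.get(URL).content, 'html5lib')
    page_reviews = [{'Review_Title': el.a.text,
                     'URL': el.a['href'],
                     'Review': el.find('div', {'class': ['more', 'reviewdata']}).text.strip()}
                    for el in soup.select('div.col-10.review')]
    frames.append(pd.DataFrame(page_reviews))
    time.sleep(1)  # be polite between requests

df = pd.concat(frames, ignore_index=True)  # one dataframe with all the reviews
print(df)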
