
Selenium/BeautifulSoup - Python - Loop Through Multiple Pages

I have spent the better part of my day researching and testing the best way to loop through a set of products on a retailer's website.

While I can successfully collect the set of products (and their attributes) on the first page, I am stumped on the best way to loop through the site's pages to continue my scrape.

Per my code below, I have attempted to use a 'while' loop and Selenium to click on the 'next page' button of the website and then continue to collect products.

The issue is that my code still doesn't get past page 1.

Am I making a silly error here? I've read four or five similar examples on this site, but none were specific enough to provide the solution.

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products.clear()
hyperlinks.clear()
reviewCounts.clear()
starRatings.clear()

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1


html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)

You need to re-parse each time you "click" to the next page, so that parsing has to happen inside your while loop. Otherwise you just keep iterating over the first page, even after clicking to the next one, because the prod_containers object never changes.

Secondly, as written your while loop will never stop: you set pageCounter = 0 but never increment it, so it will forever be < your maxPageCount.

I fixed those two things in the code and ran it; it worked and parsed pages 1 through 5.

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1

while (pageCounter < maxPageCount):
    # Re-parse the page source on every iteration so we scrape the page we just navigated to
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter +=1
    print(pageCounter)
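
One caveat with this pattern: driver.page_source can be read before the next page has finished loading after the click. A minimal sketch of one way to guard against that, assuming the click replaces the product grid in the DOM (the 10-second timeout is an arbitrary choice; the locators are the ones already used above):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_grid = driver.find_element_by_css_selector('li.products_grid')
driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
# Block until the old grid element is detached from the DOM, i.e. the new page has replaced it
WebDriverWait(driver, 10).until(EC.staleness_of(old_grid))
html_soup = BeautifulSoup(driver.page_source, 'html.parser')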

OK, this snippet will not run as-is from a .py file; I'm guessing you were running it in IPython or a similar environment where these variables were already initialized and the libraries already imported.

First off, you need to import the regex module:

import re

Also, all those clear() calls are unnecessary, since you initialize the lists right afterwards anyway (in fact, Python raises a NameError, because the lists haven't been defined yet when you call clear() on them).
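
For illustration (a hypothetical snippet, not from the original code), clear() only works on a name that already exists, while a plain assignment is all the initialization you need:

# products.clear()   # NameError: 'products' is not defined yet
products = []         # creates the empty list; nothing to clear
products.clear()      # valid now that the list exists, but redundant here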

You also need to initialize counterProduct:

counterProduct = 0

And finally, you have to assign a value to html_soup before referencing it in your code:

html_soup = BeautifulSoup(driver.page_source, 'html.parser')

Here is the corrected code, which works (it also re-parses each page inside the loop and increments pageCounter, as the earlier answer points out):

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
counterProduct = 0
while (pageCounter < maxPageCount):
    # Re-parse the page source on each pass so prod_containers reflects the page we just navigated to
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter += 1      # advance the page counter; without this the while loop never ends
    counterProduct += 1
    print(counterProduct)
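
One side note in case you run this on a current Selenium release: the find_element_by_* helpers were removed in Selenium 4, so the click would need the By-based form instead:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//*[@id="page-navigation-top"]/a[2]').click()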
