
Scraping multiple URLs from multiple pages of the same website

I developed this program to scrape the name, price, and shipping cost of each PS4 on a page of newegg.com. However, since there are multiple pages with PS4s on them, how can I add multiple links to the source variable? Basically, I want to scrape multiple pages on newegg.com (e.g. the PS4 pages #1, #2, #3, etc.).

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text

soup = BeautifulSoup(source, 'lxml')

csv_file = open('newegg_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])


for info in soup.find_all('div', class_='item-container'):
    prod = info.find('a', class_='item-title').text.strip()
    # The current price can span several lines; keep the line containing '$'
    price_lines = info.find('li', class_='price-current').text.strip().replace(u'\xa0', '').splitlines()
    price = next((line for line in price_lines if u'$' in line), price_lines[0])
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod)
    print(price)
    print(ship)
    csv_writer.writerow([prod, price, ship])
    print('-----------')
csv_file.close()

I don't do PHP, but I have used Perl in the past for screen scraping.

If you look near the bottom of the page, there is a button bar for additional pages. You will find that page 2 and later pages have URLs of the form https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-2?PageSize=36&order=BESTMATCH

Simply make a loop that constructs the URLs, replacing Page-2 with Page-3, Page-4, etc.; then query, scrape, and repeat. Keep going until you don't get a response or the page no longer has the information you are looking for.
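That loop could be sketched as below. This is only a sketch: the page_url helper, the stop-on-empty-page heuristic, and the max_pages safety cap are my own assumptions, not part of the answer.

```python
from bs4 import BeautifulSoup
import requests

BASE_URL = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

def page_url(page_number):
    """Build the URL for a given results page (page 1 is the bare category URL)."""
    if page_number == 1:
        return BASE_URL
    return f'{BASE_URL}/Page-{page_number}?PageSize=36&order=BESTMATCH'

def scrape_all(max_pages=50):
    """Yield one BeautifulSoup per results page, stopping at the first empty page."""
    for page in range(1, max_pages + 1):
        resp = requests.get(page_url(page))
        if resp.status_code != 200:
            break  # no response: assume we've run past the last page
        soup = BeautifulSoup(resp.text, 'lxml')
        if not soup.find_all('div', class_='item-container'):
            break  # page no longer has the listings we're looking for
        yield soup
```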

TL;DR

Grab the number of pages (from the first page scraped) based on its selector, then iterate over that while including the page number in the source.

Explanation

  1. Visit the first page 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'
  2. Grab the items in the page (what your code does already)
  3. Grab the number of pages from that page with its selector, e.g. soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
  4. Return that number at the end
  5. If it's more than 1, iterate over the rest of the pages. For each iteration the source becomes 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-' + str(page_number)

Code

from bs4 import BeautifulSoup
import requests
import csv

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

# Grab the number of pages
def get_pages_number(soup):
    pages_number = soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
    return int(pages_number)

# Your code + dynamic URL + return number of pages
def scrape_page(page_number=1):
    # Make the source "dynamic" based on the page number
    source = requests.get(f'{base_url}/Page-{page_number}').text
    soup = BeautifulSoup(source, 'lxml')

    # Soup processing goes here
    # You can use the code you posted to grab the price, etc...

    return get_pages_number(soup)

# Main function
if __name__ == '__main__':
    pages_number = scrape_page()

    # If there are more pages, we scrape them
    if pages_number > 1:
        for i in range(1, pages_number):
            scrape_page(i + 1)
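To fill in the "Soup processing goes here" placeholder, the per-page extraction from the question can be combined with the pagination above. A minimal sketch: the parse_items helper, the simplified price handling, and the while-loop structure are my own additions, not from either post.

```python
import csv

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

def parse_items(soup):
    """Extract [product, price, shipping] rows from one page's soup."""
    rows = []
    for info in soup.find_all('div', class_='item-container'):
        prod = info.find('a', class_='item-title').text.strip()
        # The price may span several lines; keep the line containing '$'
        lines = info.find('li', class_='price-current').text.strip().replace('\xa0', '').splitlines()
        price = next((line for line in lines if '$' in line), lines[0])
        ship = info.find('li', class_='price-ship').text.strip()
        rows.append([prod, price, ship])
    return rows

def scrape_to_csv(filename='newegg_scrape.csv'):
    """Scrape every page, reading the total page count from the first page."""
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Product', 'Price', 'Shipping_info'])
        page, total_pages = 1, 1
        while page <= total_pages:
            source = requests.get(f'{base_url}/Page-{page}').text
            soup = BeautifulSoup(source, 'lxml')
            writer.writerows(parse_items(soup))
            total_pages = int(soup.find('div', class_='list-tool-pagination')
                                  .find('strong').text.split('/')[1])
            page += 1
```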
