I developed this program to scrape the name, price, and shipping cost of each PS4 listed on a page of newegg.com. However, since there are multiple pages with PS4s on them, how can I add multiple links to the source variable? Basically, I want to scrape multiple pages on newegg.com (e.g., the PS4 page #1, #2, #4, etc.).
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('newegg_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])

for info in soup.find_all('div', class_='item-container'):
    prod = info.find('a', class_='item-title').text.strip()
    price = info.find('li', class_='price-current').text.strip().splitlines()[1].replace(u'\xa0', '')
    if u'$' not in price:
        price = info.find('li', class_='price-current').text.strip().splitlines()[0].replace(u'\xa0', '')
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod)
    print(price)
    print(ship)
    csv_writer.writerow([prod, price, ship])
    # print(price.splitlines()[1])
    print('-----------')

csv_file.close()
I don't do PHP, but I have used Perl in the past to perform screen scraping.
If you look near the bottom of the page, there is a button bar for additional pages. You will find that page 2 and subsequent URLs are of the format https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-2?PageSize=36&order=BESTMATCH
Simply make a loop to construct the URLs, replacing Page-2 with Page-3, Page-4, etc.; query, scrape, repeat. Keep going until you don't get a response or until the page no longer has the information you are looking for.
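The loop described above could be sketched like this. The URL pattern is taken from the answer; the stop conditions (a non-200 response, or a page with no item containers) are assumptions and may need adjusting against the site's actual behavior:

```python
from bs4 import BeautifulSoup
import requests

BASE_URL = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

def page_url(page_number):
    """Build the URL for a given results page (pattern from the answer above)."""
    return f'{BASE_URL}/Page-{page_number}?PageSize=36&order=BESTMATCH'

def scrape_all_pages():
    """Query, scrape, repeat until a page stops returning product listings."""
    page = 1
    while True:
        response = requests.get(page_url(page))
        if response.status_code != 200:  # no response for this page
            break
        soup = BeautifulSoup(response.text, 'lxml')
        items = soup.find_all('div', class_='item-container')
        if not items:  # page no longer has the information we want
            break
        for info in items:
            print(info.find('a', class_='item-title').text.strip())
        page += 1
```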
Grab the number of pages (from the first page scraped) based on its selector, then iterate over that range, including the page number in the source URL.
'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'
soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-' + page_number
from bs4 import BeautifulSoup
import requests
import csv

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

# Grab the number of pages
def get_pages_number(soup):
    pages_number = soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
    return int(pages_number)

# Your code + dynamic URL + return number of pages
def scrape_page(page_number=1):
    # Make the source "dynamic" based on the page number
    source = requests.get(f'{base_url}/Page-{page_number}').text
    soup = BeautifulSoup(source, 'lxml')

    # Soup processing goes here
    # You can use the code you posted to grab the price, etc...

    return get_pages_number(soup)

# Main function
if __name__ == '__main__':
    pages_number = scrape_page()
    # If there are more pages, we scrape them
    if pages_number > 1:
        for i in range(1, pages_number):
            scrape_page(i + 1)