Scraping multiple URLs from the same website across multiple pages
I developed this program to scrape the name, price, and shipping cost of each PS4 on a page on newegg.com. However, since there are multiple pages with PS4s on them, how can I add multiple links to the source variable? Basically, I want to scrape multiple pages on newegg.com (e.g. the PS4 page #1, #2, #4, etc.).
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('newegg_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])

for info in soup.find_all('div', class_='item-container'):
    prod = info.find('a', class_='item-title').text.strip()
    price = info.find('li', class_='price-current').text.strip().splitlines()[1].replace(u'\xa0', '')
    if u'$' not in price:
        price = info.find('li', class_='price-current').text.strip().splitlines()[0].replace(u'\xa0', '')
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod)
    print(price)
    print(ship)
    csv_writer.writerow([prod, price, ship])
    # print(price.splitlines()[1])
    print('-----------')

csv_file.close()
I don't do PHP, but I have used Perl in the past to perform screen scraping.
If you look near the bottom of the page, there is a button bar for additional pages. You will find that page 2 and the subsequent URLs have the format https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-2?PageSize=36&order=BESTMATCH
Simply make a loop that constructs the URLs, replacing Page-2 with Page-3, Page-4, and so on, then query, scrape, and repeat. Keep going until you no longer get a response, or until the page no longer has the information you are looking for.
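The loop described above might be sketched as follows. This is a minimal illustration, not the answerer's code: `page_url` and `fetch_pages` are hypothetical helper names, and the `max_pages` cap is an added safety limit that the answer does not mention.

```python
import requests

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

def page_url(page_number):
    """Build the URL for one results page; page 1 is the bare subcategory URL."""
    if page_number == 1:
        return base_url
    return f'{base_url}/Page-{page_number}?PageSize=36&order=BESTMATCH'

def fetch_pages(max_pages=50):
    """Yield each page's HTML until a request fails or max_pages is reached."""
    for page in range(1, max_pages + 1):
        response = requests.get(page_url(page))
        if response.status_code != 200:
            break  # no more pages, or the site refused the request
        yield response.text
```

Each string yielded by `fetch_pages()` can then be fed to BeautifulSoup and scraped exactly as in the question's code.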
Grab the number of pages (from the first page scraped) based on its selector, then iterate over that while including the page number in the source URL.
'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'
soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-' + page_number
from bs4 import BeautifulSoup
import requests
import csv

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

# Grab the number of pages
def get_pages_number(soup):
    pages_number = soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
    return int(pages_number)

# Your code + dynamic URL + return number of pages
def scrape_page(page_number=1):
    # Make the source "dynamic" based on the page number
    source = requests.get(f'{base_url}/Page-{page_number}').text
    soup = BeautifulSoup(source, 'lxml')
    # Soup processing goes here
    # You can use the code you posted to grab the price, etc...
    return get_pages_number(soup)

# Main function
if __name__ == '__main__':
    pages_number = scrape_page()
    # If there are more pages, we scrape them
    if pages_number > 1:
        for i in range(1, pages_number):
            scrape_page(i + 1)
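Filling in the "soup processing goes here" step with the row-extraction code from the question, the combined scraper might look like this. It is a sketch under assumptions: `scrape_rows` and `scrape_all` are hypothetical names, the CSS selectors are copied from the question and may break if Newegg changes its markup, and the price fallback mirrors the question's two-line handling.

```python
from bs4 import BeautifulSoup
import requests
import csv

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

def get_pages_number(soup):
    """Read the total page count from the pagination widget."""
    return int(soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1])

def scrape_rows(soup):
    """Yield (product, price, shipping) tuples from one page's soup."""
    for info in soup.find_all('div', class_='item-container'):
        prod = info.find('a', class_='item-title').text.strip()
        lines = info.find('li', class_='price-current').text.strip().splitlines()
        # The price is sometimes on the second line; fall back to the first
        price = lines[1] if len(lines) > 1 and '$' in lines[1] else lines[0]
        ship = info.find('li', class_='price-ship').text.strip()
        yield prod, price.replace('\xa0', ''), ship

def scrape_all(out_path='newegg_scrape.csv'):
    """Scrape every page of the subcategory into one CSV file."""
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Product', 'Price', 'Shipping_info'])
        soup = BeautifulSoup(requests.get(base_url).text, 'lxml')
        for page in range(1, get_pages_number(soup) + 1):
            if page > 1:  # page 1 was already fetched above
                soup = BeautifulSoup(requests.get(f'{base_url}/Page-{page}').text, 'lxml')
            writer.writerows(scrape_rows(soup))
```

Calling `scrape_all()` fetches page 1 once, reads the page count from it, and then fetches the remaining pages, avoiding the double fetch of page 1 in the skeleton above.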