
Scraping multiple urls from same website multiple pages

I developed this program to scrape the name, price, and shipping cost of each ps4 listed on a page of newegg.com. However, since there are multiple pages with ps4's on them, how can I add multiple links to the source variable? Basically, I want to scrape multiple pages on newegg.com (e.g. the ps4 pages #1, #2, #4, etc.).

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text

soup = BeautifulSoup(source, 'lxml')

csv_file = open('newegg_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])


for info in soup.find_all('div', class_='item-container'):
    prod = info.find('a', class_='item-title').text.strip()
    price = info.find('li', class_='price-current').text.strip().splitlines()[1].replace(u'\xa0', '')
    if  u'$' not in price:
        price = info.find('li', class_='price-current').text.strip().splitlines()[0].replace(u'\xa0', '')
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod)
    print(price)
    print(ship)
    csv_writer.writerow([prod, price, ship])

    # print(price.splitlines()[1])
    print('-----------')
csv_file.close()

I don't do PHP, but I have used Perl in the past to perform screen scraping.

If you notice, down near the bottom of the page there is a button bar for additional pages. You will find page 2 and the additional URLs to be of the format https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-2?PageSize=36&order=BESTMATCH

Simply make a loop to construct the URLs, replacing Page-2 with Page-3, Page-4, and so on, then query, scrape, and repeat. I guess you just keep going until you don't get a response or the page no longer has the information you are looking for.
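For example, a minimal sketch of such a loop in Python (reusing the item-container markup from the question; the two stop conditions are just the ones suggested above) might look like this:

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

page = 1
while True:
    # Page 1 is the base category URL; later pages append /Page-N
    url = base_url if page == 1 else f'{base_url}/Page-{page}'
    response = requests.get(url)
    if response.status_code != 200:
        break  # stop when the server no longer responds successfully

    soup = BeautifulSoup(response.text, 'lxml')
    items = soup.find_all('div', class_='item-container')
    if not items:
        break  # stop when the page no longer lists any products

    for info in items:
        # The per-item scraping from the question goes here
        prod = info.find('a', class_='item-title')
        if prod:
            print(prod.text.strip())

    page += 1

Whether you stop on a failed response or on an empty page is a judgement call; checking both is cheap.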

TL;DR

Grab the number of pages (from the first page you scrape) using its selector, then iterate over that range, including the page number in the source URL.

Explanation

  1. Visit the first page 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'
  2. Grab the items on the page (what your code does already)
  3. Grab the number of pages from that page with its selector, like this: soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
  4. Return that number at the end
  5. If it's more than 1, iterate over the rest of the pages. For each iteration the source becomes 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-' + page_number

Code

from bs4 import BeautifulSoup
import requests
import csv

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

# Grab the number of pages
def get_pages_number(soup):
    pages_number = soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
    return int(pages_number)

# Your code + dynamic URL + return number of pages
def scrape_page(page_number=1):
    # Make the source "dynamic" based on the page number
    source = requests.get(f'{base_url}/Page-{page_number}').text
    soup = BeautifulSoup(source, 'lxml')

    # Soup processing goes here
    # You can use the code you posted to grab the price, etc...

    return get_pages_number(soup)

# Main function
if __name__ == '__main__':
    pages_number = scrape_page()

    # If there are more pages, we scrape them
    if pages_number > 1:
        for i in range(1, pages_number):
            scrape_page(i + 1)
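
If you want to fill in the "soup processing goes here" part, one possible arrangement is sketched below. The shared csv_writer argument, the with block, and the guard against title-less containers are assumptions layered on top of the question's code, not part of the original answer:

from bs4 import BeautifulSoup
import requests
import csv

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

def scrape_page(csv_writer, page_number=1):
    source = requests.get(f'{base_url}/Page-{page_number}').text
    soup = BeautifulSoup(source, 'lxml')

    for info in soup.find_all('div', class_='item-container'):
        title_tag = info.find('a', class_='item-title')
        if title_tag is None:
            continue  # skip containers without a product title (e.g. ads)
        prod = title_tag.text.strip()

        # Same price extraction as in the question, with an index guard so
        # single-line prices don't raise an IndexError
        price_lines = info.find('li', class_='price-current').text.strip().splitlines()
        price = price_lines[1] if len(price_lines) > 1 and u'$' in price_lines[1] else price_lines[0]
        price = price.replace(u'\xa0', '')
        ship = info.find('li', class_='price-ship').text.strip()

        csv_writer.writerow([prod, price, ship])

    # Number of pages, taken from the pagination widget as in the answer
    return int(soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1])

if __name__ == '__main__':
    with open('newegg_scrape.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['Product', 'Price', 'Shipping_info'])

        pages_number = scrape_page(csv_writer)    # page 1
        for page in range(2, pages_number + 1):   # remaining pages
            scrape_page(csv_writer, page)

Passing the writer in keeps a single CSV file open for the whole run instead of reopening it for every page.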
