Scraping multiple URLs from multiple pages of the same website

Question

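I wrote this program to scrape the name, price, and shipping cost of every PS4 product listed on a newegg.com page. However, the PS4 listings span several pages, so how do I add multiple links to the source variable? Basically, I want to scrape multiple pages on newegg.com (e.g. PS4 pages #1, #2, #4, etc.).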

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text

soup = BeautifulSoup(source, 'lxml')

# newline='' stops the csv module from writing blank rows on Windows
csv_file = open('newegg_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])


for info in soup.find_all('div', class_='item-container'):
    prod = info.find('a', class_='item-title').text.strip()
    price = info.find('li', class_='price-current').text.strip().splitlines()[1].replace(u'\xa0', '')
    if  u'$' not in price:
        price = info.find('li', class_='price-current').text.strip().splitlines()[0].replace(u'\xa0', '')
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod)
    print(price)
    print(ship)
    csv_writer.writerow([prod, price, ship])

    # print(price.splitlines()[1])
    print('-----------')
csv_file.close()

Answer 1

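I don't use PHP, but I have used Perl for screen scraping in the past.

Near the bottom of the page there is a button bar that links to the other pages. You will find that page 2 and the other pages have URLs of the form https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-2?PageSize=36&order=BESTMATCH

Just write a loop that builds the URL, replacing Page-2 with Page-3, Page-4, and so on, then query, scrape, and repeat. I imagine you would keep going until you get no response or the page no longer contains the information you are looking for.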
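
As a rough sketch of that loop in Python (the language used in the question): `page_url`, `scrape_all_pages`, the `fetch_items` callable, and the `max_pages` cap are illustrative names, not part of any library. The sketch assumes that Page-1 resolves to the first page and that a page past the end yields no items.

```python
from typing import Callable, Iterator, List

BASE_URL = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'


def page_url(page_number: int) -> str:
    """Build the listing URL for a 1-based page number."""
    return f'{BASE_URL}/Page-{page_number}?PageSize=36&order=BESTMATCH'


def scrape_all_pages(fetch_items: Callable[[str], List[dict]],
                     max_pages: int = 50) -> Iterator[dict]:
    """Yield items page by page until a page comes back empty.

    fetch_items is a hypothetical callable: it takes a URL and returns the
    items parsed from that page (an empty list when nothing was found).
    max_pages is a safety cap so the loop cannot run forever.
    """
    for page_number in range(1, max_pages + 1):
        items = fetch_items(page_url(page_number))
        if not items:
            break  # no more results: stop requesting further pages
        yield from items
```

In practice `fetch_items` would wrap the `requests.get` call and the BeautifulSoup parsing from the question, returning an empty list when the request fails or no item containers are found.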

Answer 2

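TL;DR

Get the number of pages from its selector (on the first page you scrape), then iterate over that many pages, including the page number in the source URL.

Explanation

Visit the first page, 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'
Scrape the items on that page (what your code already does)
Get the number of pages from that page using its selector, for example soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
Return that number at the end
If it is greater than 1, iterate over the remaining pages. On each iteration the source becomes 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-' + str(page_number)

Code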
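
The page-count step is the fragile part: if the pagination element is missing (for example when there is only one page of results), the chained `.find(...)` calls raise an AttributeError. Here is a slightly more defensive sketch. `parse_page_count` is a made-up helper name, and the sketch assumes the label has the form "current/total" (such as "1/4"), as the selector above implies.

```python
def parse_page_count(label: str) -> int:
    """Return the total from a 'current/total' label, or 1 if it cannot be read."""
    parts = label.split('/')
    if len(parts) != 2:
        return 1
    total = parts[1].strip()
    return int(total) if total.isdigit() else 1


def get_pages_number(soup) -> int:
    """Read the page count from a BeautifulSoup document, defaulting to 1."""
    pagination = soup.find('div', class_='list-tool-pagination')
    strong = pagination.find('strong') if pagination else None
    return parse_page_count(strong.text if strong else '')
```

Falling back to 1 means the scraper still handles the first page when the site changes its markup, instead of crashing before writing anything.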

from bs4 import BeautifulSoup
import requests
import csv

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

# Grab the number of pages
def get_pages_number(soup):
    pages_number = soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
    return int(pages_number)

# Your code + dynamic URL + return number of pages
def scrape_page(page_number=1):
    # Make the source "dynamic" based on the page number
    source = requests.get(f'{base_url}/Page-{page_number}').text
    soup = BeautifulSoup(source, 'lxml')

    # Soup processing goes here
    # You can use the code you posted to grab the price, etc...

    return get_pages_number(soup)

# Main function
if __name__ == '__main__':
    pages_number = scrape_page()

    # Scrape the remaining pages, if any (the range is empty when there is only one page)
    for page_number in range(2, pages_number + 1):
        scrape_page(page_number)
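
For the "Soup processing goes here" part, one possible way to fold in the extraction code from the question is sketched below. `extract_price`, `parse_items`, and `write_rows` are hypothetical helper names. The class names are the ones used in the question and may no longer match the live site. Picking the first line that contains a dollar sign is a simplification of the question's line-index logic.

```python
import csv
from typing import Iterable, List, TextIO

HEADER = ['Product', 'Price', 'Shipping_info']


def extract_price(raw: str) -> str:
    """Return the first line of the price text that contains a dollar amount."""
    for line in raw.strip().splitlines():
        cleaned = line.replace('\xa0', '').strip()
        if '$' in cleaned:
            return cleaned
    return ''


def parse_items(soup) -> List[List[str]]:
    """Collect [product, price, shipping] rows from one parsed listing page."""
    rows = []
    for info in soup.find_all('div', class_='item-container'):
        title = info.find('a', class_='item-title')
        price = info.find('li', class_='price-current')
        ship = info.find('li', class_='price-ship')
        if not (title and price and ship):
            continue  # skip containers that lack one of the three fields
        rows.append([title.text.strip(), extract_price(price.text), ship.text.strip()])
    return rows


def write_rows(out: TextIO, rows: Iterable[List[str]]) -> None:
    """Write the header and all rows to an already opened text file."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(rows)
```

The idea is to have scrape_page return parse_items(soup) alongside the page count, extend one list with the rows from every page, and call write_rows once at the end on a file opened with open('newegg_scrape.csv', 'w', newline=''), so that later pages do not overwrite the rows from earlier ones.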
