
Python Web Scraping | How to scrape data from multiple URLs by choosing the page number as a range with Beautiful Soup and Selenium?

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')
a = 'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page='
for c in range(8):

    #a = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={c}'

    cd = driver.get(a+str(c))

    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')

    fetch_data = bs.find_all('div', {'class': 's-expand-height.s-include-content-margin.s-latency-cf-section.s-border-bottom'})

    for f_data in fetch_data:
        product_name = f_data.find('span', {'class': 'a-size-medium.a-color-base.a-text-normal'})
        print(product_name + '\n')

The problem here is that the WebDriver successfully visits all 7 pages, but the script produces no output and no error.

I don't know where I'm going wrong.

Any suggestions, or a reference to an article that addresses this problem, would be very welcome.

You are not selecting the right div tag to fetch the products with BeautifulSoup, which is why there is no output.

Try the following snippet:

#range of pages (Amazon search pages start at 1)
for page in range(1, 20):

    driver.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={page}')
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')

    #get search results: each result is a div with data-component-type="s-search-result"
    products = bs.find_all('div', {'data-component-type': 's-search-result'})

    #for each product in the search results, print the product name
    for product in products:
        product_name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
        if product_name is not None:
            print(product_name.get_text())

You can print bs or fetch_data to debug.
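For instance, a minimal sanity check (a sketch reusing the bs and fetch_data names from the question) makes the selector problem visible right away:

print(len(fetch_data))   # likely 0: the dotted class string does not match the space-separated class tokens
print(bs.title)          # shows the <title> of the page that was actually loaded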

Anyway,

In my opinion, you can use requests or urllib to get the page source instead of Selenium.
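As a rough sketch of that approach (the User-Agent value below is an assumption; Amazon often blocks plain HTTP clients or returns a CAPTCHA page, so results are not guaranteed):

import requests
from bs4 import BeautifulSoup as Soup

#assumed browser-like header; Amazon may still refuse the request
headers = {'User-Agent': 'Mozilla/5.0'}

for page in range(1, 8):
    resp = requests.get(
        'https://www.amazon.com/s',
        params={'k': 'Mobile', 'i': 'amazon-devices', 'page': page},
        headers=headers,
    )
    bs = Soup(resp.text, 'html.parser')

    #same selector as in the Selenium answer above
    for product in bs.find_all('div', {'data-component-type': 's-search-result'}):
        name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
        if name is not None:
            print(name.get_text())

The params dict mirrors the query string of the original URL, so the pages fetched are the same ones the WebDriver was visiting.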
