
Python Web Scraping | How to scrape data from multiple URLs by choosing the page number as a range with Beautiful Soup and Selenium?

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')
a = 'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page='
for c in range(8):

    #a = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={c}'

    cd = driver.get(a+str(c))

    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')

    fetch_data = bs.find_all('div', {'class': 's-expand-height.s-include-content-margin.s-latency-cf-section.s-border-bottom'})

    for f_data in fetch_data:
        product_name = f_data.find('span', {'class': 'a-size-medium.a-color-base.a-text-normal'})
        print(product_name + '\n')

The problem here is that the WebDriver successfully visits all 7 pages, but the script produces no output and no error.

I don't know where I'm going wrong.

Any suggestions, or a reference to an article that addresses this problem, would be very welcome.

You are not selecting the right div tag to fetch the products with BeautifulSoup, which is why there is no output.

Try the following snippet:

#range of pages (Amazon search pages start at 1)
for page in range(1, 20):

    driver.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={page}')
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')

    #get search results: each result is a div with data-component-type="s-search-result"
    products = bs.find_all('div', {'data-component-type': 's-search-result'})

    #for each product in the search results, print the product name
    for product in products:
        product_name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
        if product_name is not None:
            print(product_name.get_text())

You can print bs or fetch_data to debug.
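For instance, a minimal sanity check (a sketch reusing the bs and fetch_data names from the question) makes the selector problem visible right away:

print(len(fetch_data))   # likely 0: the dotted class string does not match the space-separated class tokens
print(bs.title)          # shows the <title> of the page that was actually loaded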

Anyway,

In my opinion, you can use requests or urllib to get the page source instead of Selenium.
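As a rough sketch of that approach (the User-Agent value below is an assumption; Amazon often blocks plain HTTP clients or returns a CAPTCHA page, so results are not guaranteed):

import requests
from bs4 import BeautifulSoup as Soup

#assumed browser-like header; Amazon may still refuse the request
headers = {'User-Agent': 'Mozilla/5.0'}

for page in range(1, 8):
    resp = requests.get(
        'https://www.amazon.com/s',
        params={'k': 'Mobile', 'i': 'amazon-devices', 'page': page},
        headers=headers,
    )
    bs = Soup(resp.text, 'html.parser')

    #same selector as in the Selenium answer above
    for product in bs.find_all('div', {'data-component-type': 's-search-result'}):
        name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
        if name is not None:
            print(name.get_text())

The params dict mirrors the query string of the original URL, so the pages fetched are the same ones the WebDriver was visiting.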
