
BeautifulSoup find_all function doesn't work inside of main

I'm trying to scrape the website conforama using BeautifulSoup. I want to retrieve the price, description, rating, URL and number of reviews of each item, and to do so across 3 pages of results.

First, I import the required libraries:

import csv
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

I define a first function, get_url, that formats the URL with a specific search_term and returns a URL that is still waiting to be formatted with the right page number:

def get_url(search_term):
    template = 'https://www.conforama.fr/recherche-conforama/{}'
    
    search_term = search_term.replace(' ','+')
    
    url = template.format(search_term)
    
    url+= '?P1-PRODUCTS%5Bpage%5D={}'
    
    return url
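For instance, calling it with a two-word term (the term 'table basse' below is just an illustration) produces a URL with one unfilled placeholder left for the page number:

```python
# Standalone copy of get_url for illustration.
def get_url(search_term):
    template = 'https://www.conforama.fr/recherche-conforama/{}'
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    url += '?P1-PRODUCTS%5Bpage%5D={}'
    return url

url = get_url('table basse')
print(url)            # ...recherche-conforama/table+basse?P1-PRODUCTS%5Bpage%5D={}
print(url.format(2))  # the page number fills the remaining {}
```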

I define a second one to strip out characters that make the data unreadable:

def format_number(number):
    new_number = ''
    for n in number:
        if n not in '0123456789€,.':
            return new_number
        new_number += n
    return new_number  # also return when every character was allowed
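As a quick sanity check, here is a standalone copy of the helper with an explicit final return, so an all-numeric string such as '349€99' is not silently dropped (the sample inputs below are made up for illustration):

```python
def format_number(number):
    """Keep the leading run of digits and price punctuation, drop the rest."""
    new_number = ''
    for n in number:
        if n not in '0123456789€,.':
            return new_number
        new_number += n
    return new_number  # reached when every character was allowed

print(format_number('349€99'))        # 349€99
print(format_number('128 avis'))      # 128
print(format_number('1,99€ le lot'))  # 1,99€
```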

I define a third function that takes a record and extracts all the information I need from it: its price, description, URL, rating and number of reviews.

def extract_record(item):
    print(item)
    descriptions = item.find_all("a", {"class" : "bindEvent"})

    description = descriptions[1].text.strip() + ' ' + descriptions[2].text.strip()

    #get url of product
    url = descriptions[2]['href']
    print(url)

    #number of reviews
    nor = descriptions[3].text.strip()
    nor = format_number(nor)

    #rating
    try:
        ratings = item.find_all("span", {"class" : "stars"})
        rating = ratings[0]['data']
    except IndexError:  # no rating element on this item
        return

    #price
    try:
        prices = item.find_all("div", {"class" : "price-product"})
        price = prices[0].text.strip()
    except IndexError:  # no price element on this item
        return
    price = format_number(price)
    
    return (description, price, rating, nor, url)
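To see the find_all selector logic in isolation, you can run it against a small handcrafted snippet. The markup below is a simplified stand-in mimicking the class names the scraper looks for, not copied from the real site:

```python
from bs4 import BeautifulSoup

# Simplified stand-in markup with the classes extract_record expects.
html = '''
<li class="ais-Hits-item box-product fragItem">
  <a class="bindEvent" href="#">wishlist</a>
  <a class="bindEvent" href="/p/1">Canapé</a>
  <a class="bindEvent" href="/p/1">3 places gris</a>
  <a class="bindEvent" href="#reviews">12 avis</a>
  <span class="stars" data="4.5"></span>
  <div class="price-product">349€99</div>
</li>
'''

item = BeautifulSoup(html, 'html.parser').find('li')
descriptions = item.find_all('a', {'class': 'bindEvent'})
print(descriptions[1].text.strip(), '/', descriptions[2]['href'])           # description / url
print(item.find_all('span', {'class': 'stars'})[0]['data'])                 # rating
print(item.find_all('div', {'class': 'price-product'})[0].text.strip())    # price
```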

Finally, I gather all the functions inside a main function that lets me iterate over all the pages I need to extract from:

def main(search_term):
    #product_name = search_term
    
    driver = webdriver.Chrome(ChromeDriverManager().install())
    records = []
    url = get_url(search_term)
    somme = 0
    for page in range(1, 4):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        print('longueur soup', len(soup))
        print(soup)
        results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
        print(len(results))
        somme += len(results)
        for result in results:
            record = extract_record(result)
            if record:
                print(record)
                records.append(record)
    driver.close()
    print('somme',somme)

Now the problem is that when I run all the commands one by one:

driver = webdriver.Chrome(ChromeDriverManager().install())
url = get_url('couch').format(1)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
item = results[0]
extracted = extract_record(item)

everything works fine and the extract_record function returns exactly what I need it to. However, when I run the main function, this line of code:

results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})

does not return any results, even though I know it does when I execute it outside of the main function.

Has anyone had the same problem? Do you have any idea what I'm doing wrong and how to fix it? Thanks a lot for reading and trying to answer.

What happens?

The main issue is that the elements need some time to be generated/displayed, so they are not yet available at the moment you grab driver.page_source.

How to fix?

Use Selenium's waits until the presence of a specific element is located:

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})

Example

...
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

...

def main(search_term):
    #product_name = search_term
    
    driver = webdriver.Chrome(ChromeDriverManager().install())
    records = []
    url = get_url(search_term)
    somme = 0
    for page in range (1,4):
        driver.get(url.format(page))
        print(url.format(page))
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
        somme+=len(results)
        for result in results:
            record = extract_record(result)
            if record:
                print(record)
                records.append(record)
    driver.close()
    print('somme',somme)

main('matelas')
