
Why does BeautifulSoup give me the wrong text?

I've been trying to get the availability status of a product on IKEA's website. The page itself says, in Dutch: 'not available for delivery', 'only available in the shop', 'not in stock' and 'you've got 365 days of warranty'.

But my code gives me: 'not available for delivery', 'only available for order and pickup', 'checking inventory' and 'you've got 365 days of warranty'.

What am I doing wrong that causes the text to differ?

This is my code:

import requests
from bs4 import BeautifulSoup

# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
thepage = requests.get(url)
soup = BeautifulSoup(thepage.text, 'lxml')

# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {'class' : 'range-revamp-product-availability'})

# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)

With the help of Rajesh, I wrote a script that does exactly what I want. It selects a particular shop (the one in Heerlen), keeps checking an out-of-stock item, and sends an email as soon as the item is back in stock.

The script used for this is:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import smtplib, ssl

# Fill in the url of the product
url = 'https://www.ikea.com/nl/nl/p/vittsjo-stellingkast-zwartbruin-glas-20213312/'

op = webdriver.ChromeOptions()
op.add_argument('headless')
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('/Users/Jem/Downloads/chromedriver'), options=op)

# Stuff for sending the email
port = 465
password = 'password'
sender_email = 'email'
receiver_email = 'email'
# The Subject header must start at column 0 and be followed by a blank line
message = """\
Subject: Product is back in stock!

Sent with Python."""

# Keep looping until back in stock
while True:
    driver.get(url)

    # Go to the location of the shop
    btn = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')))
    btn.click()

    location = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div/div/div/div[2]/div[3]/div/div[5]/div[3]/div/span[1]/div/span/a')))
    location.click()

    differentlocation = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[1]/div[2]/a')))
    differentlocation.click()

    searchbar = driver.find_element(By.XPATH, '//*[@id="change-store-input"]')
    # In this part you can choose the location you want to check
    searchbar.send_keys('heerlen')

    heerlen = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[3]/div')))
    heerlen.click()

    selecteer = driver.find_element(By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[3]/button')
    selecteer.click()

    close = driver.find_element(By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[1]/button')
    close.click()

    # After you went to the right page, parse it with BeautifulSoup
    source = driver.page_source

    soup = BeautifulSoup(source, 'lxml')

    # Locate the part where the availability stuff is
    availabilitypanel = soup.find('div', {"class" : "range-revamp-product-availability"})

    # Get the text of the things inside of that panel
    availabilitysectiontext = [part.getText() for part in availabilitypanel]

    # Check whether it is still out of stock, if so wait half an hour and continue
    if 'Niet op voorraad in Heerlen' in availabilitysectiontext:
        time.sleep(1800)
        continue

    # If not, send me an email that it is back in stock
    else:
        print('Email is being sent...')
        context = ssl.create_default_context()
        with smtplib.SMTP_SSL('smtp.gmail.com', port, context=context) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, message)
        break
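
A hand-written message string is easy to get subtly wrong: any whitespace before `Subject:` stops it from being read as a header. As a minimal sketch, assuming you are happy to build the mail with the standard library's `email.message.EmailMessage`, the alert could be assembled like this. The function name `build_stock_alert` and the example addresses are made up for illustration.

```python
from email.message import EmailMessage


def build_stock_alert(sender, receiver, product_url):
    """Build the 'back in stock' mail with proper headers (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "Product is back in stock!"
    msg["From"] = sender
    msg["To"] = receiver
    # The body is kept separate from the headers, so indentation cannot break them
    msg.set_content(f"The product is available again:\n{product_url}\n\nSent with Python.")
    return msg


# Placeholder addresses, not real accounts
alert = build_stock_alert(
    "sender@example.com",
    "receiver@example.com",
    "https://www.ikea.com/nl/nl/p/vittsjo-stellingkast-zwartbruin-glas-20213312/",
)
print(alert["Subject"])
```

Inside the `with smtplib.SMTP_SSL(...)` block, `server.send_message(alert)` would then take the place of `server.sendmail(...)`.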

The availability markup is filled in by JavaScript after the initial server response. requests only fetches that initial response, and BeautifulSoup only parses the HTML it is handed. Neither of them executes JavaScript, so you get the placeholder text instead of the final text. If you want the JavaScript to run, you need a real browser, for example a headless one. Otherwise, you have to work out what the JavaScript does and reproduce it yourself.
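
To make this concrete, here is a small sketch with made-up HTML (not IKEA's real markup): the server sends a placeholder, and a script is supposed to replace it later. The sketch uses the standard library's `html.parser` so it runs without extra packages. BeautifulSoup behaves the same way here, because it also only parses the text it is given. The class name `AvailabilityText` and the placeholder strings are assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the HTML the server sends before any script runs
INITIAL_HTML = """
<div class="range-revamp-product-availability">
  <span>Checking inventory...</span>
</div>
<script>
  document.querySelector('.range-revamp-product-availability span')
          .textContent = 'Not in stock';
</script>
"""


class AvailabilityText(HTMLParser):
    """Collect the text found inside the availability div."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside the availability div
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and ("class", "range-revamp-product-availability") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


parser = AvailabilityText()
parser.feed(INITIAL_HTML)
# The script is never executed, so only the placeholder text is collected
print(parser.texts)
```

The parser treats the `<script>` element as plain text and never runs it, which is why the placeholder is all a static parser can see.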

You could get this to work with Selenium. I modified your code a bit and got it to work.

Get Selenium:

pip3 install selenium

Download Firefox + geckodriver or Chrome + chromedriver, then run:

from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'

# uncomment the following two lines if using firefox + geckodriver
#from selenium.webdriver.firefox.service import Service as FirefoxService
#driver = webdriver.Firefox(service=FirefoxService('/Users/ralwar/Downloads/geckodriver')) # Downloaded from https://github.com/mozilla/geckodriver/releases

# using chrome + chromedriver (Selenium 4 takes the driver path through a Service object)
op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(service=Service('/Users/ralwar/Downloads/chromedriver'), options=op) # Downloaded from https://chromedriver.chromium.org/downloads

driver.get(url)
time.sleep(5)   # give the page and its JavaScript time to finish loading; adjust as needed
source = driver.page_source

soup = BeautifulSoup(source, 'lxml')

# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {"class" : "range-revamp-product-availability"})

# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)

The above code prints:

['Niet beschikbaar voor levering', 'Alleen beschikbaar in de winkel', 'Niet op voorraad in Amersfoort', 'Je hebt 365 dagen om van gedachten te veranderen. ']
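
The fixed `time.sleep(5)` is a guess: too short on a slow connection, wasted time on a fast one. A common alternative is to poll until the text you are waiting for actually appears. The sketch below shows the idea in plain Python. The helper `wait_until` and the `FakePage` stand-in are hypothetical, written only so the example runs without a browser.

```python
import time


def wait_until(predicate, timeout=20.0, interval=0.5):
    """Call predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakePage:
    """Stand-in for a page whose availability text shows up on the third poll."""

    def __init__(self):
        self.polls = 0

    def availability_text(self):
        self.polls += 1
        return "Niet op voorraad in Amersfoort" if self.polls >= 3 else ""


page = FakePage()
text = wait_until(page.availability_text, timeout=2.0, interval=0.01)
print(text)
```

With Selenium you do not need to write this yourself: `WebDriverWait(driver, 20).until(...)`, as used in the question's second script, polls in the same way and accepts any callable that takes the driver.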
