
Scraping from specific website has stopped working

A couple of weeks ago I wrote this program, which successfully scraped some info from an online store, but now it has stopped working even though I haven't changed the code.

Could this be something that has been changed within the website itself or is there something wrong with my code?

import requests
from bs4 import BeautifulSoup

url = 'https://www.continente.pt/stores/continente/pt-pt/public/Pages/ProductDetail.aspx?ProductId=7104665(eCsf_RetekProductCatalog_MegastoreContinenteOnline_Continente)'

res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

priceInfo = soup.find('div', class_='pricePerUnit').text

priceInfo = priceInfo.replace('\n', '').replace('\r', '').replace(' ', '')

productName = soup.find('div', class_='productTitle').text.replace('\n', ' ')

productInfo = (soup.find('div', class_='productSubtitle').text
               + ', ' + soup.find('div', class_='productSubsubtitle').text)

print('Nome do produto: ' + productName)
print('Detalhes: ' + productInfo)
print('Custo: ' + priceInfo)

I know for a fact that what I'm searching for does exist and the URL is still valid, so what could be the issue? I split the priceInfo assignment into two lines because the error occurs in the first one: soup.find returns None, which has no text attribute.
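One way to narrow this down is to look at what the server actually sends back before parsing it. The sketch below is only an illustration: `text_or_default` and `diagnose` are hypothetical helper names, not part of the original program, and the 10-second timeout is an arbitrary choice. It prints the status code and, when the expected div is missing, the start of the response body.

```python
def text_or_default(tag, default=None):
    """Return the stripped text of a BeautifulSoup tag, or `default` if find() gave None."""
    if tag is None:
        return default
    return tag.get_text(strip=True)


def diagnose(url):
    """Fetch `url` and report whether the expected price div is present."""
    # Imported here so text_or_default can be used without these packages installed.
    import requests
    from bs4 import BeautifulSoup

    res = requests.get(url, timeout=10)
    print('Status code:', res.status_code)
    print('Response length:', len(res.content))

    soup = BeautifulSoup(res.content, 'html.parser')
    price = text_or_default(soup.find('div', class_='pricePerUnit'))
    if price is None:
        # The markup we expect is not in the response; show what came back instead.
        print('No div with class pricePerUnit found. Start of the response:')
        print(res.text[:500])
    else:
        print('Custo: ' + price)
```

Calling `diagnose(url)` with the URL from the question should show whether the page returned to the script is the same one the browser displays.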

The solution takes a few steps.

  1. Open the page you want to scrape in Firefox once.
  2. Use the browser_cookie3 library to extract the cookies.
  3. Ensure they have not expired.
  4. Pass the cookies to the request: requests.get(url, cookies=browser_cookie3.firefox())
  5. Send the headers shown below.

I have tried this myself and it works. Happy scraping!

headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'en-US,en;q=0.9,de;q=0.8',
}
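Putting the steps together might look roughly like the sketch below. It assumes the page has already been opened in Firefox on the same machine and that browser_cookie3 is installed. `drop_expired` and `fetch_price` are hypothetical helper names, the headers dict is shortened to two entries for brevity, and the timeout is an arbitrary choice.

```python
import time
from http.cookiejar import CookieJar

# Shortened for the sketch; the full set of headers above can be used instead.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9,de;q=0.8',
}


def drop_expired(cookies, now=None):
    """Copy the cookies that have not expired into a fresh CookieJar (step 3)."""
    now = time.time() if now is None else now
    fresh = CookieJar()
    for cookie in cookies:
        # Session cookies (expires is None) never count as expired.
        if not cookie.is_expired(now):
            fresh.set_cookie(cookie)
    return fresh


def fetch_price(url):
    """Request `url` with Firefox cookies and browser-like headers (steps 2, 4 and 5)."""
    # Third-party imports are kept local so drop_expired works without them.
    import browser_cookie3
    import requests
    from bs4 import BeautifulSoup

    cookies = drop_expired(browser_cookie3.firefox())
    res = requests.get(url, headers=headers, cookies=cookies, timeout=10)
    soup = BeautifulSoup(res.content, 'html.parser')
    tag = soup.find('div', class_='pricePerUnit')
    # Return None rather than raising if the div is still missing.
    return tag.get_text(strip=True) if tag is not None else None
```

If `fetch_price(url)` still returns None, the cookies have probably expired and the page needs to be opened in Firefox again.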


 