
Scrape with BeautifulSoup: the <a> element is not present in the result

Hello, I am trying to scrape the URLs from a page so I can iterate over the catalogue and get all the information (title, description, etc.). I have done this many times before, but on this site the information seems to be blocked somewhere. This is what I'm doing:

import requests
import pandas as pd
from bs4 import BeautifulSoup



url = 'https://www.stihl.fr/fr/p/tronconneuses-ms-180-54057#ms-180-54057'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, 'lxml')
link = soup.find('a', class_='m_category-overview-tiles__item tile_product-standard')
print(link)

The HTML looks like this:

<div class="m_category-overview-tiles__products animated faster fadeIn"><a href="/fr/p/tronconneuses-ms-180-54057#ms-180-54057" class="m_category-overview-tiles__item tile_product-standard" data-test-id="product-tile-link"><div class="tile_product-standard__wrapper "><div class="tile_product-standard__image-wrapper"><div class="tile_product-standard__image-ratio"><picture class="tile_product-standard__image">

If someone can help unblock this: the element seems to be blocked somewhere and I cannot get it in the soup.

Thank you.

The data you see is rendered dynamically via JavaScript, so BeautifulSoup doesn't see it (it is embedded in the tags in JSON form). This example parses some information about accessories from that page:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.stihl.fr/fr/p/tronconneuses-ms-180-54057"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = soup.select_one("product-accessories-component")["data-initial-state"]
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for r in data["results"]:
    print(r["headline"], "-", r["name"])

Prints:

STIHL Smart Connector - STIHL Smart Connector
Casque FUNCTION Basic - Casque FUNCTION Basic
Lunettes de protection DYNAMIC Contrast claires - Lunettes de protection DYNAMIC Contrast claires
Protège-oreilles CONCEPT 23 - Protège-oreilles CONCEPT 23
Salopette FUNCTION universal - Anti-coupures - Salopette FUNCTION universal
Gants DYNAMIC Duro - Gants DYNAMIC Duro
Pantalon FUNCTION Universal - Anti-coupures - Pantalon FUNCTION Universal
Chaussures anti-coupures FUNCTION - Chaussures anti-coupures FUNCTION
Carburant prêt à l'emploi pour moteurs 2-temps et 4-MIX® - Carburant MotoMix
Huile adhésive pour chaîne de tronçonneuse - Huile ForestPlus
Huile pour moteurs STIHL - Huile HP Super
Accessoire pour tronçonneuses - Clé multiple
Accessoire pour tronçonneuses - Outils Multifonctions

Concerning your comment:

Hello, thank you for this quick answer, but I made a mistake: the URL is stihl.fr/fr/c/tronconneuses-98176. That one is the main category, not the product. From the main category I want to scrape each product (once I get the links, I think I can handle the rest). Thank you again.

What happens?

You try to scrape all links to the product details, but find() returns only the first result.

How to fix?

Try find_all() instead to get all link elements as a ResultSet:

soup.find_all('a', class_='m_category-overview-tiles__item tile_product-standard')

To store only the href values in a list, you can do the following:

links = [link['href'] for link in soup.find_all('a', class_='m_category-overview-tiles__item tile_product-standard')]

Example

import requests
from bs4 import BeautifulSoup

url = 'https://www.stihl.fr/fr/c/tronconneuses-98176'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, 'lxml')
links = [link['href'] for link in soup.find_all('a', class_='m_category-overview-tiles__item tile_product-standard')]
print(links)

Output

['/fr/p/tronconneuses-ms-180-54057#ms-180-54057', '/fr/p/tronconneuses-ms-170-2323#ms-170-2323', '/fr/p/tronconneuses-msa-140-systeme-ak-76261#c-b-sans-batterie-ni-chargeur-76261', '/fr/p/tronconneuses-mse-141-76267#mse-141-76267', '/fr/p/tronconneuses-msa-120-systeme-ak-73181#c-b-sans-batterie-ni-chargeur-73181', '/fr/p/tronconneuses-ms-181-1335#ms-181-1335', '/fr/p/tronconneuses-ms-251-1852#ms-251-1852', '/fr/p/tronconneuses-msa-200-gamme-ap-2173#c-b-sans-batterie-sans-chargeur-2173', '/fr/p/tronconneuses-msa-220-gamme-ap-102688#c-b-sans-batterie-sans-chargeur-102688',...]
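The hrefs above are relative to the site root, so before requesting each product page you need absolute URLs. urljoin from the standard library handles this; a short sketch, assuming the base URL https://www.stihl.fr taken from the page address (the list holds the first two hrefs from the output above):

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.stihl.fr'  # assumed from the page address above

# relative hrefs as returned by the category page (first two from the output above)
links = ['/fr/p/tronconneuses-ms-180-54057#ms-180-54057',
         '/fr/p/tronconneuses-ms-170-2323#ms-170-2323']

# resolve each relative href against the site root
product_urls = [urljoin(BASE_URL, href) for href in links]
print(product_urls)
```

Each entry in product_urls can then be fetched with requests.get() to scrape the individual product pages.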

Additional example to get the title, subline, and link in a dict:

import requests
from bs4 import BeautifulSoup

url = 'https://www.stihl.fr/fr/c/tronconneuses-98176'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, 'lxml')

data = []

for item in soup.find_all('a', class_='m_category-overview-tiles__item tile_product-standard'):
    title = item.select_one('div.tile_product-standard__title').get_text()
    subline = item.select_one('div.tile_product-standard__subline').get_text()
    link = item['href']
    data.append({
        'title': title,
        'subline': subline,
        'link': link
    })
print(data)

Output

[{'title': 'MS 180TRONÇONNEUSES',
  'subline': 'Tronçonneuse thermique pour la coupe de bois de chauffage avec tendeur de chaîne latéral',
  'link': '/fr/p/tronconneuses-ms-180-54057#ms-180-54057'},
 {'title': 'MS 170TRONÇONNEUSES',
  'subline': 'Idéale pour la coupe de bois de chauffage ou les petits travaux',
  'link': '/fr/p/tronconneuses-ms-170-2323#ms-170-2323'},
 {'title': 'MSA 140 - Système AKTRONÇONNEUSES',
  'subline': 'Produit vendu sans batterie ni chargeur',
  'link': '/fr/p/tronconneuses-msa-140-systeme-ak-76261#c-b-sans-batterie-ni-chargeur-76261'},...]
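Note that titles like 'MS 180TRONÇONNEUSES' come out concatenated because get_text() joins the texts of nested elements with no separator. Passing separator and strip keeps the parts apart. A sketch on illustrative markup, assuming the title div nests two elements (which would explain the concatenation above):

```python
from bs4 import BeautifulSoup

# illustrative markup assuming the title div nests two elements,
# which would explain the concatenated 'MS 180TRONCONNEUSES' above
html = ('<div class="tile_product-standard__title">'
        '<span>MS 180</span><span>TRONCONNEUSES</span></div>')

title = BeautifulSoup(html, 'html.parser').select_one('div.tile_product-standard__title')

joined = title.get_text()                           # nested texts run together
spaced = title.get_text(separator=' ', strip=True)  # parts kept apart
print(joined, '|', spaced)
```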
