
Extract content from a <script> tag when scraping with BS4

I'm trying to extract information from a "script" tag. The code is as follows:

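    import requests
    from bs4 import BeautifulSoup

    # `headers` is not shown in the question; a browser-like user agent is assumed here.
    headers = {"user-agent": "Mozilla/5.0"}

    response = requests.get("https://www.zalando.es/jordan-air-jordan-mid-zapatillas-altas-blackdark-beetrootwhitehyper-royal-joc11a024-g11.html?hl=1610800800024", headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')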
 
    marca = soup.find("h3", {"class":"OEhtt9 ka2E9k uMhVZi uc9Eq5 pVrzNP _5Yd-hZ"}).text
    nombre = soup.find("h1", {"class":"OEhtt9 ka2E9k uMhVZi z-oVg8 pVrzNP w5w9i_ _1PY7tW _9YcI4f"}).text
    color = soup.find("span", {"class":"u-6V88 ka2E9k uMhVZi dgII7d z-oVg8 pVrzNP"}).text
    precio = soup.find("span", {"class":"uqkIZw ka2E9k uMhVZi FxZV-M z-oVg8 pVrzNP"}).text
    talla = soup.find("span", {"class":"u-6V88 ka2E9k uMhVZi FxZV-M z-oVg8 pVrzNP"}).text
    imagen = soup.find("img", {"class": "_6uf91T z-oVg8 u-6V88 ka2E9k uMhVZi FxZV-M _2Pvyxl JT3_zV EKabf7 mo6ZnF _1RurXL mo6ZnF PZ5eVw"})['src']


    # `api` is a URL prefix defined elsewhere in the original script (not shown).
    # Split the 16th <script> tag once and reuse the pieces.
    parts = str(soup.find_all('script')[15]).split('sku":"')

    sku355 = api + parts[3][:-137]
    sku36 = api + parts[4][:-139]
    sku365 = api + parts[5][:-139]
    sku375 = api + parts[6][:-137]
    sku38 = api + parts[7][:-139]
    sku385 = api + parts[8][:-137]
    sku39 = api + parts[9][:-137]
    sku40 = api + parts[10][:-139]
    sku405 = api + parts[11][:-137]
    sku41 = api + parts[12][:-137]
    sku42 = api + parts[13][:-139]
    sku425 = api + parts[14][:-137]
    sku43 = api + parts[15][:-125]

    print(sku355)
    print(sku36)
    print(sku365)
    print(sku375)
    print(sku38)
    print(sku385)
    print(sku39)
    print(sku40)
    print(sku405)
    print(sku41)
    print(sku42)
    print(sku425)
    print(sku43)

Everything works perfectly with these shoes, but when I switch to another product, for example the link below, I get something different. What I want is to extract the SKU for each size, regardless of which product link is used.

https://www.zalando.es/nike-sportswear-air-force-1-gtx-unisex-zapatillas-anthraciteblackbarely-grey-ni115o01u-q11.html

I could not reproduce your example; it would help if you improved your question.

Just in case, if you only want to grab the sizes and their SKUs, try the following:

    import json

    import requests
    from bs4 import BeautifulSoup

    headers = {"user-agent": "Mozilla/5.0"}
    response = requests.get("https://www.zalando.es/jordan-air-jordan-mid-zapatillas-altas-blackdark-beetrootwhitehyper-royal-joc11a024-g11.html?hl=1610800800024", headers=headers)

    soup = BeautifulSoup(response.content, 'lxml')

    # The product data is JSON wrapped in a CDATA section inside this script tag.
    json_object = json.loads(soup.select_one('script#z-vegas-pdp-props').contents[0].split('CDATA')[1].split(']>')[0])

    # Each unit is one size; its 'id' is the SKU for that size.
    for item in json_object[0]['model']['articleInfo']['units']:
        print('sku:{0} - size:{1}'.format(item['id'], item['size']['local']))

Output

sku:JOC11A024-G110005000 - size:35.5
sku:JOC11A024-G110055000 - size:36
sku:JOC11A024-G110006000 - size:36.5
sku:JOC11A024-G110065000 - size:37.5
sku:JOC11A024-G110007000 - size:38
sku:JOC11A024-G110075000 - size:38.5
sku:JOC11A024-G110008000 - size:39
sku:JOC11A024-G110085000 - size:40
sku:JOC11A024-G110009000 - size:40.5
sku:JOC11A024-G110095000 - size:41
sku:JOC11A024-G110010000 - size:42
sku:JOC11A024-G110105000 - size:42.5
sku:JOC11A024-G110011000 - size:43
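
Since the goal is to get the SKUs regardless of the link, the parsing step above could be wrapped in a small helper. The following is only a sketch: the function name `skus_by_size`, the regular expression, and the `sample` string are made up for illustration, and it assumes the script text has the same CDATA wrapper and JSON path (`model` / `articleInfo` / `units`) as in the answer above.

```python
import json
import re

# Assumption: the payload is wrapped as <![CDATA[ ... ]]>, as in the answer above.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def skus_by_size(script_text):
    """Map each local size to its SKU, given the text of the props script tag."""
    match = CDATA_RE.search(script_text)
    # Fall back to the raw text if no CDATA wrapper is present.
    payload = match.group(1) if match else script_text
    data = json.loads(payload)
    # Accept either the bare object or a one-element list around it.
    props = data[0] if isinstance(data, list) else data
    units = props["model"]["articleInfo"]["units"]
    return {unit["size"]["local"]: unit["id"] for unit in units}


# Hypothetical, hand-written sample in the same shape as the JSON path above.
sample = (
    '<![CDATA[{"model": {"articleInfo": {"units": ['
    '{"id": "JOC11A024-G110005000", "size": {"local": "35.5"}}, '
    '{"id": "JOC11A024-G110055000", "size": {"local": "36"}}'
    ']}}}]]>'
)

for size, sku in skus_by_size(sample).items():
    print("sku:{0} - size:{1}".format(sku, size))
```

With the `soup` object from the answer, it could be called as `skus_by_size(soup.select_one('script#z-vegas-pdp-props').contents[0])`. Whether the second product page uses the same script id and JSON layout has not been verified here.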

