How to get text within a <script> tag

I am scraping the LaneBryant website.

Part of the source code is:

<script type="application/ld+json">
{
"@context": "http://schema.org/",
"@type": "Product",
"name": "Flip Sequin Teach & Inspire Graphic Tee",
"image": [
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477",
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477_Back"
],
"description": "Get inspired with [...]",
"brand": "Lane Bryant",
"sku": "356861",
"offers": {
"@type": "Offer",
"url": "https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861",
"priceCurrency": "USD",
"price":"44.95",
"availability": "http://schema.org/InStock",
"itemCondition": "https://schema.org/NewCondition"
}
}
</script>

In order to get the price in USD, I have written this script:

def getPrice(self, start):
    fprice = []
    discount = ""

    price1 = start.find('script', {'type': 'application/ld+json'})
    data = ""
    #print("price 1 is + "+ str(price1)+"data is "+str(data))
    price1 = str(price1).split(",")
    #price1=str(price1).split(":")
    print("final price +" + str(price1[11]))

where start is:

from bs4 import BeautifulSoup
from selenium import webdriver

d = webdriver.Chrome('/Users/fatima.arshad/Downloads/chromedriver')
d.get(url)
start = BeautifulSoup(d.page_source, 'html.parser')

It doesn't print the price even though I am getting the correct text. How do I get just the price?

In this instance you can just use a regex for the price:

import requests, re

r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r'"price":"(.*?)"')
print(p.findall(r.text)[0])
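
The pattern above assumes the page writes "price":"44.95" with no whitespace around the colon, and indexing [0] raises IndexError when nothing matches. As a minimal sketch, assuming the markup may vary slightly, the pattern can tolerate optional whitespace and return None on a miss. The sample_html string and the extract_price helper are hypothetical names used for illustration; the sample is trimmed from the snippet in the question.

```python
import re

# Hypothetical sample, trimmed from the ld+json snippet in the question.
sample_html = '''
<script type="application/ld+json">
{"offers": {"@type": "Offer", "priceCurrency": "USD", "price":"44.95"}}
</script>
'''

# Allow optional whitespace around the colon; capture up to the closing quote.
PRICE_RE = re.compile(r'"price"\s*:\s*"([^"]*)"')


def extract_price(html):
    """Return the first quoted price found in html, or None if there is none."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


print(extract_price(sample_html))      # 44.95
print(extract_price("<html></html>"))  # None
```

Because the pattern requires a closing quote right after price, it does not match the "priceCurrency" key.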

Otherwise, target the appropriate script tag by id and then parse its .text with the json library:

import requests, json
from bs4 import BeautifulSoup 

r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
start = BeautifulSoup(r.text, 'html.parser')
data = json.loads(start.select_one('#pdpInitialData').text)
price = data['pdpDetail']['product'][0]['price_range']['sale_price']
print(price)

price1 = start.find('script', {'type': 'application/ld+json'})

This is actually the <script> tag, so a better name would be:

script_tag = start.find('script', {'type': 'application/ld+json'})

You can access the text inside the script tag using .text. That will give you the JSON in this case.

json_string = script_tag.text

Instead of splitting by commas, use a JSON parser to avoid misinterpretations:

import json
clothing = json.loads(json_string)
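
Once the JSON is parsed, the price is an ordinary dictionary lookup. The following is a sketch, assuming the script text has the shape shown in the question (an offers object with price and priceCurrency keys). Here json_string is a hard-coded stand-in for script_tag.text, and get_offer_price is a hypothetical helper name.

```python
import json
from decimal import Decimal

# Stand-in for script_tag.text, trimmed from the snippet in the question.
json_string = '''
{
  "@type": "Product",
  "name": "Flip Sequin Teach & Inspire Graphic Tee",
  "description": "Get inspired with [...]",
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "44.95"
  }
}
'''


def get_offer_price(text):
    """Parse ld+json text and return (price, currency) from its offers object."""
    product = json.loads(text)
    offer = product["offers"]
    # Decimal avoids floating-point rounding when working with money.
    return Decimal(offer["price"]), offer["priceCurrency"]


price, currency = get_offer_price(json_string)
print(price, currency)  # 44.95 USD
```

Unlike str(price1).split(",")[11], this lookup does not depend on how many commas appear in earlier fields such as the description.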
