
Parsing JS with Beautiful Soup

I have a page parsed with Beautiful Soup, but it contains this JS code:

<script type="text/javascript">   


var utag_data = {
            customer_id   : "_PHL2883198554", 
            customer_type : "New",
            loyalty_id : "N",
            declined_loyalty_interstitial : "false",
            site_version  : "Desktop Site",
            site_currency: "de_DE_EURO",
            site_region: "uk",
            site_language: "en-GB",


            customer_address_zip : "",
            customer_email_hash :  "",
            referral_source :  "",
            page_type : "product",
            product_category_name : ["Lingerie"],
            product_category_id :[jQuery("meta[name=defaultParent]").attr("content")],
            product_id : ["5741462261401"],
            product_image_url : ["http://images.urbanoutfitters.com/is/image/UrbanOutfitters/5741462261401_001_b?$detailmain$"],
            product_brand : ["Pretty Polly"],
            product_selling_price : ["20.0"],
            promo_id : "6",
            product_referral : ["WOMENS-SHAPEWEAR-LINGERIE-SOLUTIONS-EU"],
            product_name : ["Pretty Polly Shape It Up Tummy Shaping Camisole"],
            is_online_only : true,
            is_back_in_stock : false
}
</script>

How can I get some of the values out of this script? Should I treat it as plain text, that is, store it in a variable, split it, and then pick out the data I need?

Thanks

Once you have the text of the script, for example via

js_text = soup.find('script', type="text/javascript").text

you can use a regex to find the data. There is probably an easier way to do this, but the regex is not hard either.

import re
# One "name : value" pair per line; a trailing comma is dropped from the value
regex = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?)[ \t]*,?[ \t]*$', re.MULTILINE)  # compile regex
pairs = regex.findall(js_text)  # list of (name, value) tuples, one per matching line
js_text = [part.strip() for pair in pairs for part in pair]  # flatten and strip extra white space

This returns a flat list of names and values in name|value|name2|value2... order, which you can work with directly or convert to a dictionary later on.
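The conversion to a dictionary mentioned above could look roughly like this. It is only a sketch: `sample_js` is a shortened stand-in for the real script text, and `PAIR_RE` and `parse_pairs` are names made up for this example. It assumes every property sits on its own line, as in the question.

```python
import re

# Shortened stand-in for the script text (js_text in the answer above).
sample_js = """
var utag_data = {
            customer_id   : "_PHL2883198554",
            site_region: "uk",
            product_brand : ["Pretty Polly"],
            product_selling_price : ["20.0"],
            is_online_only : true,
            is_back_in_stock : false
}
"""

# One "name : value" pair per line; an optional trailing comma is dropped.
PAIR_RE = re.compile(
    r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?)[ \t]*,?[ \t]*$',
    re.MULTILINE,
)


def parse_pairs(js_text):
    """Return a dict mapping each property name to its raw value string."""
    return {name: value for name, value in PAIR_RE.findall(js_text)}


utag = parse_pairs(sample_js)
print(utag["site_region"])     # raw value, still wrapped in its JS quotes
print(utag["is_online_only"])  # JS literal kept as text, not a Python bool
```

The values stay as raw JavaScript source text, so strings keep their quotes and arrays or jQuery calls are not evaluated. Strip the quotes or post-process individual fields as needed.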

