How to split Javascript code (bs4) using Python

I have been having some issues when trying to scrape JavaScript values from HTML parsed with bs4.

Basically, the JavaScript looks like this:

<script type="text/javascript">
var FancyboxI18nClose = 'Close';
var FancyboxI18nNext = 'Next';
var FancyboxI18nPrev = 'Previous';
var PS_CATALOG_MODE = false;
var ajaxsearch = true;
var attribute_anchor_separator = '-';
var blocksearch_type = 'top';
var combinationsFromController = {"163972":{"attributes_values":{"15":"40"},"attributes":[75],"price":0,"specific_price":false,"ecotax":0,"weight":0.6,"quantity":1,"reference":"IDP20059--IDPA163972","unit_impact":0,"minimal_quantity":"1","date_formatted":"","available_date":"","id_image":-1,"list":"'75'"}};
var comparator_max_item = 0;
</script>

What I am trying to do is scrape the value assigned to var combinationsFromController. This is what I tried:

bs4 = soup(requests.text, 'html.parser')

for nosto_sku_tag in bs4.find_all('script', {'type': 'text/javascript'}):
    if 'combinationsFromController' in nosto_sku_tag.text.strip():
        print(nosto_sku_tag)
        for att, values in json.loads(
                re.findall('var combinationsFromController = (\{.*}?);', nosto_sku_tag.text.strip())[0][:-1]).values():
            print(values)

This gives me the error Expecting ',' delimiter: line 1 column 4112 (char 4111).

I realized that whenever I run

for nosto_sku_tag in bs4.find_all('script', {'type': 'text/javascript'}):
    if 'combinationsFromController' in nosto_sku_tag.text.strip():
        print(nosto_sku_tag)
        print("---------")

the output is:

var FancyboxI18nClose = 'Close';
var FancyboxI18nNext = 'Next';
var FancyboxI18nPrev = 'Previous';
var PS_CATALOG_MODE = false;
var ajaxsearch = true;
var attribute_anchor_separator = '-';
var blocksearch_type = 'top';
var combinationsFromController = {"163972":{"attributes_values":{"15":"40"},"attributes":[75],"price":0,"specific_price":false,"ecotax":0,"weight":0.6,"quantity":1,"reference":"IDP20059--IDPA163972","unit_impact":0,"minimal_quantity":"1","date_formatted":"","available_date":"","id_image":-1,"list":"'75'"}};
var comparator_max_item = 0;
----------------------------

This seems to mean that the JavaScript is treated as a single block of text, which I believe may need to be split. I tried using a regex for that, but it didn't help.

So my question is: how can I scrape ONLY the value of var combinationsFromController?

Use the following regex pattern to isolate the entire JavaScript object that is assigned to that variable.

combinationsFromController = (.*?);

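As an offline illustration, the pattern can be applied to the script text from the question without making any request. This is only a sketch: `script_text` is a shortened copy of the question's sample, and the names `pattern`, `match`, and `combinations` are chosen here for illustration.

```python
import json
import re

# Shortened copy of the <script> contents shown in the question.
script_text = """
var blocksearch_type = 'top';
var combinationsFromController = {"163972":{"attributes_values":{"15":"40"},"attributes":[75],"price":0,"specific_price":false,"ecotax":0,"weight":0.6,"quantity":1,"reference":"IDP20059--IDPA163972","unit_impact":0,"minimal_quantity":"1","date_formatted":"","available_date":"","id_image":-1,"list":"'75'"}};
var comparator_max_item = 0;
"""

# Non-greedy capture: everything after the assignment up to the first ';'.
pattern = re.compile(r'combinationsFromController = (.*?);', re.DOTALL)
match = pattern.search(script_text)

# The captured text is valid JSON here, so json.loads can parse it directly.
combinations = json.loads(match.group(1))

for combination_id, values in combinations.items():
    print(combination_id, values["reference"], values["quantity"])
```

Because the capture group ends just before the semicolon, no extra slicing such as `[:-1]` is needed; in the question's attempt that slice removes the final closing brace and leaves the JSON incomplete.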

For example:

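import json
import re

import requests

# url is the address of the page being scraped (not given in the original post)
r = requests.get(url)
# Capture everything between the assignment and the first following semicolon
p = re.compile(r'combinationsFromController = (.*?);', re.DOTALL)
data = json.loads(p.findall(r.text)[0])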
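One possible weakness of a non-greedy `(.*?);` capture is that it stops at the first semicolon, so it would cut the object short if a string value inside it ever contained `;`. A different technique, not part of the answer above, is to locate the assignment with a regex and let `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. The sketch below assumes the assigned literal is valid JSON; the helper name `extract_js_object` and the sample text (including the `"a;b"` value) are hypothetical.

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Return the JSON value assigned to var_name, or None if it is absent."""
    # Find "<var_name> = " and consume the whitespace before the value,
    # since raw_decode expects the value to start exactly at the index given.
    match = re.search(r'\b' + re.escape(var_name) + r'\s*=\s*', script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Hypothetical sample: the "note" string contains a semicolon on purpose.
sample = """var a = 1;
var combinationsFromController = {"1": {"note": "a;b", "list": "'75'"}};
var b = 0;"""

data = extract_js_object(sample, "combinationsFromController")
print(data)
```

This only works while the JavaScript literal is also valid JSON (double-quoted keys, no trailing commas); `raw_decode` raises `json.JSONDecodeError` otherwise.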
