
How to fetch variables inside script entry using Jsoup

I am using Jsoup to crawl data.

Part of the page source looks like this:

<script> var uitkformatter = { dependency: ['uitk_localized_dateApi', 'uitk_localized_priceApi', 'uitk_localized_config'] }; </script><script async defer src="//www.expedia.com/i18n/28/en_US/JPY/currencyFormats.js?module=exp_currencyformats_JPY"></script><script> define('exp_currencyformats', [ 'exp_currencyformats_JPY' ], function() { return window.uitkformatter; }); </script><script async defer src="//b.travel-assets.com/uitoolkit/2-164/3542359672ff5cd9d827c16bd754bf539fd383b1/core/js/uitk-localize-bundle-min.js"></script>
<script language="javascript" type="text/javascript">
OlAltLang = 'en-us.';
</script>
<script type="text/javascript">
'use strict';
require('infositeApplication', function(infositeApplication) {
infositeApplication.start();
});
define('infosite/env', function() {
return {
isJP: true,
isVN: false,
isVSC:false,
isTD:false
};
});
define('infositeData', [], function() {
var infosite = {};
infosite.hotelId = '5522663';
infosite.guid = '59ad4387-979f-477a-901a-6070f3879ce6';
infosite.token = '6a06f2f73106c754340f7a459f5d75d588637caa'; <--This I need to fetch

How can I fetch infosite.token?

Sample code:

//Fetch HTML code
Document document = Jsoup.connect(URL).get();
//Parse the HTML to extract links to other URLs
 Elements linksOnPage = document.select("QUERY TO FETCH infosite.token");

I can't figure out what to write in place of "QUERY TO FETCH infosite.token".

I tried

Element linksOnPage = document.select("script:contains(infosite.token)").first();

but it did not work.

You can find all script elements like this:

Elements scriptElements = doc.getElementsByTag("script");

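You can then iterate over the script elements and use a regular expression to find the variable assignment (e.g. infosite.token = '6a06f2f73106c754340f7a459f5d75d588637caa';), then grab the right-hand side of that assignment.

For example, the following code prints '6a06f2f73106c754340f7a459f5d75d588637caa' (including the single quotes, because the capture group takes everything up to the semicolon):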

String html =
        "<html><script> var uitkformatter = { dependency: ['uitk_localized_dateApi', 'uitk_localized_priceApi', 'uitk_localized_config'] }; </script><script async defer src=\"//www.expedia.com/i18n/28/en_US/JPY/currencyFormats.js?module=exp_currencyformats_JPY\"></script><script> define('exp_currencyformats', [ 'exp_currencyformats_JPY' ], function() { return window.uitkformatter; }); </script><script async defer src=\"//b.travel-assets.com/uitoolkit/2-164/3542359672ff5cd9d827c16bd754bf539fd383b1/core/js/uitk-localize-bundle-min.js\"></script>\n" +
                "<script language=\"javascript\" type=\"text/javascript\">\n" +
                "OlAltLang = 'en-us.';\n" +
                "</script>\n" +
                "<script type=\"text/javascript\">\n" +
                "'use strict';\n" +
                "require('infositeApplication', function(infositeApplication) {\n" +
                "infositeApplication.start();\n" +
                "});\n" +
                "define('infosite/env', function() {\n" +
                "return {\n" +
                "isJP: true,\n" +
                "isVN: false,\n" +
                "isVSC:false,\n" +
                "isTD:false\n" +
                "};\n" +
                "});\n" +
                "define('infositeData', [], function() {\n" +
                "var infosite = {};\n" +
                "infosite.hotelId = '5522663';\n" +
                "infosite.guid = '59ad4387-979f-477a-901a-6070f3879ce6';\n" +
                "infosite.token = '6a06f2f73106c754340f7a459f5d75d588637caa'; </script></html>";

Document doc = Jsoup.parse(html);

Elements scriptElements = doc.getElementsByTag("script");

// the script elements have no identifying characteristic, so we must loop
// until we find the one which contains the "infosite.token" variable
for (Element element : scriptElements) {
    if (element.data().contains("infosite.token")) {
        // find the line which contains 'infosite.token = <...>;'
        Pattern pattern = Pattern.compile(".*infosite\\.token = ([^;]*);");
        Matcher matcher = pattern.matcher(element.data());
        // we only expect a single match here so there's no need to loop through the matcher's groups
        if (matcher.find()) {
            System.out.println(matcher.group(1));
        } else {
            System.err.println("No match found!");
        }
        break;
    }
}
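
As a possible shortcut to the loop above, Jsoup's selector syntax also has a `:containsData(...)` pseudo-selector, which matches against the data of script and style elements. The `:contains(...)` attempt in the question matches element text instead, and script contents are stored as data rather than text, which would explain why it found nothing. The following is only a minimal sketch: the class name `ContainsDataExample`, the helper `findScriptData`, and the shortened token value `abc123` are made up for illustration, and it assumes a Jsoup version recent enough to support `:containsData`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContainsDataExample {

    // Hypothetical helper: returns the data of the first <script> whose
    // contents include the given substring, or null if there is none.
    public static String findScriptData(String html, String needle) {
        Document doc = Jsoup.parse(html);
        // :containsData inspects script/style data, unlike :contains
        Element script = doc.select("script:containsData(" + needle + ")").first();
        return script == null ? null : script.data();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page source; 'abc123' is a placeholder token.
        String html = "<html><script>OlAltLang = 'en-us.';</script>"
                + "<script>var infosite = {}; infosite.token = 'abc123';</script></html>";
        System.out.println(findScriptData(html, "infosite.token"));
    }
}
```

The returned string is still the whole script body, so a regular expression like the one above is needed to pull out the value itself.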

After trying several variants, I got the answer.

Elements scriptElements = document.select("script");
Pattern pattern = Pattern.compile("infosite\\.token = '(.+?)'");

for (Element element : scriptElements) {
    // script contents are exposed as data nodes, not text nodes
    for (DataNode node : element.dataNodes()) {
        Matcher matcher = pattern.matcher(node.getWholeData());
        while (matcher.find()) {
            System.out.println(matcher.group());  // whole match: infosite.token = '...'
            System.out.println(matcher.group(1)); // captured value without the quotes
        }
    }
}
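
The regular-expression step can also be pulled out into a small reusable helper that depends only on `java.util.regex`. This is a sketch under assumptions rather than part of the original answer: the class `ScriptVariables` and method `extractStringAssignment` are hypothetical names, and it assumes the value is a plain single- or double-quoted string literal with no escaped quotes inside it.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptVariables {

    // Hypothetical helper: finds `variableName = '...'` (or "...") in the
    // script source and returns the value without the surrounding quotes.
    public static Optional<String> extractStringAssignment(String script, String variableName) {
        // Pattern.quote escapes the dot in names such as "infosite.token";
        // \1 requires the closing quote to match the opening one.
        Pattern pattern = Pattern.compile(
                Pattern.quote(variableName) + "\\s*=\\s*(['\"])(.*?)\\1");
        Matcher matcher = pattern.matcher(script);
        return matcher.find() ? Optional.of(matcher.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String script = "var infosite = {};\n"
                + "infosite.hotelId = '5522663';\n"
                + "infosite.token = '6a06f2f73106c754340f7a459f5d75d588637caa';";
        System.out.println(
                extractStringAssignment(script, "infosite.token").orElse("No match found!"));
    }
}
```

The string passed in would be whatever `element.data()` or `node.getWholeData()` returns for the matching script element. Returning an `Optional` keeps the "no match" case explicit instead of relying on a null check.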

