
Jsoup parser not working as expected for particular URL only

I am using Jsoup to download the page content and then parse it.

    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("http://www.toysrus.ch/product/index.jsp?productId=89689681").get();
        final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
        System.out.println(elements.size());
    }

The problem: if you view the source of the page, there is a <dt> tag that contains the text EAN/ISBN:, but if you run the code above, it prints 0 when it should print 1. I have already checked the HTML using document.html(). The HTML tags seem to be there, but the tag I want has been replaced by escaped characters like &lt;dt&gt; where it should be <dt>. The same code works for other product URLs from the same site.
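The symptom described above (real <dt> tags missing, escaped &lt;dt&gt; text present) can be checked mechanically rather than by eye. The following is a minimal sketch using only the Java standard library. The class name EscapedTagCheck, its method names, and the sample strings are hypothetical and not part of the original post. In real use the string passed in would be the output of document.html().

```java
// Hypothetical diagnostic helper; the class name and sample strings are illustrative only.
class EscapedTagCheck {

    // Counts non-overlapping occurrences of needle in haystack.
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // True when the serialized output contains entity-escaped <dt> tags but no real ones,
    // which is the symptom described in the question.
    static boolean dtOnlyEscaped(String serializedHtml) {
        return countOccurrences(serializedHtml, "<dt") == 0
                && countOccurrences(serializedHtml, "&lt;dt") > 0;
    }

    public static void main(String[] args) {
        // In real use, pass document.html() here instead of these made-up samples.
        String parsedAsMarkup = "<dl><dt>EAN/ISBN:</dt><dd>123</dd></dl>";
        String parsedAsText = "&lt;dl&gt;&lt;dt&gt;EAN/ISBN:&lt;/dt&gt;&lt;/dl&gt;";
        System.out.println(dtOnlyEscaped(parsedAsMarkup));
        System.out.println(dtOnlyEscaped(parsedAsText));
    }
}
```

If the check returns true for a page, the parser has treated that part of the markup as plain text, so a selector such as dt:contains(...) has no element to match.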

I have worked with Jsoup before and developed many parsers, but I don't understand why this very simple code is not working. It's strange! Is it a Jsoup bug? Can anybody help me?

When using connect() or parse(), jsoup by default expects valid HTML and reformats the input automatically if needed. You may try the XML parser instead.

    public static void main(String[] args) throws IOException {
        String url = "http://www.toysrus.ch/product/index.jsp?productId=89689681";
        // Read the raw stream and parse it with the XML parser instead of the default HTML parser.
        Document document = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
        // final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
        // Similar to the above, but matches any element whose own text matches "EAN/ISBN":
        final Elements elements = document.getElementsMatchingOwnText("EAN/ISBN");
        System.out.println(elements.size());
    }

You need to put single quotes around the 'EAN/ISBN:' value; otherwise it will be interpreted as a variable.

Also, there is no need to break up the string and concatenate the pieces together. Just put the whole thing in one string.
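The point about concatenation can be illustrated with a small sketch. It shows only that the concatenated selector from the question and a single string literal are the same string. It does not address whether quoting is required. The class SelectorString and the method containsSelector are hypothetical names introduced here, meant for the case where the tag or text is only known at run time.

```java
// Hypothetical illustration; the class and method names are not from the original post.
class SelectorString {

    // Builds a ":contains(...)" selector for a tag from runtime values.
    static String containsSelector(String tag, String text) {
        return tag + ":contains(" + text + ")";
    }

    public static void main(String[] args) {
        // The selector as written in the question, built from three pieces.
        String concatenated = "dt:contains(" + "EAN/ISBN:" + ")";
        // The same selector written as one literal.
        String literal = "dt:contains(EAN/ISBN:)";
        System.out.println(concatenated.equals(literal));
        System.out.println(containsSelector("dt", "EAN/ISBN:").equals(literal));
    }
}
```

Since both forms produce the same selector text, the concatenation in the question affects readability only and is unlikely to explain the result of 0.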
