
Getting an exception when evaluating an XPath expression in Java

I am trying to learn how to use XPath expressions in Java. I am using JTidy to convert an HTML page to XHTML so that I can query it easily with XPath expressions. I have the following code:

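// Imports assumed: java.io.*, java.net.URL, javax.xml.parsers.*, javax.xml.xpath.*,
// org.w3c.dom.*, org.w3c.tidy.Tidy

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();

// Fetch the page and let JTidy build the DOM
Document doc = ConvertXHTML("https://twitter.com/?lang=fr");

// Create XPath
XPathFactory xpathfactory = XPathFactory.newInstance();
XPath Inst = xpathfactory.newXPath();
NodeList nodes = (NodeList) Inst.evaluate("//p/@align", doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    // "//p/@align" selects attribute nodes (Attr), so they cannot be cast to Element
    Node n = nodes.item(i);
    System.out.println(n);
}

public Document ConvertXHTML(String link) {
    try {
        URL u = new URL(link);

        // try-with-resources closes both streams once JTidy has finished
        try (BufferedInputStream instream = new BufferedInputStream(u.openStream());
             FileOutputStream outstream = new FileOutputStream("out.xhtml")) {

            Tidy c = new Tidy();
            c.setShowWarnings(false);
            c.setInputEncoding("UTF-8");
            c.setOutputEncoding("UTF-8");
            c.setXHTML(true);

            // Parses the HTML, writes the tidied XHTML to out.xhtml, and returns the DOM
            return c.parseDOM(instream, outstream);
        }
    } catch (IOException e) {
        // Assumption: surface I/O failures to the caller as an unchecked exception
        throw new UncheckedIOException(e);
    }
}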

It works fine for most URLs, but not for this one:

https://twitter.com/?lang=fr

For that URL I get this exception:

javax.xml.transform.TransformerException: Index -1 out of bounds.....

Below is part of the stack trace:

javax.xml.transform.TransformerException: Index -1 out of bounds for length 128
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:366)
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:303)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImplUtil.eval(XPathImplUtil.java:101)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.eval(XPathExpressionImpl.java:80)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:89)
at files.ExampleCode.GetThoselinks(ExampleCode.java:50)
at files.ExampleCode.DoSomething(ExampleCode.java:113)
at files.ExampleCode.GetThoselinks(ExampleCode.java:81)
at files.ExampleCode.DoSomething(ExampleCode.java:113)

I am not sure whether the problem is in the converted XHTML of the website or somewhere else. Can anyone tell what is wrong in the code? Any edits would be helpful.

I would normally say that an index-out-of-bounds exception coming from deep within the XPath engine is a bug in the XPath engine. The only caveat is if there's something structurally wrong with the DOM that the XPath engine is searching; an XPath processor is entitled to make reasonable assumptions that the DOM is valid, and to crash if it isn't. In that case it would be a bug in Tidy, which created the DOM.
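One way to narrow this down is to run the same expression against a DOM built by the JDK's own parser. If that succeeds, the expression is fine and suspicion shifts to the DOM that JTidy built. Below is a minimal sketch using only JDK classes; the class name `XPathSanityCheck`, its method names, and any sample markup passed to it are hypothetical and not part of the original code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class XPathSanityCheck {

    // Parses well-formed XML text with the JDK parser (no JTidy involved).
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Evaluates an expression that selects attribute nodes and returns their values.
    static List<String> attributeValues(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Attribute nodes are org.w3c.dom.Attr, so read them through the Node interface.
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath expression: " + expression, e);
        }
    }
}
```

Calling `XPathSanityCheck.attributeValues(XPathSanityCheck.parse(xml), "//p/@align")` on a small hand-written document exercises the same expression as in the question without any JTidy-built nodes in the picture.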

I had a similar problem using XPath evaluation on a document produced by JTidy. I got around it by having JTidy serialize the DOM it produced to a file, and then parsing that XML file with javax.xml.parsers.DocumentBuilder to get a second version of the DOM. Bizarre as it seems, using the second one avoided the out-of-bounds exception and worked. Use code like the following:

        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setNamespaceAware(true);
        // If you don't do the following, it will take a full minute to parse the xml document (presumably the time-out
        // period for trying to load the DTD). See https://stackoverflow.com/questions/6204827/xml-parsing-too-slow.
        documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
        // First DOM: built by JTidy ("tidy" is a configured Tidy instance, "input" presumably the page's InputStream)
        Document doc = tidy.parseDOM(input, null);
        // Serialize the JTidy DOM to a temporary file
        try (FileOutputStream fos = new FileOutputStream("temp.xml")) {
            tidy.pprint(doc, fos);
        }
        // Second DOM: re-parse the file with the standard parser and use this one for XPath
        doc = documentBuilder.parse(new File("temp.xml"));
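
If the temporary file is unwanted, the same round trip can presumably be done in memory: let `tidy.pprint(doc, out)` write to a `ByteArrayOutputStream` and hand its bytes to a helper like the one sketched below. The sketch uses only JDK classes so that it stays self-contained; the names `DomRoundTrip` and `reparse` are hypothetical, and the JTidy call itself is left out. One thing to watch for: with namespace-aware parsing, if the serialized markup declares the XHTML namespace, an unprefixed step such as `//p` may match nothing, and `//*[local-name()='p']/@align` is one way around that.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class for illustration only.
final class DomRoundTrip {

    // Re-parses already serialized XHTML bytes into a fresh DOM using the JDK parser.
    static Document reparse(byte[] serializedXhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Skip fetching the external DTD named in the DOCTYPE, as in the code above.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(serializedXhtml));
        } catch (Exception e) {
            // Covers parser configuration, I/O, and well-formedness errors.
            throw new IllegalArgumentException("Could not re-parse serialized XHTML", e);
        }
    }
}
```

The returned document is built entirely by the standard parser, so the XPath engine never touches JTidy's own node implementation.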
