
Java: How to speed up the xpath string generation on a given w3c dom document?

I have the following method, which works on an org.w3c.dom.Document and generates an absolute XPath string for a given element.

I notice it takes a long time to go through the hundreds of elements on a page.

Is there any way to speed it up, or perhaps a different approach?

Important note: I am only given an org.w3c.dom document.

    public String getElementXpath(DOMElement elt) {
        String path = "";

        for (Node fib = (Node) elt; fib != null; fib = fib.getParentNode()) {
            if (fib.getNodeType() == Node.ELEMENT_NODE) {

                DOMElement thisparent = (DOMElement) fib;
                int idx = getElementIdx(thisparent);
                String xname = thisparent.getTagName();

                if (idx >= 1) xname += "[" + idx + "]";
                path = "/" + xname + path;
            }
        }
        return path;
    }

    private int getElementIdx(DOMElement elt) {
        int count = 1;
        for (Node sib = elt.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
            if (sib.getNodeType() == Node.ELEMENT_NODE) {
                DOMElement thiselement = (DOMElement) sib;
                if (thiselement.getTagName().equals(elt.getTagName())) {
                    count++;
                }
            }
        }

        return count;
    }

Your code is O(n^2) in the number of siblings (that is, the maximum fan-out of the tree): when you generate paths for all the children of a node, each call counts back through every previous sibling.

Given any DOM problem, a better approach is always to avoid using DOM. But I don't know if that's an option in your case.

A less radical change would be to restructure your code so that, as it walks the children of a node, it maintains a hash map from each element name encountered to the number of elements seen so far with that name. You can then use this count to generate the subscript (index), rather than counting back through all the previous siblings.
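The counting idea above could look roughly like the following. This is only a sketch under assumptions: it uses the standard org.w3c.dom.Element rather than the DOMElement type from the question, the class name SiblingIndexer and its methods are made up for illustration, and an IdentityHashMap is used so that nodes are compared by reference regardless of how the DOM implementation defines equals().

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question or answers.
final class SiblingIndexer {

    /**
     * Gives every element child of parent its 1-based position among
     * same-named siblings, visiting each child exactly once.
     */
    static Map<Element, Integer> indexChildren(Node parent) {
        Map<String, Integer> seenByName = new HashMap<>();
        Map<Element, Integer> indexByChild = new IdentityHashMap<>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments, and so on
            }
            Element e = (Element) child;
            // merge() increments the running count for this tag name and returns it
            int idx = seenByName.merge(e.getTagName(), 1, Integer::sum);
            indexByChild.put(e, idx);
        }
        return indexByChild;
    }

    /** Small convenience for trying the sketch on an XML string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

With this, indexing all children of a node costs one pass over the children instead of one backward scan per child.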

I am not sure whether you generate XPaths for multiple nodes or just a single node in each DOM document, but if you generate multiple, then you can cache the expressions as suggested by others. It is hard to estimate, but if you want to generate very many XPaths from the same document, you might as well reverse the algorithm and start from the root element. Also note that you can normalize the text nodes if you have a lot of them, but I am unsure of the effect on overall performance ;)

But regardless, iteration over the DOM nodes is really fast. Your String handling is not; in fact, it is somewhat bad. Switch to a single StringBuilder (thanks, Alvin) instead of your current approach (using + to append Strings is compiled into something more complicated; see the javadoc). Make sure you initialize it to a good size in the constructor.
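One way the StringBuilder suggestion might be applied is sketched below. It is an illustration rather than the answerer's exact code: it uses org.w3c.dom.Element instead of the question's DOMElement, the class and method names are invented, and the initial capacity of 16 characters per step is just a guess. The ancestors are collected first so the path can be appended root-first into one builder, instead of prepending to a String on every iteration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question or answers.
final class XPathBuilder {

    static String absoluteXPath(Element elt) {
        // Walk up to the root; push() puts each ancestor in front,
        // so iterating the deque later goes root-first.
        Deque<Element> chain = new ArrayDeque<>();
        for (Node n = elt; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                chain.push((Element) n);
            }
        }
        // Assumed sizing: roughly 16 characters per path step.
        StringBuilder path = new StringBuilder(chain.size() * 16);
        for (Element e : chain) {
            path.append('/').append(e.getTagName())
                .append('[').append(indexAmongSameNamed(e)).append(']');
        }
        return path.toString();
    }

    /** Same backward count as in the question, kept here to stay self-contained. */
    static int indexAmongSameNamed(Element e) {
        int count = 1;
        for (Node sib = e.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
            if (sib.getNodeType() == Node.ELEMENT_NODE
                    && ((Element) sib).getTagName().equals(e.getTagName())) {
                count++;
            }
        }
        return count;
    }

    /** Small convenience for trying the sketch on an XML string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

This changes only the string handling; the sibling counting is still the backward scan from the question.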

You do not really need to check the tag name either; XPath allows a step that matches an element of any name, for example /*[1]/*[2].
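A purely positional path of that form could be generated along these lines. This is a sketch under assumptions: the class PositionalXPath and its methods are hypothetical names, org.w3c.dom.Element stands in for the question's DOMElement, and the resolve method is included only to show that such a path can be evaluated back to the same node with the standard javax.xml.xpath API.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question or answers.
final class PositionalXPath {

    /** Builds a path such as /*[1]/*[2] using positions among all element siblings. */
    static String of(Element elt) {
        Deque<Integer> positions = new ArrayDeque<>();
        for (Node n = elt; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            int pos = 1;
            for (Node sib = n.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
                if (sib.getNodeType() == Node.ELEMENT_NODE) {
                    pos++; // any element counts, no tag name comparison needed
                }
            }
            positions.push(pos); // front of the deque ends up being the root
        }
        StringBuilder path = new StringBuilder(positions.size() * 5);
        for (int pos : positions) {
            path.append("/*[").append(pos).append(']');
        }
        return path.toString();
    }

    /** Evaluates an XPath expression against the document and returns the first match. */
    static Node resolve(Document doc, String expression) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    /** Small convenience for trying the sketch on an XML string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

The trade-off is readability: such paths are cheaper to build but say nothing about which elements they pass through.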

=== New - So you need to use DOM ===

To speed things up you can do caching (as the other answer suggested). Notice that your current code computes the XPath for the same node multiple times (for each node N, the XPath of N is recomputed for every one of N's children and deeper descendants). Here is what I have in mind for caching:

    HashMap<Node, String> xpathCache = new HashMap<Node, String>();
    HashMap<Node, Integer> nodeIndexCache = new HashMap<Node, Integer>();

    public String getElementXpath(DOMElement elt) {
        String path = "";

        for (Node fib = (Node) elt; fib != null; fib = fib.getParentNode()) {
            if (fib.getNodeType() == Node.ELEMENT_NODE) {

                // If this ancestor's path is already cached, prepend it and stop climbing.
                String cachedParentPath = xpathCache.get(fib);

                if (cachedParentPath != null) {
                    path = cachedParentPath + path;
                    break;
                }

                DOMElement thisparent = (DOMElement) fib;
                int idx = getElementIdx(thisparent);
                String xname = thisparent.getTagName();

                if (idx >= 1) xname += "[" + idx + "]";
                path = "/" + xname + path;
            }
        }

        /*
         * Here, not only do you know the xpath to elt, you also
         * know the xpath to the ancestors of elt. You can leverage
         * this to cache the ancestors' xpaths as well. But I just
         * cache elt for illustration purposes.
         *
         * To compute the ancestors' xpaths efficiently, maybe you want to
         * store the xpath using a data structure other than String.
         * Maybe a Stack of Strings?
         */
        if (!xpathCache.containsKey(elt)) {
            xpathCache.put(elt, path);
        }

        return path;
    }

    private int getElementIdx(DOMElement elt) {
        Integer cached = nodeIndexCache.get(elt);
        if (cached != null) {
            return cached;
        }

        // Count preceding element siblings with the same tag name. If one of
        // them already has a cached index, add it and stop scanning early.
        int count = 1;
        for (Node sib = elt.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
            if (sib.getNodeType() == Node.ELEMENT_NODE
                    && ((DOMElement) sib).getTagName().equals(elt.getTagName())) {
                Integer sibIdx = nodeIndexCache.get(sib);
                if (sibIdx != null) {
                    count += sibIdx;
                    break;
                }
                count++;
            }
        }

        /*
         * You can improve index caching even further by also caching the
         * index of every sibling visited in the loop above.
         */
        nodeIndexCache.put(elt, count);

        return count;
    }

It looks like you are given an arbitrary node and have to compute the XPath by tracing back up the node's path? If what you ultimately want is the XPath of every node, the fastest way is to start at the root node and traverse the tree, provided you have a reference to the root node.
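A root-first traversal of that kind might be sketched as follows. The assumptions: the caller has the org.w3c.dom.Document, the class name AllXPaths is invented, and recursion is acceptable because its depth equals the depth of the tree, not the number of nodes. Each element's path is its parent's path plus one step, and the per-name counters mean every child list is scanned only once.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question or answers.
final class AllXPaths {

    /** Returns the absolute XPath of every element in the document. */
    static Map<Element, String> compute(Document doc) {
        Map<Element, String> paths = new IdentityHashMap<>();
        Element root = doc.getDocumentElement();
        if (root != null) {
            visit(root, "/" + root.getTagName() + "[1]", paths);
        }
        return paths;
    }

    private static void visit(Element e, String path, Map<Element, String> paths) {
        paths.put(e, path);
        // Running count of children seen so far, per tag name.
        Map<String, Integer> seenByName = new HashMap<>();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element child = (Element) c;
            int idx = seenByName.merge(child.getTagName(), 1, Integer::sum);
            visit(child, path + "/" + child.getTagName() + "[" + idx + "]", paths);
        }
    }

    /** Small convenience for trying the sketch on an XML string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

After one call to compute, looking up the path of any element is a single map lookup.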

=== OLD ===

You can try using an event-based XML parsing API instead of DOM. The JDK comes with an event parser called SAXParser, which you can start with. There is also StAX that you can try.

An event-based XML parser emits "events" as it does a depth-first traversal, instead of parsing the XML into an in-memory DOM. So the parser visits each element of your XML and emits events like "onOpenTag", "onCloseTag", and "onAttribute". By writing an event handler, you can build and/or store the paths of the elements like this:

...
currentPath = new Stack<String>();

onOpenTag(String tagName) {
   // push the actual tag name, not the literal string "tagName"
   this.currentPath.push(tagName);

   if ("Item".equals(tagName)) {
      cache.store(convertToPathString(currentPath));
   }
}

onCloseTag(String tagName) {
   this.currentPath.pop();
}

The nice thing about an event-based API is that it's fast and saves a lot of memory for big XML.

The bad thing about it is that you have to write more code to get the data you want.
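For reference, here is roughly how the event-handler idea could look with the real SAX API, where the callbacks are named startElement and endElement. Treat it as a sketch: the class SaxPathCollector is a made-up name, it records the path of every element rather than only "Item" elements, it relies on the default non-namespace-aware parser so that qName is the plain tag name, and it only applies if you have access to the XML source rather than an already-built DOM.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original question or answers.
final class SaxPathCollector extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    // Length of the path before each open element was appended, for truncating on close.
    private final Deque<Integer> segmentStarts = new ArrayDeque<>();
    // One per-name counter map for each open element, plus one for the document level.
    private final Deque<Map<String, Integer>> childCounts = new ArrayDeque<>();
    final List<String> paths = new ArrayList<>();

    SaxPathCollector() {
        childCounts.push(new HashMap<>());
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int idx = childCounts.peek().merge(qName, 1, Integer::sum);
        segmentStarts.push(current.length());
        current.append('/').append(qName).append('[').append(idx).append(']');
        paths.add(current.toString());
        childCounts.push(new HashMap<>()); // counters for this element's children
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        childCounts.pop();
        current.setLength(segmentStarts.pop()); // drop this element's step
    }

    /** Parses the XML string and returns the paths in document order. */
    static List<String> collect(String xml) {
        try {
            SaxPathCollector handler = new SaxPathCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.paths;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Because no tree is built, memory use depends on the nesting depth and on how many paths you choose to keep, not on the size of the document.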
