
Issue with Jsoup Traversing DOM Tree

I'm using jsoup to analyse some HTML source files. While traversing the DOM, I ran into something odd:

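    // `doc` (the parsed Document) and `info` (holder of the running counter)
    // are defined elsewhere in the program and are not shown here.
    final HashMap<Node, Integer> idMap = new HashMap<Node, Integer>();
    doc.traverse(new NodeVisitor() {
        @Override
        public void head(Node node, int depth) {
            // First visit: assign the next sequential ID and remember it.
            int sequentialNodeId = info.sequentialId++;
            idMap.put(node, sequentialNodeId);
        }

        @Override
        public void tail(Node node, int depth) {
            // Last visit: look up the ID stored in head().
            System.out.println(idMap.get(node));
        }
    });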

Here I'm using idMap to store the node IDs so that I can retrieve them later in the tail() method. I'm not using node.hashCode() because many different nodes share the same hash code. I once posted a question about that issue, and the jsoup team said it had been fixed, but it still happens to me. I'm not sure whether it has something to do with the HTML source files I'm dealing with.

My problem is that idMap.get(node) leads to many null pointer errors. If head() and tail() are called with the same node, why would this happen?

I need a node ID to record the DFS order of each node, and to access a data structure that is initialized when the node is first visited and modified when the node is last visited. The data structure is unique to each node. I don't know if there is any other way to do this. Any input would be greatly appreciated. Thanks a lot.

Try the latest jsoup version (1.8.3 as of this writing) and retest your code. Feel free to leave a comment below if it still doesn't work.
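If upgrading does not help, one possible workaround is to key the map on object identity rather than on equals()/hashCode(). The sketch below assumes (the post does not confirm this) that the failed lookups come from distinct nodes comparing equal or hashing alike. To stay runnable without jsoup it uses a hypothetical TreeNode class as a stand-in for Node, and a hypothetical VisitInfo record for the per-node data; with jsoup, the same IdentityHashMap would be filled in head() and read in tail().

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class DfsOrderSketch {

    /** Hypothetical stand-in for a DOM node with content-based equals/hashCode. */
    static final class TreeNode {
        final String text;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String text) { this.text = text; }

        TreeNode add(TreeNode child) { children.add(child); return this; }

        @Override public boolean equals(Object o) {
            return o instanceof TreeNode && Objects.equals(text, ((TreeNode) o).text);
        }

        @Override public int hashCode() { return Objects.hashCode(text); }
    }

    /** Per-node data: created on the first visit, updated on the last visit. */
    static final class VisitInfo {
        final int enterId;
        int exitId = -1;

        VisitInfo(int enterId) { this.enterId = enterId; }
    }

    /** Depth-first traversal that records enter/exit order for every node. */
    static Map<TreeNode, VisitInfo> traverse(TreeNode root) {
        // IdentityHashMap compares keys with ==, so equal-looking nodes stay separate.
        Map<TreeNode, VisitInfo> info = new IdentityHashMap<>();
        visit(root, info, new int[] {0});
        return info;
    }

    private static void visit(TreeNode node, Map<TreeNode, VisitInfo> info, int[] counter) {
        info.put(node, new VisitInfo(counter[0]++));   // corresponds to head()
        for (TreeNode child : node.children) {
            visit(child, info, counter);
        }
        info.get(node).exitId = counter[0]++;          // corresponds to tail()
    }

    public static void main(String[] args) {
        TreeNode first = new TreeNode("p");
        TreeNode second = new TreeNode("p");           // equals(first), but a different object
        TreeNode root = new TreeNode("div").add(first).add(second);

        Map<TreeNode, VisitInfo> info = traverse(root);
        for (TreeNode n : new TreeNode[] {root, first, second}) {
            VisitInfo v = info.get(n);
            System.out.println(n.text + " enter=" + v.enterId + " exit=" + v.exitId);
        }
    }
}
```

With a plain HashMap the two "p" nodes here would share one entry, because they compare equal. The identity map gives each node its own entry, which is what a per-node data structure needs. The recursion is only for brevity; a very deep tree would call for an explicit stack.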
