
Parse HTML using XPath

I need to fetch HTML from a server and parse it using XPath (XPath is a hard requirement; I can't use anything else). My code:

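// Clean the raw HTML into HtmlCleaner's tag tree, then convert it to a W3C DOM.
TagNode tagNode = new HtmlCleaner().clean(html);
Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
XPath xpathObject = XPathFactory.newInstance().newXPath();
// Evaluate against the DOM document, not the raw HTML string.
NodeList nodes = (NodeList) xpathObject.evaluate(xpathString, doc, XPathConstants.NODESET);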

It works well, but clean() takes a long time (a single page can take more than 30 s).

I found another solution using Jsoup, so my new code is:

// Fully qualified to avoid a clash with org.w3c.dom.Document below.
org.jsoup.nodes.Document doc = Jsoup.parse(html);
W3CDom w3cDom = new W3CDom();
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(doc);

XPath xpathObject = XPathFactory.newInstance().newXPath();
String str = (String) xpathObject.evaluate(xpathString, w3cDoc, XPathConstants.STRING);

Now parsing and converting to org.w3c.dom.Document takes about 1 s and evaluation takes about 0.4 s, roughly 1.5 s in total. But this is still very slow.

How can I speed up the processing further?

We use regex patterns over a single string containing the HTML. This approach is more stable when the structure of the HTML document occasionally changes (after a page redesign, etc.).
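
As a rough illustration of the regex approach, here is a minimal sketch. The pattern, the `h2 class="title"` markup, and the names `RegexExtract` and `extractTitles` are all hypothetical and only stand in for whatever fragment you actually need; a real page would need its own pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Hypothetical pattern: captures the inner text of every <h2 class="title"> element.
    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at the first closing tag.
    // Compiled once and reused, since compiling a Pattern is the costly step.
    private static final Pattern TITLE = Pattern.compile(
            "<h2\\s+class=\"title\"[^>]*>(.*?)</h2>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    public static void main(String[] args) {
        String html = "<html><body><h2 class=\"title\">First</h2>"
                + "<p>text</p><h2 class=\"title\"> Second </h2></body></html>";
        System.out.println(extractTitles(html)); // prints [First, Second]
    }
}
```

No tree is built here, so the cost is a single scan of the string. The trade-off is that the pattern only knows about the fragment it matches, not the document structure, so it cannot replace XPath where XPath is a hard requirement.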

How can I speed up the processing further?

Move away from DOM-based parsers (memory intensive) and towards an event-based approach (SAX parsers).

https://en.wikipedia.org/wiki/Simple_API_for_XML

With a SAX parser, you basically implement a stack to extract the nodes of interest.
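
The stack idea can be sketched roughly as follows, using the JDK's built-in SAX parser. This assumes the input is well-formed XHTML; real-world HTML often is not, so it would first need a tolerant SAX front end. The names `SaxPathExtract`, `PathHandler`, and `extract` are made up for this example, and the simple absolute path it matches is only a stand-in for a full XPath expression.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxPathExtract {

    /** Collects the text of every element whose path from the root equals targetPath. */
    static class PathHandler extends DefaultHandler {
        private final String targetPath;
        private final Deque<String> stack = new ArrayDeque<>(); // open elements, root first
        private final List<String> results = new ArrayList<>();
        private StringBuilder text; // non-null only while inside a matching element

        PathHandler(String targetPath) {
            this.targetPath = targetPath;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            stack.addLast(qName);
            if (text == null && currentPath().equals(targetPath)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (text != null && currentPath().equals(targetPath)) {
                results.add(text.toString().trim());
                text = null;
            }
            stack.removeLast();
        }

        private String currentPath() {
            return "/" + String.join("/", stack);
        }

        List<String> results() {
            return results;
        }
    }

    public static List<String> extract(String xhtml, String path) throws Exception {
        PathHandler handler = new PathHandler(path);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.results();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><p>one</p><p>two</p></div><p>skip</p></body></html>";
        System.out.println(extract(xhtml, "/html/body/div/p")); // prints [one, two]
    }
}
```

Only the stack of open element names and the text of matching nodes are held in memory, which is where the saving over a DOM comes from.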


 