Parse HTML using XPath

Question

I need to get HTML from a server and parse it using XPath (XPath is a hard requirement; I can't use anything else). My code:

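// Clean the raw HTML into HtmlCleaner's tag tree, then convert it to a W3C DOM
TagNode tagNode = new HtmlCleaner().clean(html);
Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
XPath xpathObject = XPathFactory.newInstance().newXPath();
// Evaluate against the DOM document, not the raw HTML string
NodeList nodes = (NodeList) xpathObject.evaluate(xpathString, doc, XPathConstants.NODESET);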

It works, but the clean() function takes a long time (it can take more than 30 s for a single page).

I found another solution using Jsoup, so my new code is:

// Parse with Jsoup (this Document is org.jsoup.nodes.Document), then convert to a W3C DOM
Document doc = Jsoup.parse(html);
W3CDom w3cDom = new W3CDom();
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(doc);

XPath xpathObject = XPathFactory.newInstance().newXPath();
String str = (String) xpathObject.evaluate(xpathString, w3cDoc, XPathConstants.STRING);

Now parsing and converting to org.w3c.dom.Document takes about 1 s, plus 0.4 s for the evaluation, so roughly 1.5 s in total. But this is still very slow.

How can I speed up the processing further?

Answer 1

We use regex patterns over a single string containing the HTML. This approach is more stable when the structure of the HTML document occasionally changes (after a page redesign, etc.).
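
A minimal sketch of what this could look like in Java, using only java.util.regex. The class name RegexExtractor, the pattern, and the sample markup are hypothetical, and the pattern assumes double-quoted href attributes; real pages would likely need a pattern tuned to their markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractor {
    // Hypothetical pattern: captures the value of a double-quoted href
    // attribute on <a> tags. Compiled once and reused across pages.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every href value found in the given HTML string, in order. */
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/a\">A</a> and <a class=\"x\" href=\"/b\">B</a></p>";
        System.out.println(hrefs(html));
    }
}
```

No DOM is built here, so the cost is a single scan of the string; the trade-off is that the pattern only sees text, not document structure.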

Answer 2

How can I speed up the processing further?

Move away from DOM-based parsers (memory intensive) and move towards an event-based approach (SAX parsers).

https://en.wikipedia.org/wiki/Simple_API_for_XML

With a SAX parser you basically implement a stack to extract the nodes of interest.
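
The stack idea can be sketched roughly as follows, using the SAX parser that ships with the JDK. This is an illustration under assumptions, not a drop-in replacement for the XPath code: the class name SaxPathExtractor and the simple "html/body/div/p" path syntax are hypothetical, and the JDK parser only accepts well-formed XML, so real-world HTML would first have to go through an HTML-tolerant SAX front end.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPathExtractor {

    /**
     * Returns the text content of every element whose path from the root
     * equals targetPath (e.g. "html/body/div/p"). Assumes well-formed input.
     */
    static List<String> extract(String xhtml, String targetPath) throws Exception {
        List<String> results = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>(); // open elements, root first
        StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean capturing = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                stack.addLast(qName);
                if (String.join("/", stack).equals(targetPath)) {
                    capturing = true;   // entered a node of interest
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (capturing) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (String.join("/", stack).equals(targetPath)) {
                    results.add(text.toString()); // leaving the node of interest
                    capturing = false;
                }
                stack.removeLast();
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return results;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><p>first</p><p>second <b>bold</b></p></div></body></html>";
        System.out.println(extract(xhtml, "html/body/div/p"));
    }
}
```

Only the stack of open element names and the text of the current match are held in memory, which is where the saving over building a full DOM comes from. The price is that anything beyond simple absolute paths (predicates, axes) has to be hand-coded in the handler.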
