Reading XML tag from MediaWiki using Java

Question

I need to read the output of the 'search' tag from the following URL using Java.

First I need to read the XML into a string from this URL: http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother

I should end up having this:

<api>
<query-continue>
<search sroffset="1"/>
</query-continue>
<query>
<searchinfo totalhits="55180"/>
<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>
</query>
</api>

Once I have the XML, I need to get the content of the 'search' tag. Its output looks like this, and I need two parts from the line in the middle:

<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>

In the end, all I need is two strings equal to these:

String title = "Big Brothers Big Sisters of America";
String snippet = "<span class='searchmatch'>Big...";

Can someone please help me amend this code? I am not sure what I am doing wrong. I don't think it's even retrieving the XML from the URL, much less the tags inside the XML.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother");
doc.getDocumentElement().normalize();

XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression expr = xpath.compile("//query/search/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getNodeValue());
}

Sorry, I am a newbie and can't find the answer to this anywhere.

Answer 1

The main problem is that you're asking for text nodes that are children of <search>, but the <p ...> you want is not a text node: it's an element. (In fact, the <search> element has no text node children, as you can see by viewing the response from that URL with "View Source".)

So what you want to do is change your XPath expression to

//query/search/p

which will give you the p element node. Then ask for the value of this node's two attributes title and snippet in your Java code:

Element e = (Element)(nodes.item(i));
String title = e.getAttribute("title");
String snippet = e.getAttribute("snippet");
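
To see the whole flow in one place, here is a minimal sketch of the element-based approach. It parses a small hard-coded sample instead of calling the live URL, so it can run offline. The class name SearchAttributesDemo, the method firstResult and the shortened SAMPLE_XML are assumptions made for illustration, not part of the original code. The sample also assumes the markup inside the snippet attribute arrives escaped (&lt;span ...&gt;), which is what well-formed XML requires.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SearchAttributesDemo {

    // Hypothetical sample shaped like the response in the question (shortened).
    // Markup inside the snippet attribute is escaped, as well-formed XML requires.
    static final String SAMPLE_XML =
        "<api><query><searchinfo totalhits=\"55180\"/><search>"
        + "<p ns=\"0\" title=\"Big Brothers Big Sisters of America\""
        + " snippet=\"&lt;span class='searchmatch'&gt;Big&lt;/span&gt; Brothers\""
        + " size=\"13008\"/>"
        + "</search></query></api>";

    /** Returns {title, snippet} of the first p element under query/search, or null if there is none. */
    static String[] firstResult(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Select element nodes, not text nodes.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//query/search/p", doc, XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                return null;
            }

            // The values we want live in attributes of the p element.
            Element p = (Element) nodes.item(0);
            return new String[] { p.getAttribute("title"), p.getAttribute("snippet") };
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse search response", e);
        }
    }

    public static void main(String[] args) {
        String[] result = firstResult(SAMPLE_XML);
        System.out.println("title   = " + result[0]);
        System.out.println("snippet = " + result[1]);
    }
}
```

To run this against the real API, you could pass the URL string to builder.parse(...) as in the question instead of wrapping a string in an InputSource.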

Or, you could do two XPath queries, one for each attribute:

//query/search/p/@title

and

//query/search/p/@snippet

assuming there will only be one <p> element. If you were doing this over multiple <p> elements, you'd probably want to keep each pair of attributes together instead of having two separate lists of results.
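
The difference between the two approaches can be sketched as follows. This is only an illustration under stated assumptions: the class SearchPairsDemo, its methods and the two-result SAMPLE_XML (with made-up titles and snippets) are hypothetical. The attribute query is evaluated as a string, which gives the value of the first matching attribute only, while the element query walks every p element and keeps each title with its snippet.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SearchPairsDemo {

    // Hypothetical two-result response; titles and snippets are made up for illustration.
    static final String SAMPLE_XML =
        "<api><query><search>"
        + "<p title=\"First page\" snippet=\"first snippet\"/>"
        + "<p title=\"Second page\" snippet=\"second snippet\"/>"
        + "</search></query></api>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Attribute query: evaluated as a string, it yields the first matching title only. */
    static String firstTitle(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//query/search/p/@title", doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Element query: one {title, snippet} pair per p element, so the values stay together. */
    static List<String[]> allPairs(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//query/search/p", doc, XPathConstants.NODESET);
            List<String[]> pairs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                pairs.add(new String[] { p.getAttribute("title"), p.getAttribute("snippet") });
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_XML);
        System.out.println("first title: " + firstTitle(doc));
        for (String[] pair : allPairs(doc)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```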

solution1
2 ACCEPTED 2014-09-03 15:44:17
