简体   繁体   中英

Reading XML tag from MediaWiki using Java

I need to read output of 'search' tag from following url usign Java.

First I need to read XML into some string from following URL: http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother

I should end up having this:

<api>
<query-continue>
<search sroffset="1"/>
</query-continue>
<query>
<searchinfo totalhits="55180"/>
<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>
</query>
</api>

Then once I have the XML, I need to get content of the search tag: Output of 'search' tag looks like this and I need to get two parts from the code in the middle:

<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>

At the end, all I need is to have two strings, which would equal to this:

String title = Big Brothers Big Sisters of America
String snippet = "<span class='searchmatch'>Big..."

Can someone please help me amending this code, I am not sure what I am doing wrong. I don't think it's even retrieving XML from url, much less the tags inside the XML.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother");
doc.getDocumentElement().normalize();

XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression expr = xpath.compile("//query/search/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i=0; i<nodes.getLength();i++){
System.out.println(nodes.item(i).getNodeValue());
}

Sorry, I am a newbie and can't find the answer to this anywhere.

The main problem here is that you're asking for text nodes that are children of <search> , but in fact the <p ..> that you want is not a text node: it's an element. (In fact, the <search> element has no text node children, as you can tell when you view the response from that URL using "View Source".)

So what you want to do is change your XPath expression to

//query/search/p

which will give you the p element node. Then ask for the value of this node's two attributes title and snippet in your Java code:

Element e = (Element)(nodes.item(i));
String title = e.getAttribute("title");
String snippet = e.getAttribute("snippet");

Or, you could do two XPath queries, one for each attribute:

//query/search/p/@title

and

//query/search/p/@snippet

assuming there will only be one <p> element. If you were doing this over multiple <p> elements, you'd probably want to keep each pair of attributes together instead of having two separate lists of results.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM