Removing duplicated newlines/tabs/whitespaces in XML character element

Question

<node> test
    test
    test
</node>

I want my XML parser to read the characters in <node> and:

replace newlines and tabs with spaces and collapse multiple spaces into one. As a result, the text should look like "test test test".
If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.

I'm trying the code below, but it preserves duplicated whitespace.

  dbf = DocumentBuilderFactory.newInstance();
  dbf.setIgnoringComments( true );
  dbf.setNamespaceAware( namespaceAware );
  db = dbf.newDocumentBuilder();
  doc = db.parse( inputStream );

Is there any way to do what I want?

Thanks!

Answer 1

The first part - collapsing runs of whitespace - is relatively easy, though I don't think the parser will do it for you:

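import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Parse the input into a DOM document.
InputSource stream = new InputSource(inputStream);
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);

// Select every text node and collapse each run of whitespace to one space.
NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
    XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
  Text text = (Text) nodes.item(i);
  // "\\s+" rather than "\\s{2,}", so a lone newline or tab is replaced too
  text.setTextContent(text.getTextContent().replaceAll("\\s+", " "));
}

// check results
TransformerFactory.newInstance()
    .newTransformer()
    .transform(new DOMSource(doc), new StreamResult(System.out));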
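For anyone who wants to try this without an input stream at hand, here is a minimal self-contained sketch of the same idea applied to the sample from the question. The class name WhitespaceCollapseDemo and both helper methods are made up for illustration, it walks the DOM directly instead of using XPath, and the final trim() is my own assumption about how the leading and trailing space should be handled.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class WhitespaceCollapseDemo {

    // Hypothetical helper: collapse every run of whitespace in each text node
    // below 'node' into a single space, recursing into child elements.
    static void collapseWhitespace(Node node) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                child.setNodeValue(child.getNodeValue().replaceAll("\\s+", " "));
            } else {
                collapseWhitespace(child);
            }
        }
    }

    // Hypothetical helper: parse 'xml', collapse whitespace, and return the
    // trimmed text content of the root element.
    static String normalizedRootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            collapseWhitespace(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<node> test\n    test\n\ttest\n</node>";
        System.out.println("[" + normalizedRootText(xml) + "]"); // prints [test test test]
    }
}
```

Without the trim() the text would come out as " test test test ", because the whitespace at both ends of the element is collapsed to a single space rather than removed.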

This is the hard part:

If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.

The parser will always expand "&#x9;" into a tab character ("\t") before your code sees it - you may need to write your own XML parser.

According to the author of Saxon:

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
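
You can see the expansion for yourself with a small check like the one below. It is only a sketch: the class name CharRefExpansionDemo and the rootText helper are made up for illustration, and it uses the JDK's default DocumentBuilder. It parses one element that contains the reference &#x9; and one that contains a literal tab, then compares what the DOM hands back.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefExpansionDemo {

    // Hypothetical helper: parse 'xml' and return the root element's text.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Same character, written once as a reference and once literally.
        String withReference = rootText("<node>a&#x9;b</node>");
        String withLiteralTab = rootText("<node>a\tb</node>");

        System.out.println(withReference.equals(withLiteralTab)); // prints true
        System.out.println((int) withReference.charAt(1));        // prints 9
    }
}
```

Both strings are identical once parsed, so any whitespace cleanup that runs on the DOM has no way to tell which tabs were originally written as references.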
