
Using CDATA in an XML file to parse HTML data

I have an XML file with malformed HTML in its content. Since an XML parser cannot handle unclosed HTML tags like <br>, I have wrapped the HTML in CDATA for saving and parsing.

While parsing, I call documentBuilder.setCoalescing(true); to recover the data in <![CDATA[<br>test<br>data<br>]]> without the CDATA tag.

But in the output, < and > are replaced by &lt; and &gt; respectively.

I am expecting this string in the parsed result:

<br>test<br>data<br>

How can I do this? Any ideas? Thanks in advance!

UPDATE: I have two follow-up questions.

1. Is there any way to convert malformed HTML (e.g. <br>) into parsable XML (e.g. <br/>) in code? If so, will it handle &nbsp; as well?

2. Is there any way to convert HTML text to plain text in Java (e.g. <div>test&nbsp;text</div> to test text)?

Coalescing means that the parser will convert CDATA nodes to Text nodes. When the document is serialized to XML, of course the text content (HTML) must be escaped. If you want to do something with the HTML you must first extract it as text--then you can render it in a browser, or whatever.
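
The difference between reading the text and serializing it can be seen in a small sketch that uses only the JDK's built-in parser. The class and method names (CoalescingDemo, rawText, serialized) and the <item> wrapper element are made up for illustration and are not from the question. Assuming the CDATA section sits directly under the root element, getTextContent() returns the HTML unchanged, while writing the document back out as XML escapes the markup characters.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CoalescingDemo {

    // Parses the XML with coalescing enabled, so CDATA sections become plain text nodes.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Reads the character data of the root element; nothing is escaped at this point.
    static String rawText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // Serializes the document back to XML; markup characters in text get escaped here.
    static String serialized(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(parse(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<item><![CDATA[<br>test<br>data<br>]]></item>";
        System.out.println("text content: " + rawText(xml));
        System.out.println("serialized:   " + serialized(xml));
    }
}
```

So if the goal is only to get the HTML string back, reading it with getTextContent() (or getNodeValue() on the text node) may be enough, and the escaped form only appears when the DOM is written out as XML again.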

UPDATE:

1) You can use JTidy, http://jtidy.sourceforge.net/index.html , to parse the HTML content and produce XML or XHTML. Something like this:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder db = factory.newDocumentBuilder();
Document doc = db.parse(..); // parse your input document

// Obtain the HTML content; it may be buried deeper down
// or scattered around in different places
String text = doc.getDocumentElement().getTextContent();

// Parse with JTidy to convert from HTML to XHTML
Tidy tidy = new Tidy();
tidy.setXHTML(true);

Document htmlDoc = tidy.parseDOM(new StringReader(text), null);
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.transform(new DOMSource(htmlDoc), new StreamResult(System.out));

2) Yes. Once you have the parsed htmlDoc (above), you can traverse it or apply XPath to extract the text pieces you want. Just remember that &nbsp; will be unescaped to '\u00A0'. So if you want really plain text, you should perhaps do

String s = text.replace('\u00A0', ' ');
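
As a rough sketch of the whole step, the snippet below extracts the text from markup that is already well-formed (for example, what JTidy would produce) and then replaces the non-breaking spaces. It uses only the JDK parser; the class and method names (PlainTextDemo, toPlainText) are made up for illustration. The sample input writes the non-breaking space as &#160; because a plain XML parser does not know the &nbsp; entity unless a DTD defines it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PlainTextDemo {

    // Returns the concatenated text of all elements, with U+00A0 turned into ordinary spaces.
    static String toPlainText(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() drops the tags and keeps only the character data
            String text = doc.getDocumentElement().getTextContent();
            return text.replace('\u00A0', ' ');
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<div>test&#160;text</div>")); // prints "test text"
    }
}
```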

Coalescing is an operation in which the contents of CDATA sections (nodes) are converted to text nodes and merged with the contents of adjacent text nodes. Converting CDATA sections to text nodes imposes the restriction that the resulting text nodes be composed of valid XML characters. This preserves the original document formatting; in other words, the structure of the nodes in the original document does not change.

As a result, of the five predefined entities (&lt;, &gt;, &amp;, &quot; and &apos;), the first three appear in their escaped form, because leaving < , > and & unescaped would change the document structure.

In short, you cannot do what you intend to do by extracting values from the DOM. You'll need to decode the values into what you want after parsing the document. Apache Commons Lang has a utility class, StringEscapeUtils, that provides a method for this.

If you are simply troubled by ill-formed XML, you might consider the tidy tool which can turn your HTML into well-formed XML.

In general, you'll need an XML parser that lets you access the raw content of the CDATA marked sections and then put that raw data to whatever use you have in mind.

@Billu: You can have a look at the Apache Commons Lang class org.apache.commons.lang.StringEscapeUtils. It has escapeXml()/escapeHtml() and unescapeXml()/unescapeHtml() methods. For example, for your first problem of converting &lt; and &gt; back to < and >, you can use unescapeHtml(your-data).

You may not even need to store or pass the data in a CDATA section: you can use escapeXml(data) at the sending/storing end and unescapeXml(data) at the receiving/retrieval end.
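
To show what such an escape/unescape round trip does, here is a hand-written, JDK-only stand-in. It is not the Commons Lang implementation: the class and method names (EntityRoundTrip, escape, unescape) are made up, and it only covers the five predefined XML entities (no numeric character references and no HTML entities such as &nbsp;). In real code the library methods are probably the better choice.

```java
public class EntityRoundTrip {

    // Escapes the five predefined XML entities. '&' must go first so that
    // the ampersands introduced by the other replacements are not escaped again.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Reverses escape(). "&amp;" must go last so that text such as "&amp;lt;"
    // decodes to the literal "&lt;" rather than to "<".
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<br>test<br>data<br>";
        String stored = escape(html);        // safe to place in XML element content
        String restored = unescape(stored);  // the original HTML again
        System.out.println(stored);
        System.out.println(restored);
    }
}
```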

For more information, see the documentation for StringEscapeUtils.

Please let me know if the above information helped you.
