
Java: Escape XML text content instead of entire text

I want to send the XML request below. Text content should be escaped, but not the tags.

I tried the following escape logic:
String str = escapeXml11(req);

However, the whole request gets escaped, so it is no longer valid XML.

My original string:

String req =
"<request>\r\n" 
  + " <Products>\r\n" 
    + " <Product>\r\n" 
      + " <ProductName>H < M</ProductName>\r\n" 
      + " <quantity>1</quantity>\r\n" 
      + " <totalProductCost>17.03</totalProductCost>\r\n" 
    + " </Product>\r\n" 
  + " </Products>\r\n" 
+ "</request>"; 

After escaping:

&lt;request&gt;
    &lt;ProductName&gt;H &lt; M&lt;/ProductName&gt;
    &lt;quantity&gt;1&lt;/quantity&gt;
    &lt;totalProductCost&gt;17.03&lt;/totalProductCost&gt;
&lt;/request&gt;

Expected result:

<request>
    <ProductName>H &lt; M</ProductName>
    <quantity>1</quantity>
    <totalProductCost>17.03</totalProductCost>
</request>

How do I escape only the text content?

The root of this problem is that the "XML" the third party is providing to you is not well-formed:

<request>
  <Products>
    <Product>
      <ProductName>H < M</ProductName>
      <quantity>1</quantity>
      <totalProductCost>17.03</totalProductCost>
    </Product>
  </Products> 
</request>

To correct this, you would need to convert the "H < M" to "H &lt; M". That is easy for a human to do, although accuracy suffers if there is a lot of it to fix by hand. Automating it is difficult.

Obviously, simply calling an escape method won't work. An escape method can't determine what needs to be escaped without parsing the XML. (Methods like escapeXml11 only work if the entire string needs to be escaped.)
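If the request string is assembled in your own code, as in the question, one way around this is to escape each text value before it is placed between the tags. The following is a minimal sketch under that assumption; the class TextEscaper and its methods are hypothetical names, and the hand-written escapeText only handles &, < and > in element content.

```java
final class TextEscaper {

    // Escape the characters that are unsafe inside XML element content.
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;");  break;
                case '>': sb.append("&gt;");  break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build a (shortened) request: the tags are literal, only the value is escaped.
    static String buildRequest(String productName) {
        return "<request>\r\n"
             + " <ProductName>" + escapeText(productName) + "</ProductName>\r\n"
             + "</request>";
    }

    public static void main(String[] args) {
        System.out.println(buildRequest("H < M"));
    }
}
```

With Apache Commons Text on the classpath, calling escapeXml11 on the product name alone (rather than on the whole request) should serve the same purpose as escapeText here.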

A normal XML parser would see the "< M" and try to treat it as the start of an element tag. Then it would see the next "<" ... and report an error. To proceed further, it would have to backtrack to the "< M" and treat the "<" as if it were escaped.
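To see that a standard parser rejects this input, you can run it through the JDK's built-in DOM parser. This is a small sketch; the class name WellFormedCheck and the method isWellFormed are made up for illustration, and the parser may also print its own error message to stderr.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class WellFormedCheck {

    // Returns true if the JDK's XML parser accepts the string, false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised for malformed input such as an unescaped '<' in text content.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<ProductName>H < M</ProductName>"));
        System.out.println(isWellFormed("<ProductName>H &lt; M</ProductName>"));
    }
}
```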

I am aware of one HTML / XML parser (JSoup) that can deal with misplaced "<" characters. However, if I understand things correctly, it deals with this problem the wrong way for your use case. Instead of treating the "< M" as data, it would turn it into a start tag:

<request>
  <Products>
    <Product>
      <ProductName>H <M></ProductName>
      <quantity>1</quantity>
      <totalProductCost>17.03</totalProductCost>
    </Product>
  </Products> 
</request>

That leaves you with two alternatives:

  • You could try to detect and fix the problem with some pattern matching. For example, if you know that the malformed data is in <ProductName>...</ProductName> elements, then you could use a regex to search for these elements, check and (if necessary) correct the content, and replace it.

  • You could write a custom parser for your XML with a context-sensitive lexer. When the parser sees a <ProductName>, it switches the lexer into a different mode that treats "<" as data unless it is the start of </ProductName>.
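The pattern-matching alternative could look roughly like the sketch below. It assumes the malformed data only ever appears inside <ProductName> elements, that those elements contain plain text with no child elements, and that the text never contains the literal string "</ProductName>". The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProductNameRepair {

    // Non-greedy match of everything between the opening and closing ProductName tags.
    private static final Pattern PRODUCT_NAME =
            Pattern.compile("(<ProductName>)(.*?)(</ProductName>)", Pattern.DOTALL);

    // An '&' that does not already begin an entity or character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#\\d+|#x[0-9a-fA-F]+);)");

    // Escape the content, leaving already-escaped entities untouched.
    static String escapeContent(String text) {
        String s = BARE_AMP.matcher(text).replaceAll("&amp;");
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Rewrite only the text inside ProductName elements; everything else is copied as-is.
    static String repair(String xml) {
        Matcher m = PRODUCT_NAME.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixed = m.group(1) + escapeContent(m.group(2)) + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String bad = "<Product><ProductName>H < M</ProductName><quantity>1</quantity></Product>";
        System.out.println(repair(bad));
    }
}
```

Because already-escaped entities are skipped, running repair on input that is already well-formed leaves it unchanged. Every other element that can carry free text would need the same treatment, which is part of why this approach is fragile.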


But before you go to the time and expense of writing a bunch of custom code to deal with this invalid XML:

  • Complain to the 3rd-party that is creating it. They should not be emitting rubbish like that. Their software or their data collection / sanitization is flawed. They should fix it.

  • Make sure that whoever is paying your software development and maintenance bills gets to know about this. For example, if you were contracted to write some software that processes XML, this is not XML. If the customer didn't warn you that your software needed to cope with malformed XML, that is a change of requirements and could be (should be) a variation of the contract.

See also @Michael Kay's comment.

Here's what I found after searching everywhere looking for a solution:

Get the Jsoup library:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

Then:


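// Imports: java.io.ByteArrayInputStream, java.nio.charset.StandardCharsets,
// org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Entities, org.jsoup.parser.Parser
// Parse with jsoup's XML parser (parsing from a stream can throw IOException)
Document doc = Jsoup.parse(new ByteArrayInputStream(YOUR_XML_STRING_HERE.getBytes(StandardCharsets.UTF_8)), "UTF-8", "", Parser.xmlParser());
doc.outputSettings().charset("UTF-8");
doc.outputSettings().escapeMode(Entities.EscapeMode.base);

// Print the re-serialized document
System.out.println(doc.toString());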

Hope this helps someone
