Java XML parser error on copy/paste from Word: invalid character Unicode 0x1A

Question

Sorry to double post, but my earlier post was based on Flex:

Flex TextArea - copy/paste from Word - Invalid unicode characters on xml parsing

Now I'm posting this from the Java side.

The issue is:

We have an email feature (part of our application) where we create an XML string and put it on a queue. Another application picks it up, parses the XML, and sends out the emails.
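As a rough sketch of the producer side described above, the message could be assembled with the JDK's StAX `XMLStreamWriter` instead of string concatenation, so that markup characters in attribute values are escaped by the writer. The class name `EmailXmlBuilder` and its `build` method are hypothetical; the element and attribute names are taken from the sample message in this post. This only addresses escaping: it is an assumption here that the writer will not drop or reject a control character such as U+001A, so that still needs separate handling.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper for illustration, not the application's actual code.
class EmailXmlBuilder {

    /** Builds a one-message document and lets the writer escape attribute values. */
    static String build(String source, String dest, String subject, String body) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("EMAILS");
            xml.writeEmptyElement("EMAIL");
            xml.writeAttribute("SOURCE", source);
            xml.writeAttribute("DEST", dest);
            xml.writeAttribute("SUBJECT", subject);
            xml.writeAttribute("BODY", body);
            xml.writeEndElement(); // closes EMAILS
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not build email XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample values are made up; '&' and '<' come out as entity references.
        System.out.println(build("t@t.com", "t@t.com", "test 61", "Profit & loss: a < b"));
    }
}
```

Note that XML parsers normalize literal tabs and line breaks inside attribute values to spaces, so a body whose line breaks matter may be better carried as element content than as an attribute.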

We get an XML parser exception when the email text (<BODY>....</BODY>) is copy/pasted from Word:

Invalid character in attribute value BODY (Unicode: 0x1A)

Since we use Java as well, I'm trying to remove the invalid characters from the String using:

body = body.replaceAll("‘", "");
body = body.replaceAll("’", "");

// Strip invalid characters

public String stripNonValidXMLCharacters(String in) {
        if (in == null || in.isEmpty()) {
            return ""; // Nothing to strip.
        }
        StringBuilder out = new StringBuilder(in.length()); // Holds the output.
        // Walk the string by code point rather than by char: a char never exceeds
        // 0xFFFF, so the supplementary range can only be tested on whole code points.
        for (int i = 0; i < in.length(); ) {
            int current = in.codePointAt(i);
            i += Character.charCount(current);
            // Keep only what the XML 1.0 Char production allows.
            if ((current == 0x9)
                    || (current == 0xA)
                    || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    || ((current >= 0x10000) && (current <= 0x10FFFF))) {
                out.appendCodePoint(current);
            }
        }
        return out.toString();
    }

// Strip once more

private String stripNonValidXMLCharacter(String in) {
        if (in == null || in.isEmpty()) {
            return in; // Nothing to replace; avoid turning "" into null.
        }
        // Replace every SUB character (U+001A) with a hyphen.
        return in.replace((char) 0x1a, '-');
    }

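// Replace the special characters, if any

emailText = emailText.replaceAll("[\\u0000-\\u0008\\u000B\\u000C"
                        + "\\u000E-\\u001F"
                        + "\\uD800-\\uDFFF\\uFFFE\\uFFFF\\u00C5\\u00D4\\u00EC"
                        + "\\u00A8\\u00F4\\u00B4\\u00CC\\u2211]", " ");
// Note: [\x00-\x1F] also removes tab, line feed, and carriage return.
emailText = emailText.replaceAll("[\\x00-\\x1F]", "");
// Redundant after the previous call, which already covers this range.
emailText = emailText.replaceAll(
                        "[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");
// \p{C} matches Unicode category "Other": control, format, surrogate,
// private-use, and unassigned code points.
emailText = emailText.replaceAll("\\p{C}", "");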

But they still do not work. Also, the XML string starts with:

<?xml version="1.0" encoding="UTF-8"?>
<EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd">

I think the issue occurs when there are multiple tabs in the Word document. For example:

Text......text
<newLine>
<tab><tab><tab> text...text
<newLine>

The resulting XML string is:

<?xml version="1.0" encoding="UTF-8"?> <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd"> <EMAIL SOURCE="t@t.com" DEST="t@t.com" CC="" BCC="t@t.com" SUBJECT="test 61" BODY="As such there was no mechanism constructed to migrate the enrollment user base to Data Collection or to keep security attributes for common users in sync between the two systems.  The purpose of this document is to outline two strategies for bring the user base between the two applications into sync.?  It still is the same.  ** Please note: This e-mail message was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message."/> </EMAILS>

Please note that the "?" marks the spot where the Word document has multiple tabs. I hope my question is clear and that someone can help resolve the issue.
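One way to check what is actually sitting where the "?" shows up is to list every code point that falls outside the XML 1.0 Char production, together with its index. The sketch below is illustrative only: `XmlCharReport`, `isXml10Char`, and `findInvalid` are made-up names, and the sample text in `main` is invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of the application.
class XmlCharReport {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Lists every disallowed code point as "index=U+XXXX". */
    static List<String> findInvalid(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                hits.add(String.format("%d=U+%04X", i, cp));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample: the three tabs are legal, the SUB character is not.
        String body = "into sync." + (char) 0x1A + "\t\t\tIt still is the same.";
        System.out.println(findInvalid(body)); // prints [10=U+001A]
    }
}
```

Running this on the string just before it is placed on the queue, and again on the consumer side, would show whether the character survives the stripping code or is introduced later.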

Thanks

Answer 1

Have you tried cleaning up the XML with a library such as TagSoup, JSoup, or JTidy?
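If adding a cleanup library is not an option, a dependency-free sketch is to drop everything outside the XML 1.0 Char production with one regex, then confirm with the JDK parser that the result is well-formed. `XmlSanitizer`, `sanitize`, and `isWellFormed` are hypothetical names, and the document in `main` is a cut-down stand-in for the real message. Unlike a blanket `[\x00-\x1F]` class, this pattern keeps tab, line feed, and carriage return.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class XmlSanitizer {

    // Matches any code point that the XML 1.0 Char production does not allow.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Removes characters that an XML 1.0 parser would reject. */
    static String sanitize(String text) {
        return text == null ? "" : INVALID_XML10.matcher(text).replaceAll("");
    }

    /** Returns true if the JDK's DOM parser accepts the document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false; // parse error (or parser setup failure)
        }
    }

    public static void main(String[] args) {
        String body = "sync." + (char) 0x1A + "\t\t\tIt still is the same.";
        String dirty = "<EMAILS><EMAIL BODY=\"" + body + "\"/></EMAILS>";
        String clean = "<EMAILS><EMAIL BODY=\"" + sanitize(body) + "\"/></EMAILS>";
        System.out.println(isWellFormed(dirty)); // false: U+001A is rejected
        System.out.println(isWellFormed(clean)); // true
    }
}
```

The default parser may also print a fatal-error line to standard error for the rejected document; that is only diagnostic output.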

Answer 2

The invalid (hidden) character came from the UI (the Flex TextArea), so it had to be handled in the UI to keep it from being passed on to Java. I handled and removed it using the changingHandler on the Flex TextArea to restrict the characters.

Answer 1: score 0, posted 2012-10-22 15:29:15

Answer 2: score 0, accepted, posted 2012-11-01 17:34:26
