
How to remove special characters from XML so that reading the file does not fail with “Invalid byte 1 of 1-byte UTF-8 sequence”

I am getting an error

Invalid byte 1 of 1-byte UTF-8 sequence

while reading an XML file in Java to generate an XSD.

Then I noticed that my XML contains some special characters such as '"”“?& and so on. I have managed to remove those in Java before processing the XML to generate the XSD. The challenge is that the data is dynamic, so we cannot know in advance which characters we will encounter.

How can we remove these special characters reliably, so that the content is valid UTF-8 and this problem never occurs?

Could this be solved in XSLT to remove the characters?

How can we get rid of these characters in the fragment below, or allow them without causing the error?

 <string>message</string>
                    <string>Very good dear laughing colours laken yeh heart bhot karap hota ha brain ke baat nahi sunte ha Allah bhagwan god Na yeh kuy banayai ha dear friends 😢 😢 😢❤👍</string>

<string>message</string>
                    <string>वक़्त 🕔 और  दोस्त_मिलते 👫 तो  मुफ्त_हैं, ☺
लेकिन  उनकी_कीमत 💵 का  अंदाज़ा 😌 तब  होता_है, ☝  जब ये कहीं  खो_जाते है ।...
#</string>

Note: I have the encoding set as UTF-8 for the XML document.

Your error sounds like your XML document contains a single-byte control character that's prohibited in XML. XML prohibits certain characters from appearing in a document; see the Char production at https://www.w3.org/TR/xml/#charsets for the list of allowed characters in XML 1.0.

You need to remove these characters before they reach the XML; otherwise your XML will be malformed, at which point it's expected that XSLT won't be able to transform your document.
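As a minimal sketch of that pre-cleaning step in Java, the helper below drops every code point that falls outside the XML 1.0 Char production. The class and method names (`XmlCharFilter`, `stripInvalidXmlChars`) are hypothetical, and it assumes the input has already been decoded into a Java `String` with the correct charset; it does not repair bytes that were decoded with the wrong encoding.

```java
final class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: returns the input without any code point
    // that XML 1.0 prohibits. Unpaired surrogates are dropped as well.
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(XmlCharFilter::isXml10Char)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Calling `XmlCharFilter.stripInvalidXmlChars(text)` on each string before it is written into the XML keeps emoji and Devanagari text such as the samples in the question, since those are legal XML characters, and removes only prohibited ones such as most C0 control characters.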

If you need to transform valid XML characters, XSLT can do that with the translate function. For example, translate(Windows-1252_string, "&#x84;&#x93;&#x94;", "&#x201e;&#x201c;&#x201d;") run on all text nodes should address Windows-1252-encoded quotation marks. Of course, it'd be better to ensure that this input is fixed before it reaches XML.
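If the fix is applied in Java before the data reaches XML, the same three-character mapping could be sketched as follows. The class and method names (`C1QuoteFixer`, `fixC1Quotes`) are hypothetical, and the sketch assumes the text contains the code points U+0084, U+0093 and U+0094 where Windows-1252 quotation marks were intended.

```java
final class C1QuoteFixer {

    // Hypothetical helper mirroring the translate() call above:
    // U+0084 -> U+201E, U+0093 -> U+201C, U+0094 -> U+201D.
    static String fixC1Quotes(String text) {
        return text.replace('\u0084', '\u201E')
                   .replace('\u0093', '\u201C')
                   .replace('\u0094', '\u201D');
    }
}
```

Only these three code points are rewritten; all other characters pass through unchanged.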


 