
How to prevent illegal characters from appearing in my XML when retrieving it from SQL Server

Sometimes the string values of Properties in my Classes become odd. They contain illegal characters and are displayed like this (with boxes):

123[]45[]6789

I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via Web Service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?

Is there any way I can check whether the string contains any of these unrecognized characters, via regex or something?

Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.

Verify that the XML document itself is stored using an encoding that supports the characters you need to store. Then verify that when you read the file in, you use the same encoding as the document; i.e. if your XML document is stored as UTF-8, make sure you decode it as UTF-8 when you read it.
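To illustrate the mismatch, here is a minimal sketch. The sample string is made up for the example, and ISO-8859-1 merely stands in for "some other encoding"; which encodings are actually involved in the asker's setup is unknown.

```csharp
using System;
using System.Text;

// Hypothetical sample value: U+2014 is an em dash, as Word might insert.
string original = "123\u201445";

// Bytes as they would be stored if the writer used UTF-8.
byte[] stored = Encoding.UTF8.GetBytes(original);

// Decoding with the same encoding gives back the original text.
string sameEncoding = Encoding.UTF8.GetString(stored);

// Decoding with a different encoding turns the three UTF-8 bytes of the dash
// into three unrelated characters, two of which are unprintable controls.
string wrongEncoding = Encoding.GetEncoding("iso-8859-1").GetString(stored);

Console.WriteLine(sameEncoding == original);   // True
Console.WriteLine(wrongEncoding == original);  // False
Console.WriteLine(wrongEncoding.Length);       // 8 (the original has 6 characters)
```

Unprintable characters like these are typically drawn as boxes, which would match the symptom in the question.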

The first thing to remember is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, and there are non-characters, but there are no generally "special characters" or "illegal characters".

What you have here is either:

  1. Perfectly normal characters for which your font doesn't have a glyph.
  2. Perfectly normal characters that aren't printable (eg control characters).
  3. An artefact of how the debugger works.

The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
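A small sketch of how you might dump those integer values. The helper name `Describe` and the sample string are hypothetical; note that it lists UTF-16 code units, so a character outside the Basic Multilingual Plane would show up as two surrogate values.

```csharp
using System;
using System.Linq;

// Returns one "U+XXXX" label per UTF-16 code unit in the string.
static string[] Describe(string s) => s.Select(c => $"U+{(int)c:X4}").ToArray();

// Hypothetical sample: U+0096 is a control character that most fonts draw as a box.
string suspect = "12\u00963";

Console.WriteLine(string.Join(" ", Describe(suspect)));
// U+0031 U+0032 U+0096 U+0033
```

With the values in hand you can look each one up in the Unicode code charts.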

An important one to look out for is U+FFFD (�), as it is sometimes used when a decoder has received a bunch of bytes that make no sense in the context of the encoding it is trying to use. For example, 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker. Other possible responses are throwing an error, silently ignoring the error, or trying to guess at intent, though those last two bring security issues.
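In .NET you can see two of those responses side by side. This is only a sketch using the byte pair from the example above; the variable names are made up.

```csharp
using System;
using System.Text;

byte[] bad = { 0x80, 0x20 }; // a lone continuation byte followed by a space

// Encoding.UTF8 substitutes U+FFFD for bytes it cannot decode.
string lenient = Encoding.UTF8.GetString(bad);
Console.WriteLine(lenient.IndexOf('\uFFFD') == 0); // True

// A UTF8Encoding built with throwOnInvalidBytes: true raises an error instead.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
bool threw = false;
try
{
    strict.GetString(bad);
}
catch (DecoderFallbackException)
{
    threw = true;
}
Console.WriteLine(threw); // True
```

Decoding with the strict variant while debugging can help pin down where the bad bytes first enter, since it fails at the point of decoding instead of passing a replacement character along.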

Once you've figured this out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an encoding issue (the charset written in is not the charset read in)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.

Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing, it's fine"; possibly something simple or something hard. Can't say yet.

Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so you may be making a deeper bug harder to find than it is now, or damaging perfectly good data.

Define the allowed characters and block everything else, i.e.:

// only lowercase letters and digits
if(Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
    // allowed
}

But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and don't apply a proper ISerializable implementation to your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want serialized.

PS: others have mentioned encoding issues, which are a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, XML) and be specific. If you are not, the default encoding is used, which can differ from system to system.


Edit: possible solution

Based on new information (see thread under original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with pretty dashes like "—" (&mdash;) when typed in some fancy editing environment. Since it seems unclear how to make SQL Server accept properly encoded strings, you can also solve this in your XML.

When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numeric character references. When you deserialize, these will be properly parsed into your strings without further ado. Something along these lines:

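using System.IO;
using System.Text;
using System.Xml;

Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII; // characters above 0x7F are written as numeric character references
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header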

But beware of writing to a StringBuilder or StringWriter: these are fixed to UTF-16, so the XmlWriter will always write in that encoding (more info on that issue at my blog), which is not compatible with SQL Server.
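A quick way to see the difference is to write the same element to a StringWriter and to a MemoryStream with identical settings. This is a sketch only; the element name and sample value are made up.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };

// Target 1: a StringWriter. The requested ASCII encoding is not honoured,
// because the writer reports the StringWriter's own encoding (UTF-16).
var text = new StringWriter();
using (XmlWriter writer = XmlWriter.Create(text, settings))
{
    writer.WriteElementString("value", "a\u2014b");
}
string viaStringWriter = text.ToString();

// Target 2: a stream. Here settings.Encoding is applied.
var stream = new MemoryStream();
using (XmlWriter writer = XmlWriter.Create(stream, settings))
{
    writer.WriteElementString("value", "a\u2014b");
}
string viaStream = Encoding.ASCII.GetString(stream.ToArray());

Console.WriteLine(viaStringWriter.Contains("encoding=\"utf-16\"")); // True
Console.WriteLine(viaStream.Contains("encoding=\"us-ascii\""));     // True
```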

Note: when using the ASCII encoding, any character above 0x7F will be written as a numeric character reference. So é will look like &#xE9; and the dash may look like &#x2014;, but this means exactly the same thing and you should not worry about it. Every XML-capable tool will interpret this input properly.

Note 2: the place to change the way XML is written is the Web Service you mention, which receives the XML and then stores it in the SQL Server database. The change must be applied before storing into SQL Server. Applying it earlier in the chain is useless.
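Here is a round-trip sketch of that behaviour. The element name `value` and the sample string are made up, and it writes to a MemoryStream purely for demonstration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

string value = "123\u201445"; // hypothetical sample containing an em dash

// Write the value as ASCII-encoded XML.
var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
var stream = new MemoryStream();
using (XmlWriter writer = XmlWriter.Create(stream, settings))
{
    writer.WriteElementString("value", value);
}
byte[] xmlBytes = stream.ToArray();
string xmlText = Encoding.ASCII.GetString(xmlBytes);
Console.WriteLine(xmlText); // the dash appears as a character reference, not as a raw character

// Read it back: the parser resolves the reference to the original character.
string roundTripped;
using (XmlReader reader = XmlReader.Create(new MemoryStream(xmlBytes)))
{
    reader.MoveToContent();
    roundTripped = reader.ReadElementContentAsString();
}
Console.WriteLine(roundTripped == value); // True
```

Because the serialized bytes are pure ASCII, they should survive any intermediate step that would otherwise mangle non-ASCII characters.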

Take a deeper look at the characters themselves: what are the actual char values?

When a character shows up as a square, it means it can't be represented visually. This is either because it's a non-visual character or because it's outside of your current character set.

edit, nope

In your example I'd venture a guess that you're seeing embedded newline characters.

using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static T DeserializeFromXml<T>(string xml)
{
    T result;
    XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
    XmlSerializer serializer = serializerFactory.CreateSerializer(typeof(T));

    using (StringReader sr3 = new StringReader(xml))
    {
        // CheckCharacters defaults to true; turning it off stops the reader
        // from rejecting characters outside the legal XML character range.
        XmlReaderSettings settings = new XmlReaderSettings()
        {
            CheckCharacters = false
        };

        using (XmlReader xr3 = XmlReader.Create(sr3, settings))
        {
            result = (T)serializer.Deserialize(xr3);
        }
    }

    return result;
}


 