Removing Invalid Characters From XML File Before Deserialization

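I receive some XML from a server that sometimes contains invalid characters, which I would like to remove before deserialization. I have no control over the XML I receive, so I need to check for the invalid characters myself.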

Sample XML:

<PrintStatus>N</PrintStatus>
<CustomerPO> >>>> pearl <<<<< </CustomerPO>
<Description>PO# pearl</Description>
<BranchID>4</BranchID>
<PostDate>
   <Date>01/13/2015</Date>
</PostDate>
<ShipDate>
   <Date>01/13/2015</Date>
</ShipDate>

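As you can see, the CustomerPO element contains invalid characters that need to be removed. This happens only occasionally, in certain elements that contain user-typed data.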

Here is my code for handling the response:

//configure http request
HttpWebRequest httpRequest = WebRequest.Create(url) as HttpWebRequest;
httpRequest.Method = "POST";

//prepare correct encoding for XML serialization
UTF8Encoding encoding = new UTF8Encoding();

//use Xml property to obtain serialized XML data
//convert into bytes using encoding specified above and get length
byte[] bodyBytes = encoding.GetBytes(Xml);
httpRequest.ContentLength = bodyBytes.Length;

//get http request stream for putting XML data into
Stream httpRequestBodyStream = httpRequest.GetRequestStream();

//fill stream with serialized XML data
httpRequestBodyStream.Write(bodyBytes, 0, bodyBytes.Length);
httpRequestBodyStream.Close();

//get http response
HttpWebResponse httpResponse = httpRequest.GetResponse() as HttpWebResponse;
StreamReader httpResponseStream = new StreamReader(httpResponse.GetResponseStream(), System.Text.Encoding.ASCII);

//extract XML from response
string httpResponseBody = httpResponseStream.ReadToEnd();
httpResponseStream.Close();

//ignore everything that isn't XML by removing headers
httpResponseBody = httpResponseBody.Substring(httpResponseBody.IndexOf("<?xml"));

//deserialize XML into MyResponseClass
XmlSerializer serializer = new XmlSerializer(typeof(MyResponseClass));
StringReader responseReader = new StringReader(httpResponseBody);

//return MyResponseClass result
return serializer.Deserialize(responseReader) as MyResponseClass;

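Does anyone have a suggestion for checking the XML? Should I just check the elements I am concerned about before the XML string is deserialized? Or is there a better way?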

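The general way to solve this is to recursively descend the XML, parsing as you go and comparing each node against the schema for that node. At any point, if the input differs from what the schema expects or is malformed, let an error handler run to fix the input stream, roll back to the most recent good state, and then continue with the fixed input.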
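
As a rough illustration of the first half of that idea (finding where the input stops being well-formed), the sketch below reads a string with `XmlReader` and returns the `XmlException` raised at the first problem, if any. The class and method names (`WellFormednessProbe`, `FindFirstError`) are made up for this example; it only detects the error and does not attempt any repair or rollback.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the answer's code.
public static class WellFormednessProbe
{
    // Returns null when the XML parses cleanly; otherwise returns the
    // XmlException, whose LineNumber and LinePosition properties report
    // where the reader gave up.
    public static XmlException FindFirstError(string xml)
    {
        try
        {
            using (var textReader = new StringReader(xml))
            using (var reader = XmlReader.Create(textReader))
            {
                while (reader.Read())
                {
                    // Nothing to do: we only care whether Read() throws.
                }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex;
        }
    }
}
```

Calling `WellFormednessProbe.FindFirstError("<CustomerPO> >>>> pearl <<<<< </CustomerPO>")` should return a non-null exception. Note that a bare `>` is tolerated in element text; it is the unescaped `<` characters that make the reader throw.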

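The .Net XmlTextReader class is not flexible enough to do this. However, if you know in advance from the schema that certain XML elements cannot have child elements, then the following will read the XML input stream and, upon encountering an element whose fully qualified name matches a known leaf-node name, "escape" all the text of such nodes: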

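using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public enum XmlDoctorStatus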
{
    NoFixNeeded,
    FixMade,
    FixFailed
}

public class XmlDoctor
{
    internal class XmlFixData
    {
        public string InitialXml { get; private set; }
        public string FixedXml { get; private set; }
        public int LineNumber { get; private set; }
        public int LinePosition { get; private set; }

        public XmlFixData(string initialXml, string fixedXml, int lineNumber, int linePosition)
        {
            this.InitialXml = initialXml;
            this.FixedXml = fixedXml;
            this.LineNumber = lineNumber;
            this.LinePosition = linePosition;
        }

        public bool ComesAfter(XmlFixData other)
        {
            if (LineNumber > other.LineNumber)
                return true;
            if (LineNumber == other.LineNumber && LinePosition > other.LinePosition)
                return true;
            return false;
        }
    }

    internal class XmlFixedException : Exception
    {
        public XmlFixData XmlFixData { get; private set; }

        public XmlFixedException(XmlFixData data)
        {
            this.XmlFixData = data;
        }
    }

    readonly HashSet<XName> childlessNodes;

    public string OriginalXml { get; private set; }

    public XmlDoctor(string xml, IEnumerable<XName> childlessNodes)
    {
        if (xml == null)
            throw new ArgumentNullException("xml");
        this.OriginalXml = xml;
        this.childlessNodes = new HashSet<XName>(childlessNodes);
    }

    List<int> indices = null;
    string passXml = string.Empty;
    bool inPass = false;

    void InitializePass(string xml)
    {
        if (inPass)
            throw new Exception("nested pass");
        ClearElementData();
        TextHelper.NormalizeLines(xml, out passXml, out indices);
        inPass = true;
    }

    void EndPass()
    {
        inPass = false;
        indices = null;
        passXml = string.Empty;
        ClearElementData();
    }

    static int LineNumber(XmlReader reader)
    {
        return ((IXmlLineInfo)reader).LineNumber;
    }

    static int LinePosition(XmlReader reader)
    {
        return ((IXmlLineInfo)reader).LinePosition;
    }

    // Taken from https://stackoverflow.com/questions/1132494/string-escape-into-xml

    public static string XmlEscape(string escaped)
    {
        var replacements = new KeyValuePair<string, string>[]
        {
            new KeyValuePair<string,string>("&", "&amp;"),
            new KeyValuePair<string,string>("\"", "&quot;"),
            new KeyValuePair<string,string>("'", "&apos;"),
            new KeyValuePair<string,string>("<", "&lt;"),
            new KeyValuePair<string,string>(">", "&gt;"),
        };

        foreach (var pair in replacements)
            foreach (var index in escaped.IndexesOf(pair.Key, 0).Reverse())
                if (!replacements.Any(other => string.Compare(other.Value, 0, escaped, index, other.Value.Length, StringComparison.Ordinal) == 0))
                {
                    escaped = escaped.Substring(0, index) + pair.Value + escaped.Substring(index + 1, escaped.Length - index - 1);
                }
        return escaped;
    }

    void HandleNode(XmlReader reader)
    {
        // Adapted from http://blogs.msdn.com/b/mfussell/archive/2005/02/12/371546.aspx
        if (reader == null)
        {
            throw new ArgumentNullException("reader");
        }

        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                HandleStartElement(reader);
                if (reader.IsEmptyElement)
                {
                    HandleEndElement(reader);
                }
                break;
            case XmlNodeType.Text:
                HandleText(reader);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                break;
            case XmlNodeType.CDATA:
                break;
            case XmlNodeType.EntityReference:
                break;
            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
                break;
            case XmlNodeType.DocumentType:
                break;
            case XmlNodeType.Comment:
                break;
            case XmlNodeType.EndElement:
                HandleEndElement(reader);
                break;
        }
    }

    private void HandleText(XmlReader reader)
    {
        if (string.IsNullOrEmpty(currentElementLocalName) || string.IsNullOrEmpty(currentElementName))
            return;
        var name = XName.Get(currentElementLocalName, currentElementNameSpace);
        if (!childlessNodes.Contains(name))
            return;
        var lineIndex = LineNumber(reader) - 1;
        var charIndex = LinePosition(reader) - 1;
        if (lineIndex < 0 || charIndex < 0)
            return;

        int startIndex = indices[lineIndex] + charIndex;

        // Scan forward in the input string until we find either the beginning of a CDATA section or the end of this element.
        // Patterns to match:  </Name
        // 
        string pattern1 = "</" + currentElementName;
        var index1 = FindElementEnd(passXml, startIndex, pattern1);
        if (index1 < 0)
            return;  // BAD XML.
        string pattern2 = "<![CDATA[";
        var index2 = passXml.IndexOf(pattern2, startIndex);
        int endIndex = (index2 < 0 ? index1 : Math.Min(index1, index2));
        var text = passXml.Substring(startIndex, endIndex - startIndex);

        var escapeText = XmlEscape(text);
        if (escapeText != text)
        {
            if (escapeText != XmlEscape(escapeText))
            {
                Debug.Assert(escapeText == XmlEscape(escapeText));
                throw new InvalidOperationException("Escaping error");
            }
            string fixedXml = passXml.Substring(0, startIndex) + escapeText + passXml.Substring(endIndex, passXml.Length - endIndex);
            throw new XmlFixedException(new XmlFixData(passXml, fixedXml, lineIndex + 1, charIndex + 1));
        }
    }

    static bool IsXmlSpace(char ch)
    {
        // http://www.w3.org/TR/2000/REC-xml-20001006#NT-S
        // [3]      S      ::=      (#x20 | #x9 | #xD | #xA)+
        return ch == '\u0020' || ch == '\u0009' || ch == '\u000D' || ch == '\u000A';
    }

    private static int FindElementEnd(string passXml, int charPos, string tagEnd)
    {
        while (true)
        {
            var index = passXml.IndexOf(tagEnd, charPos);
            if (index < 0)
                return index;
            int endPos = index + tagEnd.Length;
            if (index + tagEnd.Length >= passXml.Length)
                return -1; // Bad xml?
            // Now we must have zero or more white space characters and a ">"
            while (endPos < passXml.Length && IsXmlSpace(passXml[endPos]))
                endPos++;
            if (endPos >= passXml.Length)
                return -1; // BAD XML;
            if (passXml[endPos] == '>')
                return index;
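            // Spurious ending, keep searching from just past it.
            // (Advance charPos, not index, or the loop would find the same match forever.)
            charPos = endPos;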
        }
    }

    string currentElementName = string.Empty;
    string currentElementNameSpace = string.Empty;
    string currentElementLocalName = string.Empty;

    private void HandleStartElement(XmlReader reader)
    {
        currentElementName = reader.Name;
        currentElementLocalName = reader.LocalName;
        currentElementNameSpace = reader.NamespaceURI;
    }

    private void HandleEndElement(XmlReader reader)
    {
        ClearElementData();
    }

    private void ClearElementData()
    {
        currentElementName = string.Empty;
        currentElementNameSpace = string.Empty;
        currentElementLocalName = string.Empty;
    }

    public XmlDoctorStatus TryFix(out string newXml)
    {
        XmlFixData data = null;

        while (true)
        {
            XmlFixData newData;
            var status = TryFixOnePass((data == null ? OriginalXml : data.FixedXml), out newData);
            switch (status)
            {
                case XmlDoctorStatus.FixFailed:
                    Debug.WriteLine("Could not fix XML");
                    newXml = OriginalXml;
                    return XmlDoctorStatus.FixFailed;

                case XmlDoctorStatus.FixMade:
                    if (data != null && !newData.ComesAfter(data))
                    {
                        Debug.WriteLine("Warning -- possible infinite loop detected, aborting fix");
                        newXml = OriginalXml;
                        return XmlDoctorStatus.FixFailed;
                    }
                    data = newData;
                    break;  // Try to fix more

                case XmlDoctorStatus.NoFixNeeded:
                    if (data == null)
                    {
                        newXml = OriginalXml;
                        return XmlDoctorStatus.NoFixNeeded;
                    }
                    else
                    {
                        newXml = data.FixedXml;
                        return XmlDoctorStatus.FixMade;
                    }
            }
        }
    }

    XmlDoctorStatus TryFixOnePass(string xml, out XmlFixData data)
    {
        try
        {
            InitializePass(xml);

            using (var textReader = new StringReader(passXml))
            using (XmlReader reader = XmlReader.Create(textReader))
            {
                while (true)
                {
                    bool read = reader.Read();
                    if (!read)
                        break;
                    HandleNode(reader);
                }
            }
        }
        catch (XmlFixedException ex)
        {
            // Success - a fix was made.
            data = ex.XmlFixData;
            return XmlDoctorStatus.FixMade;
        }
        catch (Exception ex)
        {
            // Failure - the file was not fixed and could not be parsed.
            Debug.WriteLine("Fix Failed: " + ex.ToString());
            data = null;
            return XmlDoctorStatus.FixFailed;
        }
        finally
        {
            EndPass();
        }
        // No fix needed.
        data = null;
        return XmlDoctorStatus.NoFixNeeded;
    }
}

public static class TextHelper
{
    public static void NormalizeLines(string text, out string newText, out List<int> lineIndices)
    {
        var sb = new StringBuilder();
        var indices = new List<int>();

        using (var sr = new StringReader(text))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                indices.Add(sb.Length);
                sb.AppendLine(line);
            }
        }

        lineIndices = indices;
        newText = sb.ToString();
    }

    public static IEnumerable<int> IndexesOf(this string str, string value, int startAt)
    {
        if (str == null)
            yield break;
        for (int index = startAt, valueLength = value.Length; ; index += valueLength)
        {
            index = str.IndexOf(value, index);
            if (index == -1)
                break;
            yield return index;
        }
    }
}

And then use it like this:

public static class TestXmlDoctor
{
    public static void TestFix()
    {
        string xml1 = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<MainClass>
<PrintStatus>N</PrintStatus>
<CustomerPO> >>>> pearl <<<<< </CustomerPO>
<Description>PO# pearl</Description>
<BranchID>4</BranchID>
<PostDate>
   <Date>01/13/2015</Date>
</PostDate>
<ShipDate>
   <Date>01/13/2015</Date>
</ShipDate>
</MainClass>
";
        XName[] childlessNodes1 = new XName[] 
        {
            XName.Get("CustomerPO", string.Empty),
        };
        try
        {
            TestFix(xml1, childlessNodes1);
        }
        catch (Exception ex)
        {
            Debug.WriteLine(ex);
        }
    }

    public static string TestFix(string xml, IEnumerable<XName> childlessNodes)
    {
        string fixedXml;
        var status = (new XmlDoctor(xml, childlessNodes).TryFix(out fixedXml));
        switch (status)
        {
            case XmlDoctorStatus.NoFixNeeded:
                return xml;
            case XmlDoctorStatus.FixFailed:
                Debug.WriteLine("Failed to fix xml");
                return xml;
            case XmlDoctorStatus.FixMade:
                Debug.WriteLine("Fixed XML, new XML is as follows:");
                Debug.WriteLine(fixedXml);
                Debug.WriteLine(string.Empty);
                return fixedXml;
            default:
                Debug.Assert(false, "Unknown fix status " + status.ToString());
                return xml;
        }
    }
}

With this, your XML fragment can be parsed, and becomes:

<?xml version="1.0" encoding="UTF-8"?>
<MainClass>
<PrintStatus>N</PrintStatus>
<CustomerPO> &gt;&gt;&gt;&gt; pearl &lt;&lt;&lt;&lt;&lt; </CustomerPO>
<Description>PO# pearl</Description>
<BranchID>4</BranchID>
<PostDate>
   <Date>01/13/2015</Date>
</PostDate>
<ShipDate>
   <Date>01/13/2015</Date>
</ShipDate>
</MainClass>
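
To connect this back to the original goal, here is a minimal sketch of deserializing the fixed XML with `XmlSerializer`. The `MainClass` type and the `FixedXmlDemo` helper are hypothetical stand-ins for the real response class, which the question does not show. Only two elements are mapped, and the XML is a shortened copy of the output above.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for the real response type. Elements that are not
// mapped to a property (PrintStatus, BranchID) are skipped by XmlSerializer.
public class MainClass
{
    public string CustomerPO { get; set; }
    public string Description { get; set; }
}

public static class FixedXmlDemo
{
    // Shortened copy of the fixed XML shown above.
    public const string FixedXml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<MainClass>
<PrintStatus>N</PrintStatus>
<CustomerPO> &gt;&gt;&gt;&gt; pearl &lt;&lt;&lt;&lt;&lt; </CustomerPO>
<Description>PO# pearl</Description>
<BranchID>4</BranchID>
</MainClass>";

    public static MainClass Deserialize(string xml)
    {
        var serializer = new XmlSerializer(typeof(MainClass));
        using (var reader = new StringReader(xml))
        {
            // The serializer decodes &gt; and &lt; back into '>' and '<'.
            return (MainClass)serializer.Deserialize(reader);
        }
    }
}
```

After `FixedXmlDemo.Deserialize(FixedXmlDemo.FixedXml)`, the `CustomerPO` property should again hold the user's original text, angle brackets included, because the entity references are decoded during parsing.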
