XML Serialization Deserialization HTML Entities C# .net

We have some XML files which we get as input (whose format is not under our control).

<?xml version="1.0" encoding="UTF-8"?>
<GroupFile..>
    <Group id="10" desc="Description">
        <Member id="117">&#x00B0;</Member>
    </Group>    
</GroupFile>

This file can contain HTML-entity-style numeric character references for symbols such as "°" (written as "&#x00B0;" in hex). The file is deserialized into Group and Member class objects. On deserialization, the Member element value is correctly read as "°" and displayed in a grid. When the same objects are serialized back into XML, the Member value is saved as "°" instead of "&#x00B0;".

Deserialization - Correct

<Member id="117">&#x00B0;</Member> deserializes into Member object with value °

Serialization - Issue here

The same Member object with value ° serializes into <Member id="117">°</Member> instead of <Member id="117">&#x00B0;</Member>

How can this be prevented, so that the value is serialized back as "&#x00B0;"?
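The behavior described above can be reproduced without the Group and Member classes. The following is a minimal sketch that assumes plain XmlReader/XmlWriter usage; the class and method names (EntityRoundTripDemo, ReadMemberValue, WriteMember) are hypothetical and chosen only for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper class, not part of the original question.
public static class EntityRoundTripDemo
{
    // Reads the text of a single <Member> element. The XML parser resolves
    // the character reference, so "&#x00B0;" comes back as "°".
    public static string ReadMemberValue(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    // Writes a <Member> element with WriteString. The writer only escapes
    // markup characters, so "°" is written as the literal character.
    public static string WriteMember(string value)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("Member");
            writer.WriteAttributeString("id", "117");
            writer.WriteString(value);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

Both forms are equivalent to an XML parser, so the original reference is lost on the way back out unless the text is escaped explicitly when writing.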

You must apply custom serialization/deserialization to do so.

Using HttpUtility.HtmlEncode/HtmlDecode alone is not sufficient, since it produces the decimal encoding (for example "&#176;" for "°"). I added the following (which could be improved in terms of error handling) to keep the hex-escaped characters in the XML serialization.

Update: To avoid the automatic escaping of special characters, you must implement custom XML serialization for the class (IXmlSerializable), as seen below, and write the value with WriteRaw.

If you use the XmlSerializer:

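using System;
using System.Web;                 // HttpUtility
using System.Xml;                 // XmlReader, XmlWriter
using System.Xml.Schema;          // XmlSchema
using System.Xml.Serialization;   // XmlSerializer attributes, IXmlSerializable

public class GroupFile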
{
    [XmlElement("Group")]
    public Group[] Groups { get; set; }
}

public class Group
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlElement("Member")]
    public Member[] Members { get; set; }
}

[Serializable]
public class Member : IXmlSerializable
{

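    // Converts every decimal character reference produced by HtmlEncode
    // (e.g. "&#176;") into its hex form ("&#x00B0;"). Text without such
    // references is returned unchanged instead of throwing.
    public static string DecimalToHexadecimalEncoding(string html)
    {
        return System.Text.RegularExpressions.Regex.Replace(
            html ?? string.Empty,
            @"&#(\d+);",
            m => "&#x" + Int32.Parse(m.Groups[1].Value).ToString("X4") + ";");
    }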

    [XmlAttribute("id")]
    public int Id { get; set; }       

    [XmlIgnore]
    public string Value { get; set; }

    [XmlText]
    public string HexValue
    {
        get
        {
            // convert to hex representation
            var res = HttpUtility.HtmlEncode(Value);
            res = DecimalToHexadecimalEncoding(res);
            return res;
        }
    }

    public XmlSchema GetSchema()
    {
        return null;
    }

    public void ReadXml(XmlReader reader)
    {
        var attributeValue = reader.GetAttribute("id");
        if (attributeValue != null)
        {
            Id = Int32.Parse(attributeValue);
        }
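        // Here the character reference is resolved, so Value becomes "°".
        // ReadElementString also consumes the closing </Member> tag,
        // so no additional ReadEndElement call is needed.
        Value = reader.ReadElementString();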
    }

    public void WriteXml(XmlWriter writer)
    {
        writer.WriteAttributeString("id", Id.ToString());
        writer.WriteRaw(HexValue);
    }
}
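As a variation on the helper above, the hex escaping can also be done without HttpUtility by walking the string directly. This is a minimal sketch, not the answer's original method: the names HexEntityEncoder and Encode are hypothetical, and it assumes that every non-ASCII character should be written as a hex reference padded to at least four digits.

```csharp
using System;
using System.Text;

// Hypothetical helper, shown only as an illustration.
public static class HexEntityEncoder
{
    // Escapes XML markup characters and writes every non-ASCII character
    // as a hex character reference such as "&#x00B0;".
    public static string Encode(string value)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;

        var sb = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            switch (c)
            {
                // Markup characters must be escaped because the result
                // is intended for XmlWriter.WriteRaw, which escapes nothing.
                case '&': sb.Append("&amp;"); continue;
                case '<': sb.Append("&lt;"); continue;
                case '>': sb.Append("&gt;"); continue;
            }

            if (c < 128)
            {
                sb.Append(c); // plain ASCII is copied as-is
                continue;
            }

            int codePoint = c;
            // Combine a surrogate pair into a single code point.
            if (char.IsHighSurrogate(c) && i + 1 < value.Length
                && char.IsLowSurrogate(value[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, value[i + 1]);
                i++;
            }
            sb.Append("&#x").Append(codePoint.ToString("X4")).Append(';');
        }
        return sb.ToString();
    }
}
```

In WriteXml, this could be used as writer.WriteRaw(HexEntityEncoder.Encode(Value)). Because WriteRaw performs no escaping of its own, the helper has to escape "&", "<" and ">" itself, which is why those cases are handled explicitly.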

You can use HSharp to deserialize HTML. HSharp is a library for analysing markup languages such as HTML quickly and easily. Install it with: Install-Package Obisoft.HSharp

var NewDocument = HtmlConvert.DeserializeHtml($@"
<html>
<head>
    <meta charset={"\"utf-8\""}>
    <meta name={"\"viewport\""}>
    <title>Example</title>
</head>
<body>
<h1>Some Text</h1>
<table>
    <tr>OneLine</tr>
    <tr>TwoLine</tr>
    <tr>ThreeLine</tr>
</table>
</body>
</html>");

Console.WriteLine(NewDocument["html"]["head"]["meta",0].Properties["charset"]);
Console.WriteLine(NewDocument["html"]["head"]["meta",1].Properties["name"]);
foreach (var Line in NewDocument["html"]["body"]["table"])
{
    Console.WriteLine(Line.Son);
}

That will output:

utf-8
viewport
OneLine
TwoLine
ThreeLine

You can also use foreach to iterate over the tags in the HTML.
