We have some XML files which we get as input (whose format is not under our control).
<?xml version="1.0" encoding="UTF-8"?>
<GroupFile..>
<Group id="10" desc="Description">
<Member id="117">&#x00B0;</Member>
</Group>
</GroupFile>
This file can contain HTML entity representations of symbols such as "°" (written as "&#x00B0;" in hex). The file is deserialized into Group and Member class objects. During XML deserialization, the Member element value is correctly read as "°" and displayed in a grid. When the same objects are serialized back into XML, however, the Member value is saved as "°" instead of "&#x00B0;".
Deserialization - Correct
<Member id="117">&#x00B0;</Member>
deserializes into a Member object with value °
Serialization - Issue here
The same Member object with value ° serializes into <Member id="117">°</Member>
instead of <Member id="117">&#x00B0;</Member>
How can this be prevented, so that the value is serialized back as "&#x00B0;"?
To do so, you need to apply custom serialization/deserialization.
Using HttpUtility.HtmlEncode/HtmlDecode alone is not sufficient, since it produces the decimal encoding (for example "&#176;" for "°"). I added the following (it could be improved in terms of error handling) to keep the hex-escaped characters in the XML serialization.
Update: To avoid the automatic escaping of special characters, you must write a custom XML serializer for the class, as seen below, and use WriteRaw.
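The reason WriteRaw is needed can be shown in isolation. The following is a minimal sketch, not part of the original answer: the class name RawVersusEscaped and its Write helper are made up for illustration. It writes the same text once with WriteString, which escapes the ampersand, and once with WriteRaw, which passes the text through unchanged.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical demo class, used only to compare WriteString and WriteRaw.
public static class RawVersusEscaped
{
    // Writes <Member id="117">text</Member> and returns the resulting XML.
    public static string Write(string text, bool raw)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var buffer = new StringWriter())
        {
            using (var writer = XmlWriter.Create(buffer, settings))
            {
                writer.WriteStartElement("Member");
                writer.WriteAttributeString("id", "117");
                if (raw)
                    writer.WriteRaw(text);    // written as-is, no escaping
                else
                    writer.WriteString(text); // '&', '<' and '>' are escaped
                writer.WriteEndElement();
            }
            return buffer.ToString();
        }
    }

    public static void Main()
    {
        // <Member id="117">&amp;#x00B0;</Member>
        Console.WriteLine(Write("&#x00B0;", raw: false));
        // <Member id="117">&#x00B0;</Member>
        Console.WriteLine(Write("&#x00B0;", raw: true));
    }
}
```

Because WriteRaw performs no checks, whatever is passed to it must already be well-formed XML content.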
If you use the XmlSerializer:
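using System;
using System.Web;                 // HttpUtility
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

public class GroupFile
{
    [XmlElement("Group")]
    public Group[] Groups { get; set; }
}

public class Group
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlElement("Member")]
    public Member[] Members { get; set; }
}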
[Serializable]
public class Member : IXmlSerializable
{
    // Converts a single decimal entity such as "&#176;" into its
    // hex form "&#x00b0;". Assumes the input contains exactly one entity.
    public static string DecimalToHexadecimalEncoding(string html)
    {
        var splitted = html.Split('#');
        var res = Int32.Parse(splitted[1].Replace(";", string.Empty));
        return "&#x" + res.ToString("x4") + ";";
    }

    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlIgnore]
    public string Value { get; set; }

    [XmlText]
    public string HexValue
    {
        get
        {
            // HtmlEncode yields the decimal entity; convert it to hex
            var res = HttpUtility.HtmlEncode(Value);
            res = DecimalToHexadecimalEncoding(res);
            return res;
        }
    }

    public XmlSchema GetSchema()
    {
        return null;
    }

    public void ReadXml(XmlReader reader)
    {
        var attributeValue = reader.GetAttribute("id");
        if (attributeValue != null)
        {
            Id = Int32.Parse(attributeValue);
        }
        // Here the value is directly converted to string "°".
        // ReadElementString also consumes the closing </Member> tag,
        // so no separate ReadEndElement call is needed.
        Value = reader.ReadElementString();
    }

    public void WriteXml(XmlWriter writer)
    {
        writer.WriteAttributeString("id", Id.ToString());
        // WriteRaw bypasses the writer's escaping of '&'
        writer.WriteRaw(HexValue);
    }
}
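The DecimalToHexadecimalEncoding helper above only works when Value is a single character that HtmlEncode turns into one decimal entity. Below is a hedged sketch of a more general encoder. HexEntityEncoder and Encode are hypothetical names, not part of any library, and the uppercase four-digit hex format is an assumption about what the input files use. It escapes XML markup characters and writes every non-ASCII character as a hex character reference, so the result can be handed to WriteRaw.

```csharp
using System;
using System.Text;

// Hypothetical helper: produces element content that is safe for XmlWriter.WriteRaw.
public static class HexEntityEncoder
{
    public static string Encode(string value)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;

        var sb = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];

            // Markup characters must still be escaped because WriteRaw will not do it.
            switch (c)
            {
                case '&': sb.Append("&amp;"); continue;
                case '<': sb.Append("&lt;"); continue;
                case '>': sb.Append("&gt;"); continue;
            }

            // Plain ASCII is copied unchanged.
            if (c < 0x80) { sb.Append(c); continue; }

            int codePoint = c;
            if (char.IsSurrogate(c))
            {
                if (char.IsHighSurrogate(c) && i + 1 < value.Length
                    && char.IsLowSurrogate(value[i + 1]))
                {
                    // Combine a surrogate pair into one code point.
                    codePoint = char.ConvertToUtf32(c, value[i + 1]);
                    i++;
                }
                else
                {
                    // A lone surrogate is not valid XML; substitute U+FFFD.
                    codePoint = 0xFFFD;
                }
            }

            // Assumption: uppercase hex, padded to at least four digits.
            sb.Append("&#x").Append(codePoint.ToString("X4")).Append(';');
        }
        return sb.ToString();
    }
}
```

With such a helper, WriteXml could call writer.WriteRaw(HexEntityEncoder.Encode(Value)) instead of going through HttpUtility. The sketch does not filter control characters that XML 1.0 forbids, and it encodes every non-ASCII character, including ones that were stored literally in the input file.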
You can use HSharp to deserialize HTML. HSharp is a library for analysing markup languages such as HTML quickly and easily. Install: Install-Package Obisoft.HSharp
var NewDocument = HtmlConvert.DeserializeHtml($@"
<html>
<head>
<meta charset={"\"utf-8\""}>
<meta name={"\"viewport\""}>
<title>Example</title>
</head>
<body>
<h1>Some Text</h1>
<table>
<tr>OneLine</tr>
<tr>TwoLine</tr>
<tr>ThreeLine</tr>
</table>
</body>
</html>");
Console.WriteLine(NewDocument["html"]["head"]["meta",0].Properties["charset"]);
Console.WriteLine(NewDocument["html"]["head"]["meta",1].Properties["name"]);
foreach (var Line in NewDocument["html"]["body"]["table"])
{
    Console.WriteLine(Line.Son);
}
That will output:
utf-8
viewport
OneLine
TwoLine
ThreeLine
You can also iterate over the tags in the HTML with foreach.