I have the following list -
List<string> finalMessageContent
where
finalMessageContent[0] = "<div class="mHr" id="mFID">
<div id="postedDate">11/12/2015 11:12:16</div>
</div>" // etc etc
I am trying to sort the list by a particular value located in the entries: the contents of the postedDate element.
First I created a new object and then serialized it so that the HTML elements could be parsed -
string[][] newfinalMessageContent = finalMessageContent.Select(x => new string[] { x }).ToArray();
string json = JsonConvert.SerializeObject(newfinalMessageContent);
JArray markerData = JArray.Parse(json);
I then used LINQ to try to sort with OrderByDescending -
var items = markerData.OrderByDescending(x => x["postedDate"].ToString()).ToList();
However, this fails when it tries to read the entry, with -
Accessed JArray values with invalid key value: "postedDate". Array position index expected.
Perhaps LINQ is not the way to go here, but it seemed like the most efficient option. Where am I going wrong?
First, I would not use string methods, regex or a JSON parser to parse HTML. I would use HtmlAgilityPack. Then you could provide a method like this:
// Requires: using System; using System.Globalization; using HtmlAgilityPack;
private static DateTime? ExtractPostedDate(string inputHtml, string controlID = "postedDate")
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(inputHtml);
    HtmlNode div = doc.GetElementbyId(controlID);
    DateTime? result = null;
    DateTime value;
    // Returns null when the element is missing or its text is not a valid date.
    if (div != null && DateTime.TryParse(div.InnerText.Trim(), DateTimeFormatInfo.InvariantInfo, DateTimeStyles.None, out value))
        result = value;
    return result;
}
and the following LINQ query (note that entries without a parsable date are dropped by the Where clause):
finalMessageContent = finalMessageContent
    .Select(s => new { String = s, Date = ExtractPostedDate(s) })
    .Where(x => x.Date.HasValue)
    .OrderByDescending(x => x.Date.Value)
    .Select(x => x.String)
    .ToList();
A JSON serializer works with JSON-formatted strings, not HTML.
To parse HTML I suggest using HtmlAgilityPack: https://htmlagilitypack.codeplex.com/
Like this:
HtmlAgilityPack.HtmlDocument htmlParsed = new HtmlAgilityPack.HtmlDocument();
htmlParsed.LoadHtml(finalMessageContent[0]);
List<HtmlNode> OrderedDivs = htmlParsed.DocumentNode.Descendants("div")
    .Where(a => a.Attributes.Any(af => af.Value == "postedDate"))
    .OrderByDescending(d => DateTime.Parse(d.InnerText)) // unsafe parsing: throws on invalid input
    .ToList(); // OrderByDescending alone does not return a List<HtmlNode>
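The DateTime.Parse call above is flagged as unsafe because it throws on malformed input and, without an explicit culture, depends on the current thread's culture. Below is a minimal sketch of a non-throwing alternative using DateTime.TryParseExact. The class name, the helper name TryParsePostedDate and the format string are assumptions, not part of the original code; in particular the sample value 11/12/2015 is ambiguous, so the day-first format here may need to be changed to MM/dd/yyyy.

```csharp
using System;
using System.Globalization;

public static class PostedDateParsing
{
    // Assumed format: the sample "11/12/2015 11:12:16" could equally be MM/dd/yyyy.
    private const string AssumedFormat = "dd/MM/yyyy HH:mm:ss";

    // Hypothetical helper: returns null instead of throwing when the text is not a valid date.
    public static DateTime? TryParsePostedDate(string text)
    {
        DateTime value;
        bool ok = DateTime.TryParseExact(
            (text ?? string.Empty).Trim(),
            AssumedFormat,
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out value);
        return ok ? value : (DateTime?)null;
    }
}
```

With this in place, the lambda could call PostedDateParsing.TryParsePostedDate(d.InnerText) and filter out null results before ordering.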
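I'm not sure I understood your question correctly, but did you know that you can query HTML with XPath?
// doc is an HtmlAgilityPack.HtmlDocument that has already been loaded.
// Note: SelectNodes returns null when nothing matches.
foreach (var row in doc.DocumentNode.SelectNodes("//div[@id='postedDate']"))
{
    Console.WriteLine(row.InnerText);
}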
This is just an example off the top of my head; you might have to double-check the XPath query depending on your document. You could also convert the result to an array, or parse the dates and apply other transformations to them.
Alternatively, if the HTML is not too complex, consider extracting the dates with a regex, but that would be a topic for another question.
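For what it's worth, the regex route mentioned above could look roughly like the sketch below. It only holds up if every entry has exactly the simple shape shown in the question, and it is brittle against any other markup. The class and method names are hypothetical, and the dd/MM/yyyy HH:mm:ss format is an assumption (the sample date is ambiguous).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexDateSort
{
    // Captures the text inside <div id="postedDate">...</div> (single or double quotes).
    private static readonly Regex PostedDate = new Regex(
        @"<div\s+id=[""']postedDate[""']\s*>\s*(?<date>[^<]+?)\s*</div>",
        RegexOptions.IgnoreCase);

    // Returns null when the element is missing or the date does not match the assumed format.
    public static DateTime? Extract(string html)
    {
        Match m = PostedDate.Match(html);
        DateTime value;
        if (m.Success && DateTime.TryParseExact(m.Groups["date"].Value,
                "dd/MM/yyyy HH:mm:ss", // assumed format
                CultureInfo.InvariantCulture, DateTimeStyles.None, out value))
            return value;
        return null;
    }

    // Sorts entries newest first; entries without a parsable date are dropped.
    public static List<string> SortNewestFirst(IEnumerable<string> entries)
    {
        return entries
            .Select(s => new { Html = s, Date = Extract(s) })
            .Where(x => x.Date.HasValue)
            .OrderByDescending(x => x.Date.Value)
            .Select(x => x.Html)
            .ToList();
    }
}
```

Usage would be finalMessageContent = RegexDateSort.SortNewestFirst(finalMessageContent);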
HTH