Parse (extract) content from an HTML page using .NET
I need to parse/extract information from an HTML page. Basically, I load the page as a string using System.Net.WebClient and then use HTML Agility Pack to get the content inside HTML tags (forms, labels, inputs and so on).
However, some of the content is inside a JavaScript script tag, like this:
<script type="text/javascript">
//<![CDATA[
var itemCol = new Array();
itemCol[0] = {
pid: "01010101",
Desc: "Some desc",
avail: "Available",
price: "$10.00"
};
itemCol[1] = {
pid: "01010101",
Desc: "Some desc",
avail: "Available",
price: "$10.00"
};
//]]>
</script>
So, how could I parse it into a collection in .NET? Can HTML Agility Pack help with that? I really appreciate any help.
Thanks in advance.
HAP will not parse the JavaScript for you. The most it can do is give you the contents of the script element as text.
Javascript.NET may fit the bill.
Which part of the content inside the script tag do you want, and what kind of collection are you expecting? You can always select the script tags like this:
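using System.Xml.XPath;
using HtmlAgilityPack;

HtmlDocument document = new HtmlDocument();
// downloadedHtml holds the page markup as a string, so use LoadHtml;
// Load(string) would treat the argument as a file path.
document.LoadHtml(downloadedHtml);
XPathNavigator n = document.CreateNavigator();
// Select every <script> element in the page.
XPathNodeIterator scriptTags = n.Select("//script");
foreach (XPathNavigator nav in scriptTags)
{
    string innerXml = nav.InnerXml;
    // Parse inner xml using regex
}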
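The "parse using regex" step could look roughly like the sketch below. It assumes the script always has the exact shape shown in the question (assignments of the form itemCol[n] = { key: "value", ... }; with plain double-quoted values that contain no escaped quotes or closing braces). The Item class and the ItemColParser name are hypothetical and only serve the illustration; they are not part of HAP or any other library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical record type; its properties mirror the keys used in the script.
public sealed class Item
{
    public string Pid { get; set; }
    public string Desc { get; set; }
    public string Avail { get; set; }
    public string Price { get; set; }
}

public static class ItemColParser
{
    // Matches one "itemCol[n] = { ... };" assignment and captures the text between the braces.
    private static readonly Regex ItemBlock = new Regex(
        @"itemCol\[\d+\]\s*=\s*\{(?<body>.*?)\}\s*;",
        RegexOptions.Singleline);

    // Matches one key: "value" pair inside a captured body.
    private static readonly Regex Pair = new Regex(
        @"(?<key>\w+)\s*:\s*""(?<value>[^""]*)""");

    public static List<Item> Parse(string scriptText)
    {
        var items = new List<Item>();
        foreach (Match block in ItemBlock.Matches(scriptText))
        {
            // Collect the pairs of this block; keys are compared case-insensitively.
            var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match pair in Pair.Matches(block.Groups["body"].Value))
            {
                fields[pair.Groups["key"].Value] = pair.Groups["value"].Value;
            }

            items.Add(new Item
            {
                Pid = Get(fields, "pid"),
                Desc = Get(fields, "desc"),
                Avail = Get(fields, "avail"),
                Price = Get(fields, "price")
            });
        }
        return items;
    }

    // Returns null when the script block does not define the given key.
    private static string Get(Dictionary<string, string> fields, string key)
    {
        string value;
        return fields.TryGetValue(key, out value) ? value : null;
    }
}
```

Inside the loop above you would call ItemColParser.Parse(innerXml) and merge the returned lists. This is fragile by design: if the site changes how it writes the script, or a value contains a quote or a brace, the patterns stop matching, which is the main argument for running the script in a real JavaScript engine instead.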
Using the Javascript.NET library, you can then get a collection:
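using (JavascriptContext context = new JavascriptContext())
{
    // Expose a .NET object to the script under the name "data".
    context.SetParameter("data", new MyObject());
    // Concatenate the script blocks selected earlier (scriptTags).
    StringBuilder s = new StringBuilder();
    foreach (XPathNavigator nav in scriptTags)
    {
        s.Append(nav.InnerXml);
    }
    // Copy the script's itemCol array onto the exposed object.
    s.Append(";data.item = itemCol;");
    context.Run(s.ToString());
    MyObject o = context.GetParameter("data") as MyObject;
}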
Then just define a data structure like this:
class MyObject
{
public object item { get; set; }
}