Parse (extract) content from an HTML page using .NET

I need to parse/extract information from an HTML page. Basically, I load the page as a string using System.Net.WebClient and use HTML Agility Pack to get the content inside HTML tags (forms, labels, inputs and so on).

However, some content is inside a JavaScript script tag, like this:

<script type="text/javascript">
//<![CDATA[
var itemCol = new Array();

itemCol[0] = {
    pid: "01010101",
    Desc: "Some desc",
    avail: "Available",
    price: "$10.00"
};

itemCol[1] = {
    pid: "01010101",
    Desc: "Some desc",
    avail: "Available",
    price: "$10.00"
};

//]]>
</script>

So, how could I parse it into a collection in .NET? Can HTML Agility Pack help with that? I would really appreciate any help.

Thanks in advance.

HAP will not parse the JavaScript for you. The most it can do is give you the contents of the script element.

Javascript.NET may fit the bill.

Which part of the content inside the script tag do you want, and what kind of collection are you expecting? You can always select the script tags like this:

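  // Requires: using HtmlAgilityPack; using System.Xml.XPath;
  HtmlDocument document = new HtmlDocument();
  // downloadedHtml holds the page markup as a string, so use LoadHtml (Load expects a path or stream)
  document.LoadHtml(downloadedHtml);
  XPathNavigator n = document.CreateNavigator();
  XPathNodeIterator scriptTags = n.Select("//script");

  foreach (XPathNavigator nav in scriptTags)
  {
    string innerXml = nav.InnerXml;

    // Parse inner xml using regex
  }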
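
The "parse using regex" step could look roughly like the sketch below. It assumes the script follows the exact shape shown in the question: each entry is an `itemCol[n] = { ... };` assignment whose values are double-quoted strings with no `}` or escaped quotes inside them. `ItemColParser` and `Parse` are hypothetical names made up for this illustration; they are not part of HTML Agility Pack or Javascript.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ItemColParser
{
    // Matches one "itemCol[n] = { ... }" assignment and captures the text between the braces.
    private static readonly Regex ItemRegex = new Regex(
        @"itemCol\[\d+\]\s*=\s*\{(?<body>[^}]*)\}");

    // Matches one name: "value" pair inside a captured body.
    private static readonly Regex PairRegex = new Regex(
        @"(?<key>\w+)\s*:\s*""(?<value>[^""]*)""");

    // Returns one dictionary per itemCol entry, keyed by property name.
    public static List<Dictionary<string, string>> Parse(string script)
    {
        var items = new List<Dictionary<string, string>>();
        foreach (Match item in ItemRegex.Matches(script))
        {
            var fields = new Dictionary<string, string>();
            foreach (Match pair in PairRegex.Matches(item.Groups["body"].Value))
            {
                fields[pair.Groups["key"].Value] = pair.Groups["value"].Value;
            }
            items.Add(fields);
        }
        return items;
    }
}
```

Inside the loop above you would call `ItemColParser.Parse(innerXml)` and read values such as `items[0]["price"]`. A regex like this is brittle: it breaks as soon as the script uses nested objects, single quotes, or unquoted values, in which case running the script in a real JavaScript engine is the safer route.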

Using the Javascript.NET library, you can get a collection:

  // Requires: using Noesis.Javascript; using System.Text;
  using (JavascriptContext context = new JavascriptContext())
  {
    context.SetParameter("data", new MyObject());

    StringBuilder s = new StringBuilder();

    foreach (XPathNavigator nav in scriptTags)
    {
      s.Append(nav.InnerXml);
    }

    // Copy the script's itemCol array onto the .NET object exposed as "data"
    s.Append(";data.item = itemCol;");
    context.Run(s.ToString());

    MyObject o = context.GetParameter("data") as MyObject;
  }

Then just have a data structure like this:

   class MyObject
   {
     public object item { get; set; }
   }
