ASP.NET Web Page Mirror, Replacing all relative URLs with absolute Paths

I'm trying to build an ASP.NET page that can crawl web pages and display them correctly, with all relevant HTML elements edited to include absolute URLs where appropriate.

This question has been partially answered here: https://stackoverflow.com/a/2719712/696638

Using a combination of the answer above and this blog post, http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/, I have built the following:

using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public partial class Crawler : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        Response.Clear();

        string url = Request.QueryString["path"];

        // Download the raw page; this assumes the remote page is UTF-8 encoded.
        WebClient client = new WebClient();
        byte[] requestHTML = client.DownloadData(url);
        string sourceHTML = new UTF8Encoding().GetString(requestHTML);

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceHTML);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) {
            foreach (HtmlNode link in links) {
                if (!string.IsNullOrEmpty(link.Attributes["href"].Value)) {
                    HtmlAttribute att = link.Attributes["href"];
                    string href = att.Value;

                    // ignore javascript on buttons using a tags
                    if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;

                    // resolve relative links against the URL of the crawled page
                    Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
                    if (!urlNext.IsAbsoluteUri) {
                        urlNext = new Uri(new Uri(url), urlNext);
                        att.Value = urlNext.ToString();
                    }
                }
            }
        }

        Response.Write(htmlDoc.DocumentNode.OuterHtml);

    }
}

This only replaces the href attribute for links. By expanding this, I'd like to know what the most efficient way would be to include:

  • href attribute for <a> elements
  • href attribute for <link> elements
  • src attribute for <script> elements
  • src attribute for <img> elements
  • action attribute for <form> elements

And any others people can think of?

Could these be found using a single call to SelectNodes with a monster XPath, or would it be more efficient to call SelectNodes multiple times and iterate through each collection?

The following should work:

SelectNodes("//*[@href or @src or @action]")

and then you'd have to adapt the if statement inside the loop, which currently reads only the href attribute.
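One way the adapted loop might look is sketched below. It is only an illustration, not a tested drop-in: the class name UrlRewriter, the method MakeUrlsAbsolute and the UrlAttributes array are hypothetical names introduced here, and the sketch assumes the same HtmlAgilityPack types used in the question. It checks each matched node for all three attributes, since the combined XPath does not say which one matched.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative and not part of HtmlAgilityPack.
public static class UrlRewriter
{
    // The attributes listed in the question that can carry URLs.
    private static readonly string[] UrlAttributes = { "href", "src", "action" };

    // Rewrites relative URLs in htmlDoc to absolute ones, resolved against baseUrl.
    public static void MakeUrlsAbsolute(HtmlDocument htmlDoc, string baseUrl)
    {
        Uri baseUri = new Uri(baseUrl);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes =
            htmlDoc.DocumentNode.SelectNodes("//*[@href or @src or @action]");
        if (nodes == null) return;

        foreach (HtmlNode node in nodes)
        {
            foreach (string name in UrlAttributes)
            {
                // The indexer yields null when the node lacks this attribute.
                HtmlAttribute att = node.Attributes[name];
                if (att == null || string.IsNullOrEmpty(att.Value)) continue;

                // Skip javascript: pseudo-links, as the loop in the question does.
                if (att.Value.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;

                // TryCreate avoids an exception on malformed attribute values.
                Uri resolved;
                if (Uri.TryCreate(att.Value, UriKind.RelativeOrAbsolute, out resolved)
                    && !resolved.IsAbsoluteUri)
                {
                    att.Value = new Uri(baseUri, resolved).ToString();
                }
            }
        }
    }
}
```

In the page from the question, this would presumably be called as UrlRewriter.MakeUrlsAbsolute(htmlDoc, url) just before Response.Write. A single pass over one node collection keeps the logic in one place; whether it is measurably faster than several SelectNodes calls has not been tested here.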
