简体   繁体   中英

ASP.NET Web Page Mirror, Replacing all relative URLs with absolute Paths

I'm trying to build an ASP.NET page that can crawl web pages and display them correctly with all relevant html elements edited to include absolute URLs where appropriate.

This question has been partially answered here https://stackoverflow.com/a/2719712/696638

Using a combination of the answer above and this blog post http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/ I have built the following;

public partial class Crawler : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        Response.Clear();

        string url = Request.QueryString["path"];

        WebClient client = new WebClient();
        byte[] requestHTML = client.DownloadData(url);
        string sourceHTML = new UTF8Encoding().GetString(requestHTML);

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceHTML);

        foreach (HtmlNode link in htmlDoc.DocumentNode.SelectNodes("//a[@href]")) {
            if (!string.IsNullOrEmpty(link.Attributes["href"].Value)) {
                HtmlAttribute att = link.Attributes["href"];
                string href = att.Value;

                // ignore javascript on buttons using a tags
                if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;

                Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
                if (!urlNext.IsAbsoluteUri) {
                    urlNext = new Uri(new Uri(url), urlNext);
                    att.Value = urlNext.ToString();
                }
            }
        }

        Response.Write(htmlDoc.DocumentNode.OuterHtml);

    }
}

This only replaces the href attribute for links. By expanding this I'd like to know what the most efficient way would be to include;

  • href attribute for <a> elements
  • href attribute for <link> elements
  • src attribute for <script> elements
  • src attribute for <img> elements
  • action attribute for <form> elements

And any others people can think of?

Could these be found using a single call to SelectNodes with a monster xpath or would it be more efficient to call SelectNodes multiple times and iterrate through each collection?

The following should work:

SelectNodes("//*[@href or @src or @action]")

and then you'd have to adapt the if statement below.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM