Replace relative URLs with absolute ones

I have the HTML source of a page as a string:

<html>
    <head>
          <link rel="stylesheet" type="text/css" href="/css/all.css" /> 
    </head>
    <body>
        <a href="/test.aspx">Test</a>
        <a href="http://mysite.com">Test</a>
        <img src="/images/test.jpg"/>
        <img src="http://mysite.com/images/test.jpg"/>
    </body>
</html>

I want to convert all the relative paths to absolute ones. I want the output to be:

<html>
    <head>
          <link rel="stylesheet" type="text/css" href="http://mysite.com/css/all.css" /> 
    </head>
    <body>
        <a href="http://mysite.com/test.aspx">Test</a>
        <a href="http://mysite.com">Test</a>
        <img src="http://mysite.com/images/test.jpg"/>
        <img src="http://mysite.com/images/test.jpg"/>
    </body>
</html>

Note: I want only the relative paths in that string to be converted to absolute ones. The absolute ones already in the string should not be touched; they are fine as they are. Can this be done with regex or by some other means?

Don't try to parse HTML with regex, as explained here: https://stackoverflow.com/a/1732454/932418 and https://stackoverflow.com/a/1758162/932418

Use an HTML parser like HtmlAgilityPack instead:

string html = 
@"<html>
    <head>
            <link rel=""stylesheet"" type=""text/css"" href=""/css/all.css"" /> 
    </head>
    <body>
        <a href=""/test.aspx"">Test</a>
        <a href=""http://example.com"">Test</a>
        <img src=""/images/test.jpg""/>
        <img src=""http://example.com/images/test.jpg""/>
    </body>
</html>";

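// Requires: using System; using System.IO; using HtmlAgilityPack;
StringWriter writer = new StringWriter();
Uri baseUri = new Uri("http://example.com");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// <img> carries its URL in src.
foreach (var img in doc.DocumentNode.Descendants("img"))
{
    var src = img.Attributes["src"];
    // The indexer returns null when the attribute is missing, so skip those elements.
    if (src != null)
        src.Value = new Uri(baseUri, src.Value).AbsoluteUri;
}

// <a> and <link> carry their URL in href.
foreach (var tag in new[] { "a", "link" })
{
    foreach (var node in doc.DocumentNode.Descendants(tag))
    {
        var href = node.Attributes["href"];
        if (href != null)
            href.Value = new Uri(baseUri, href.Value).AbsoluteUri;
    }
}

doc.Save(writer);

string newHtml = writer.ToString();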
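
The answer above relies on the `Uri(Uri, string)` constructor to leave absolute URLs pointing where they already point while resolving relative ones against the base. A minimal sketch of that behaviour on its own, assuming the same `http://example.com` base (the class name `UriResolutionDemo` and the sample values are illustrative only):

```csharp
using System;

class UriResolutionDemo
{
    static void Main()
    {
        var baseUri = new Uri("http://example.com");

        // A root-relative path, a path without a leading slash, and an absolute URL.
        var values = new[]
        {
            "/test.aspx",
            "images/test.jpg",
            "http://example.com/images/test.jpg"
        };

        foreach (var value in values)
        {
            // Relative values are resolved against baseUri; absolute ones are kept.
            var resolved = new Uri(baseUri, value);
            Console.WriteLine(value + " -> " + resolved.AbsoluteUri);
        }
    }
}
```

Because the constructor handles both cases, the loops do not need a separate check for whether a value is already absolute.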

Add

<base href="http://mysite.com/images/" />

to the head of the page.
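
If the page only exists as a string, the `<base>` element could be injected with HtmlAgilityPack rather than by hand. This is a rough sketch, not a definitive implementation: the base URL `http://mysite.com/` and the class name `BaseTagDemo` are assumptions, and it presumes the document already has a `<head>` element. Note that the URLs in the string itself stay relative; it is the browser that resolves them against the base.

```csharp
using System;
using HtmlAgilityPack;

class BaseTagDemo
{
    static void Main()
    {
        string html = "<html><head></head><body><img src=\"/images/test.jpg\"/></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when no <head> element exists.
        HtmlNode head = doc.DocumentNode.SelectSingleNode("//head");
        if (head != null)
        {
            // Assumed base URL; the trailing slash matters for paths without a leading slash.
            HtmlNode baseNode = HtmlNode.CreateNode("<base href=\"http://mysite.com/\" />");
            // Put <base> first so it applies to every URL that follows it in the head.
            head.PrependChild(baseNode);
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```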

Check this out, it could help you.

It is in the following format: http(s)://domain(:port)/AppPath

HttpContext.Current.Request.Url.Scheme + "://" + HttpContext.Current.Request.Url.Authority + HttpContext.Current.Request.ApplicationPath;

Or you could use:

Page.ResolveUrl("img/youFile");
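
Outside of a running ASP.NET request there is no `HttpContext`, so the scheme-plus-authority part of the expression above can be tried on a plain `Uri`. A small sketch, assuming a made-up request URL (`https://mysite.com:8080/app/page.aspx?id=1`) and an illustrative class name; `Uri.GetLeftPart` is shown as an equivalent to the manual concatenation:

```csharp
using System;

class BaseUrlDemo
{
    static void Main()
    {
        // Hypothetical request URL standing in for HttpContext.Current.Request.Url.
        var requestUrl = new Uri("https://mysite.com:8080/app/page.aspx?id=1");

        // Manual concatenation, as in the answer above (without ApplicationPath).
        string root = requestUrl.Scheme + "://" + requestUrl.Authority;

        // Built-in equivalent: everything up to and including the authority.
        string sameRoot = requestUrl.GetLeftPart(UriPartial.Authority);

        Console.WriteLine(root);     // https://mysite.com:8080
        Console.WriteLine(sameRoot); // https://mysite.com:8080
    }
}
```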

Use regular expressions for this. Here is a short example:

static void Main(string[] args)
    {
        string input = "<html>\n<head>\n<link rel=\"stylesheet\" type=\"text/css\" href=\"/css/all.css\" /> \n</head>\n<body>\n<a href=\"/test.aspx\">Test</a>\n<a href=\"http://mysite.com\">Test</a>\n<img src=\"/images/test.jpg\"/>\n<img src=\"http://mysite.com/images/test.jpg\"/>\n</body>\n</html>";
        string pattern = "((?:src|href)[\\s]*?)(?:\\=[\\s]*?[\\\"\\\'])[\\/*\\\\*]?(?!..+[s]?\\:[\\/]*)(.*?)(?:[\\s\\\"\\\'])";
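        var reg = new Regex(pattern, RegexOptions.IgnoreCase);
        // The pattern consumes the leading slash of a relative path,
        // so the prefix must end with a slash to produce a valid URL.
        string prefix = @"http://mysite.com/";
        var result = reg.Replace(input, "$1=\"" + prefix + "$2\"");
        Console.WriteLine(result);
    }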

The result is:

<html>
<head>
<link rel="stylesheet" type="text/css" href="http://mysite.com/css/all.css" /> 
</head>
<body>
<a href="http://mysite.com/test.aspx">Test</a>
<a href="http://mysite.com">Test</a>
<img src="http://mysite.com/images/test.jpg"/>
<img src="http://mysite.com/images/test.jpg"/>
</body>
</html>

Look at this function:

' Requires: Imports System.Text.RegularExpressions
Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String) As String
    Dim MainURL As New Uri(PageURL)
    ' Match href="..." and src="..." (case-insensitive).
    ' Group 1 is the attribute name, group 2 is the URL inside the quotes.
    Dim XpAttr As New Regex("(href|src)=""(.*?)""", RegexOptions.IgnoreCase)
    Return XpAttr.Replace(html,
        Function(m As Match)
            ' New Uri(base, value) resolves relative URLs against the page URL
            ' and keeps URLs that are already absolute.
            Dim NEWURL As New Uri(MainURL, m.Groups(2).Value)
            Return m.Groups(1).Value & "=""" & NEWURL.AbsoluteUri & """"
        End Function)
End Function

This works great for me. I use it on email templates. I'm using the MVC/Razor "~/" at the beginning of each link.

' Parse HTML and make relative links absolute with p_basePath.
' p_basePath and ErrorCodes are class-level members defined elsewhere (not shown).
Public Function ParseHTMLLinks(ByVal MailBodyHTML As String) As String
    ' Declare & initialize variables
    Dim strHTMLBody As String = MailBodyHTML

    ' Set regex variables 
    Dim strSrcSubMatch As String = ""
    Dim strSrcFullUrl As String = ""
    Dim srcPattern As String = "[=""]\/?([^""\s]*(\.gif|\.jpg|\.jpeg|\.png|\.css|\.js))[""\s]"
    Dim srcOptions As RegexOptions = RegexOptions.IgnoreCase
    Dim regex As Regex = New Regex(srcPattern, srcOptions)
    Dim regexSub As Regex = New Regex(srcPattern, srcOptions)
    Dim Matches As MatchCollection = regex.Matches(strHTMLBody)

    Try
        For Each Match As Match In Matches
            ' filter out absolute links
            If InStr(Match.ToString, "://") = 0 And InStr(LCase(Match.ToString), "mailto:") = 0 And InStr(LCase(Match.ToString), "javascript:") = 0 Then
                ' Remove the " at each end of relative path
                strSrcSubMatch = regexSub.Replace(Match.ToString, "$1")
                ' Concatenate the FullPath
                strSrcFullUrl = p_basePath & strSrcSubMatch
                ' Execute the replace
                strHTMLBody = Replace(strHTMLBody, "/" & strSrcSubMatch, strSrcFullUrl)
            End If
        Next

    Catch e As WebException
        'Add errors to List(Of WebException), if any.
        ErrorCodes.Add(e)
    End Try

    Return strHTMLBody 'MailBodyHTML
End Function
