Replace relative URLs with absolute ones
I have the HTML source of a page as a string:
<html>
<head>
<link rel="stylesheet" type="text/css" href="/css/all.css" />
</head>
<body>
<a href="/test.aspx">Test</a>
<a href="http://mysite.com">Test</a>
<img src="/images/test.jpg"/>
<img src="http://mysite.com/images/test.jpg"/>
</body>
</html>
I want to convert all the relative paths to absolute ones. The output should be:
<html>
<head>
<link rel="stylesheet" type="text/css" href="http://mysite.com/css/all.css" />
</head>
<body>
<a href="http://mysite.com/test.aspx">Test</a>
<a href="http://mysite.com">Test</a>
<img src="http://mysite.com/images/test.jpg"/>
<img src="http://mysite.com/images/test.jpg"/>
</body>
</html>
Note: I want only the relative paths in that string to be converted to absolute ones. The absolute URLs that are already in the string should not be touched; they are fine as they are.
Can this be done with a regex or by other means?
Don't try to parse HTML with regex, as explained here https://stackoverflow.com/a/1732454/932418 and here https://stackoverflow.com/a/1758162/932418.
Use an HTML parser such as HtmlAgilityPack instead:
string html =
@"<html>
<head>
<link rel=""stylesheet"" type=""text/css"" href=""/css/all.css"" />
</head>
<body>
<a href=""/test.aspx"">Test</a>
<a href=""http://example.com"">Test</a>
<img src=""/images/test.jpg""/>
<img src=""http://example.com/images/test.jpg""/>
</body>
</html>";
StringWriter writer = new StringWriter();
string baseUrl= "http://example.com";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Rewrite every href and src attribute, whichever element carries it (a, link, img, ...).
foreach (var node in doc.DocumentNode.Descendants())
{
foreach (var name in new[] { "href", "src" })
{
var attr = node.Attributes[name];
// Skip elements that do not have this attribute.
if (attr != null)
attr.Value = new Uri(new Uri(baseUrl), attr.Value).AbsoluteUri;
}
}
doc.Save(writer);
string newHtml = writer.ToString();
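The answer above relies on the `Uri(Uri, string)` constructor to do the actual resolution. As a minimal sketch of that behavior on its own, without HtmlAgilityPack (the class name `UriResolutionDemo`, the helper `Resolve`, and the base URL are made up for illustration):

```csharp
using System;

public static class UriResolutionDemo
{
    // Resolves a possibly relative URL against a base URL.
    // If 'url' is already absolute, the base is ignored.
    public static string Resolve(string baseUrl, string url)
    {
        return new Uri(new Uri(baseUrl), url).AbsoluteUri;
    }

    public static void Main()
    {
        const string baseUrl = "http://example.com/dir/page.html";

        // Root-relative path: resolved against the host.
        Console.WriteLine(Resolve(baseUrl, "/test.aspx"));
        // prints http://example.com/test.aspx

        // Document-relative path: resolved against the base document's folder.
        Console.WriteLine(Resolve(baseUrl, "images/test.jpg"));
        // prints http://example.com/dir/images/test.jpg

        // Already absolute: returned as it was.
        Console.WriteLine(Resolve(baseUrl, "http://example.com/images/test.jpg"));
        // prints http://example.com/images/test.jpg
    }
}
```

One thing to keep in mind: `AbsoluteUri` returns a normalized form, so an absolute value with no path such as `http://example.com` comes back as `http://example.com/`. If absolute URLs must stay byte-for-byte identical, check for them first and skip the assignment.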
Add
<base href="http://mysite.com/images/" />
to the head of the page. The browser then resolves every relative URL against it.
Check this out, it could help you. The following expression builds the application's root URL in the format http(s)://domain(:port)/AppPath:
HttpContext.Current.Request.Url.Scheme + "://" + HttpContext.Current.Request.Url.Authority + HttpContext.Current.Request.ApplicationPath;
Or you could use:
Page.ResolveUrl("img/youFile");
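The same root URL can be assembled outside a live request, which is handy for testing. This is only a sketch: `SiteRoot.Build` is a hypothetical helper, and the request URL and application path are passed in rather than read from `HttpContext`.

```csharp
using System;

public static class SiteRoot
{
    // Returns scheme://host[:port] followed by the application path,
    // without a trailing slash (so "/" as application path adds nothing).
    public static string Build(Uri requestUrl, string applicationPath)
    {
        // GetLeftPart(UriPartial.Authority) yields e.g. "https://example.com:8080".
        string authority = requestUrl.GetLeftPart(UriPartial.Authority);
        return authority + applicationPath.TrimEnd('/');
    }
}
```

For example, `SiteRoot.Build(new Uri("https://example.com:8080/shop/cart.aspx?id=1"), "/shop")` gives `https://example.com:8080/shop`. Trimming the trailing slash avoids a doubled `/` when a root-relative path such as `/images/test.jpg` is appended afterwards.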
Use regular expressions for this. Here is a short example:
static void Main(string[] args)
{
string input = "<html>\n<head>\n<link rel=\"stylesheet\" type=\"text/css\" href=\"/css/all.css\" /> \n</head>\n<body>\n<a href=\"/test.aspx\">Test</a>\n<a href=\"http://mysite.com\">Test</a>\n<img src=\"/images/test.jpg\"/>\n<img src=\"http://mysite.com/images/test.jpg\"/>\n</body>\n</html>";
string pattern = "((?:src|href)[\\s]*?)(?:\\=[\\s]*?[\\\"\\\'])[\\/*\\\\*]?(?!..+[s]?\\:[\\/]*)(.*?)(?:[\\s\\\"\\\'])";
var reg = new Regex(pattern, RegexOptions.IgnoreCase);
string prefix = @"http://mysite.com";
var result = reg.Replace(input, "$1=\""+prefix+"$2\"");
}
The result is:
<html>
<head>
<link rel="stylesheet" type="text/css" href="http://mysite.com/css/all.css" />
</head>
<body>
<a href="http://mysite.com/test.aspx">Test</a>
<a href="http://mysite.com">Test</a>
<img src="http://mysite.com/images/test.jpg"/>
<img src="http://mysite.com/images/test.jpg"/>
</body>
</html>
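If you go the regex route, one possible variation is to let the regex only locate the attribute values and leave the resolution to `System.Uri`, so document-relative paths such as `images/a.png` are handled as well. The sketch below is an assumption-laden illustration, not the method from the answer above: `UrlRewriter` and `MakeAbsolute` are invented names, and values that already carry a scheme (`http:`, `mailto:`, `javascript:`), start with `//`, or are pure `#fragment` links are returned unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlRewriter
{
    // Matches href="..." or src='...' and captures the prefix, the quote and the URL.
    private static readonly Regex AttrPattern = new Regex(
        @"(?<prefix>\b(?:href|src)\s*=\s*)(?<quote>[""'])(?<url>.*?)\k<quote>",
        RegexOptions.IgnoreCase);

    // Values that must not be rewritten: scheme-qualified, protocol-relative, or fragment-only.
    private static readonly Regex SkipPattern = new Regex(
        @"^\s*(?:[a-z][a-z0-9+.\-]*:|//|#)",
        RegexOptions.IgnoreCase);

    public static string MakeAbsolute(string html, string baseUrl)
    {
        var baseUri = new Uri(baseUrl);
        return AttrPattern.Replace(html, m =>
        {
            string url = m.Groups["url"].Value;
            if (SkipPattern.IsMatch(url))
                return m.Value; // already absolute: leave untouched

            Uri resolved;
            if (!Uri.TryCreate(baseUri, url, out resolved))
                return m.Value; // unresolvable: leave untouched

            string quote = m.Groups["quote"].Value;
            return m.Groups["prefix"].Value + quote + resolved.AbsoluteUri + quote;
        });
    }
}
```

Calling `UrlRewriter.MakeAbsolute(input, "http://mysite.com")` on the question's HTML rewrites `/css/all.css`, `/test.aspx` and `/images/test.jpg` and leaves the two `http://mysite.com...` values exactly as they were. It still shares the usual weakness of regex-based HTML handling: unquoted attribute values and `href=`-like text inside scripts or comments are not treated correctly.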
Look at this function:
Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String) As String
Dim result As String = Nothing
' Getting all Href
Dim opt As New RegexOptions
Dim XpHref As New Regex("(href="".*?"")", RegexOptions.IgnoreCase)
Dim i As Integer
Dim NewSTR As String = html
For i = 0 To XpHref.Matches(html).Count - 1
Application.DoEvents()
Dim Oldurl As String = Nothing
Dim OldHREF As String = Nothing
Dim MainURL As New Uri(PageURL)
OldHREF = XpHref.Matches(html).Item(i).Value
Oldurl = OldHREF.Replace("href=", "").Replace("HREF=", "").Replace("""", "")
Dim NEWURL As New Uri(MainURL, Oldurl)
Dim NewHREF As String = "href=""" & NEWURL.AbsoluteUri & """"
NewSTR = NewSTR.Replace(OldHREF, NewHREF)
Next
html = NewSTR
Dim XpSRC As New Regex("(src="".*?"")", RegexOptions.IgnoreCase)
For i = 0 To XpSRC.Matches(html).Count - 1
Application.DoEvents()
Dim Oldurl As String = Nothing
Dim OldHREF As String = Nothing
Dim MainURL As New Uri(PageURL)
OldHREF = XpSRC.Matches(html).Item(i).Value
Oldurl = OldHREF.Replace("src=", "").Replace("SRC=", "").Replace("""", "")
Dim NEWURL As New Uri(MainURL, Oldurl)
Dim NewHREF As String = "src=""" & NEWURL.AbsoluteUri & """"
NewSTR = NewSTR.Replace(OldHREF, NewHREF)
Next
Return NewSTR
End Function
This works great for me. I use it on email templates. I'm using the MVC/Razor "~/" at the beginning of each link.
' Parse HTML and make relative links absolute with p_basepath
Public Function ParseHTMLLinks(ByVal MailBodyHTML As String) As String
' Declare & initialize variables (p_basePath and ErrorCodes are class-level members not shown here)
Dim strHTMLBody As String = MailBodyHTML
' Set regex variables
Dim strSrcSubMatch As String = ""
Dim strSrcFullUrl As String = ""
Dim srcPattern As String = "[=""]\/?([^""\s]*(\.gif|\.jpg|\.jpeg|\.png|\.css|\.js))[""\s]"
Dim srcOptions As RegexOptions = RegexOptions.IgnoreCase
Dim regex As Regex = New Regex(srcPattern, srcOptions)
Dim regexSub As Regex = New Regex(srcPattern, srcOptions)
Dim Matches As MatchCollection = regex.Matches(strHTMLBody)
Try
For Each Match As Match In Matches
' filter out absolute links
If InStr(Match.ToString, "://") = 0 And InStr(LCase(Match.ToString), "mailto:") = 0 And InStr(LCase(Match.ToString), "javascript:") = 0 Then
' Remove the " at each end of relative path
strSrcSubMatch = regexSub.Replace(Match.ToString, "$1")
' Concatenate the FullPath
strSrcFullUrl = p_basePath & strSrcSubMatch
' Execute the replace
strHTMLBody = Replace(strHTMLBody, "/" & strSrcSubMatch, strSrcFullUrl)
End If
Next
Catch e As WebException
'Add errors to List(Of WebException), if any.
ErrorCodes.Add(e)
End Try
Return strHTMLBody 'MailBodyHTML
End Function