Regular expression for finding the 'href' value of an <a> link
I need a regex pattern for finding web page links in HTML.
I first use @"(<a.*?>.*?</a>)" to extract the links (<a>), but I can't fetch the href from that.
My strings are:
<a href="www.example.com/page.php?id=xxxx&name=yyyy"....></a>
<a href="http://www.example.com/page.php?id=xxxx&name=yyyy"....></a>
<a href="https://www.example.com/page.php?id=xxxx&name=yyyy"....></a>
<a href="www.example.com/page.php/404"....></a>
1, 2 and 3 are valid and I need them, but number 4 is not valid for me (? and = are essential).
Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format.
I need to fetch the href of the links and filter them; the URLs I want must contain ? and =, like page.php?id=5
Thanks!
I'd recommend using an HTML parser over a regex, but still, here's a regex that will create a capturing group over the value of the href attribute of each link. It will match whether double or single quotes are used.
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
You can view a full explanation of this regex here.
Snippet playground:
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});
<label> Text to match: <input type="text" name="textToMatch" value='<a href="google.com"'> <button>Match</button> </label>
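Since the question is about C#, here is a minimal sketch of how the pattern above might be applied with .NET's System.Text.RegularExpressions. The names HrefExtractor and ExtractHrefs are illustrative and not part of the answer. Group 1 holds the quote character, so the URL itself ends up in group 2.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Pattern from the answer: group 1 is the opening quote, group 2 is the href value.
    private static readonly Regex LinkRx =
        new Regex(@"<a\s+(?:[^>]*?\s+)?href=([""'])(.*?)\1", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every href value found in the given HTML.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in LinkRx.Matches(html))
        {
            result.Add(m.Groups[2].Value);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<a href=\"www.example.com/page.php?id=xxxx&name=yyyy\"></a>"
                    + "<a class='x' href='https://www.example.com/page.php/404'></a>";
        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The back-reference \1 is what lets a single pattern cover both quote styles: the closing quote must be the same character as the opening one.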
Using regex to parse HTML is not recommended.
Regex is meant for regularly occurring patterns. HTML is not regular in its format (except XHTML). For example, HTML files are valid even if you don't have a closing tag! This could break your code.
Use an HTML parser like HtmlAgilityPack.
You can use this code to retrieve all the hrefs in anchor tags using HtmlAgilityPack:
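// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// Note: SelectNodes returns null when no <a> element is found.
var hrefList = doc.DocumentNode.SelectNodes("//a")
    .Select(p => p.GetAttributeValue("href", "not found"))
    .ToList();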
hrefList contains all the href values.
I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern, when a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
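As a rough illustration, the query-string pattern can be run against the four sample links from the question. This is a sketch assuming .NET's Regex class; QueryLinkFilter and FindQueryLinks are made-up names. Note that the pattern only requires a ? inside the value; it does not check for = on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryLinkFilter
{
    // Pattern from the answer: the captured href must contain a '?' with text on both sides.
    private static readonly Regex QueryLinkRx =
        new Regex(@"<a\s+(?:[^>]*?\s+)?href=""([^""]+\?[^""]+)""");

    // Hypothetical helper: returns only the href values that carry a query string.
    public static List<string> FindQueryLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in QueryLinkRx.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        // The four sample links from the question (extra attributes omitted).
        string html =
            "<a href=\"www.example.com/page.php?id=xxxx&name=yyyy\"></a>\n" +
            "<a href=\"http://www.example.com/page.php?id=xxxx&name=yyyy\"></a>\n" +
            "<a href=\"https://www.example.com/page.php?id=xxxx&name=yyyy\"></a>\n" +
            "<a href=\"www.example.com/page.php/404\"></a>";

        // Prints the first three URLs; the fourth has no '?' and is skipped.
        foreach (string link in FindQueryLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```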
st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
st = @"<a href[^>]*>(.*?)</a>";
st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)";
st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)";
st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = @"(<a.*?>.*?</a>)";
st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = @"(http|https)://([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$";
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";
My choice is
@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"
Second, use this (the ? is escaped so that it matches a literal question mark):
st = @"(.*)\?(.*)=(.*)";
Try this:
public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        var res = Find(html);
    }

    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all matches in file.
        MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
            RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }

    public struct LinkItem
    {
        public string Href;
        public string Text;

        public override string ToString()
        {
            return Href + "\n\t" + Text;
        }
    }
}
Input:
string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";
Result:
[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
C# Scraping HTML Links
Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML.
Another easy way: you can use a WebBrowser control to get the href from the a tags, like this (see my example):
public Form1()
{
    InitializeComponent();
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
}

private void Form1_Load(object sender, EventArgs e)
{
    webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
}

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    List<string> href = new List<string>();
    foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
    {
        href.Add(el.GetAttribute("href"));
    }
}
Try this regex:
"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"
You will get more help from these discussions:
Regular expression to extract URL from an HTML link
and
Regex to get the link in href. [asp.net]
Hope it's helpful.
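A small sketch of how the href\s*=\s*(...) pattern from this answer might be used, assuming .NET's Regex class. HrefScanner and Scan are illustrative names. Both alternatives write into the same numbered group (?<1>...), so the value is read from Groups[1] whether or not it was quoted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefScanner
{
    // Pattern from the answer, written as a regular (non-verbatim) C# string.
    private static readonly Regex HrefRx = new Regex(
        "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: collects the value captured in group 1 for each match.
    public static List<string> Scan(string input)
    {
        var values = new List<string>();
        foreach (Match m in HrefRx.Matches(input))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }

    public static void Main()
    {
        // One double-quoted value and one unquoted value followed by whitespace.
        string input = "<a href=\"page.php?id=5\">x</a> <a href=plain.html >y</a>";
        foreach (string value in Scan(input))
        {
            Console.WriteLine(value);
        }
    }
}
```

One caveat with this pattern: a single-quoted value falls into the \S+ branch, so the quotes (and anything up to the next whitespace) would be part of the captured text.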
HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
public IHTMLAnchorElement imageElementHref;
imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;
Simply try this code.
I came up with this one, which supports anchor and image tags, and supports single and double quotes.
<[a|img]+\\s+(?:[^>]*?\\s+)?[src|href]+=[\"']([^\"']*)['\"]
So
<a href="/something.ext">click here</a>
Will match:
Match 1: /something.ext
And
<a href='/something.ext'>click here</a>
Will match:
Match 1: /something.ext
The same goes for img src attributes.
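One thing to keep in mind with this pattern: [a|img]+ and [src|href]+ are character classes, not alternations, so they match any run of those characters (for example, <i href="/y"> also matches). A possible stricter variant, which is my own assumption and not the author's pattern, uses non-capturing alternation groups instead. TagUrlExtractor and Extract are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagUrlExtractor
{
    // Assumed stricter variant: (?:a|img) and (?:src|href) are real alternations.
    private static readonly Regex TagUrlRx = new Regex(
        @"<(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the src/href values of <a> and <img> tags.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in TagUrlRx.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string html = "<a href=\"/something.ext\">click here</a>"
                    + "<img alt='x' src='/pic.png'>"
                    + "<i href=\"/y\">not a link</i>";
        // Prints /something.ext and /pic.png; the <i> tag is ignored.
        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```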
I took a much simpler approach. This one simply looks for href attributes and captures the value that follows (between quotes) into a group named url:
href=['"](?<url>.*?)['"]
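Because the question says the input is already a list of href="..." entries, this simple pattern can be combined with a plain string check for ? and =. The sketch below assumes .NET's Regex class; NamedGroupFilter, FilterUrls, and the "? followed later by =" rule are my assumptions about what the asker wants, not something stated in this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupFilter
{
    // Pattern from the answer, with the value captured in the named group "url".
    private static readonly Regex HrefRx = new Regex(@"href=['""](?<url>.*?)['""]");

    // Hypothetical helper: keeps only URLs that contain '?' followed later by '='.
    public static List<string> FilterUrls(string input)
    {
        var kept = new List<string>();
        foreach (Match m in HrefRx.Matches(input))
        {
            string url = m.Groups["url"].Value;
            int question = url.IndexOf('?');
            if (question >= 0 && url.IndexOf('=', question + 1) >= 0)
            {
                kept.Add(url);
            }
        }
        return kept;
    }

    public static void Main()
    {
        string input = "href=\"page.php?id=5\" href='page.php/404'";
        // Prints page.php?id=5; the second entry has no query string.
        foreach (string url in FilterUrls(input))
        {
            Console.WriteLine(url);
        }
    }
}
```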
I think in this case it is one of the simplest preg matches:
/<a\s*(.*?id[^"]*")/g
It gets links that have the variable id in the address.
The captured group starts from href (inclusive), takes all characters (. matches anything except newlines) up to and including the first occurrence of id, and then everything up to the nearest following " character ([^"]*").