Regex to scrape a link from onclick = Javascript: Newwindow()

Question

I need to scrape an https link from two kinds of HTML.

The first looks like this:

          <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf');">
this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

The second looks like this:

 <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com//webadmin/pdf/order/2018/Aug/hello this is regarding  an older document Ors._2018-08-31 12:09:12.pdf');">
    this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

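The difference between the two is the link inside newwindow1: in the second HTML the link contains several spaces, and the string pdf appears in it twice.

I want to extract the link from both of them. I am using C#: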

Regex.Match(HtmlString, @"('https[^\s]+.pdf')");

This way I can extract the link from the first HTML, but from the second HTML it extracts this:

https://hello.com//webadmin/pdf/

It starts at https and stops at pdf, but the link is not complete.

Apart from regex, please let me know whether this can be done with Html Agility Pack.

Answer 1

With HtmlAgilityPack you can parse an HTML DOM document, but you cannot use it to parse JavaScript code.
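To illustrate that division of labor, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. HtmlAgilityPack locates the anchor elements and returns the raw onclick text; the URL inside the JavaScript call still has to be pulled out separately, here with a regex on single-quoted strings. The class and method names (OnclickLinkScraper, ExtractPdfUrls) are hypothetical and not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class OnclickLinkScraper
{
    // Hypothetical helper: collects quoted .pdf URLs found in onclick attributes of <a> elements.
    public static List<string> ExtractPdfUrls(string html)
    {
        var urls = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@onclick]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The DOM parser hands back the attribute text; it does not interpret the JavaScript.
            string onclick = anchor.GetAttributeValue("onclick", string.Empty);

            // Group 1 captures the URL without the surrounding single quotes.
            Match m = Regex.Match(onclick, @"'(https[^']+\.pdf)'");
            if (m.Success)
            {
                urls.Add(m.Groups[1].Value);
            }
        }

        return urls;
    }
}
```

Restricting the regex to the onclick value, rather than the whole page, reduces the chance of matching an unrelated quoted string elsewhere in the HTML.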

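You can use a regex only if you know the code is always formatted the way it is shown in the question, i.e. if the value you need to extract is always inside single quotes. In that case, use the negated character class [^'], which matches any character except a single quote, instead of [^\s], which matches any character except whitespace.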

var url = Regex.Match(HtmlString, @"'https[^']+\.pdf'").Value;

Or, to get only the URL without the single quotes:

var url = Regex.Match(HtmlString, @"'(https[^']+\.pdf)'")?.Groups[1].Value;

Note that a dot outside a character class must be escaped in the pattern so that it matches a literal dot.
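As a quick check of the pattern against both inputs from the question, the following self-contained sketch may help. The class and method names (PdfLinkExtractor, ExtractPdfUrl) are hypothetical, and the sample strings are shortened copies of the onclick attributes shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PdfLinkExtractor
{
    // Single-quoted https URL ending in a literal ".pdf"; group 1 is the URL without the quotes.
    private static readonly Regex PdfUrl = new Regex(@"'(https[^']+\.pdf)'");

    // Hypothetical helper: returns the first quoted .pdf URL, or an empty string if none is found.
    public static string ExtractPdfUrl(string html)
    {
        Match m = PdfUrl.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string first =
            "onclick=\"javascript:newwindow1('https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf');\"";
        string second =
            "onclick=\"javascript:newwindow1('https://hello.com//webadmin/pdf/order/2018/Aug/" +
            "hello this is regarding  an older document Ors._2018-08-31 12:09:12.pdf');\"";

        // [^'] allows spaces, so the second URL is captured up to the closing quote.
        Console.WriteLine(ExtractPdfUrl(first));
        Console.WriteLine(ExtractPdfUrl(second));
    }
}
```

Because [^']+ is greedy and cannot cross a single quote, the match runs to the last .pdf before the closing quote, so the earlier /pdf/ path segment in the second link does not cut it short.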

Answer 1
1 accepted 2020-06-24 12:05:24