Regex to grab a link from onclick="javascript:newwindow1()"

Question

I need to scrape an https link from two kinds of HTML.

One looks like this:

          <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf');">
this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

The other looks like this:

 <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com//webadmin/pdf/order/2018/Aug/hello this is regarding  an older document Ors._2018-08-31 12:09:12.pdf');">
    this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

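The difference between the two is the link inside newwindow1: in the second HTML snippet the link contains several spaces, and the string pdf appears in it twice.

Now I want to extract the link from both of them. I am using C#: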

Regex.Match(HtmlString, @"('https[^\s]+.pdf')");

This way I can extract the link from the first HTML snippet, but for the second one the extraction comes out like this:

https://hello.com//webadmin/pdf/

It starts at https and stops at pdf, but the link is not complete.

Besides regex, please let me know whether this can be done with HTML Agility Pack.

Answer 1

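With HtmlAgilityPack you can parse an HTML DOM document, but you cannot use it to parse JavaScript code.

A regex is only appropriate if you know the code is always formatted the way it is shown in the question, i.e. if the value you need to extract is always inside single quotes. In that case, use the negated character class [^'], which matches any character other than a single quote, instead of [^\s], which matches any character other than whitespace.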

var url = Regex.Match(HtmlString, @"'https[^']+\.pdf'");

Or, to get only the URL without the single quotes:

var url = Regex.Match(HtmlString, @"'(https[^']+\.pdf)'")?.Groups[1].Value;
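To see the capture-group pattern at work on both snippets from the question, here is a minimal sketch. The class name PdfLinkExtractor and the method ExtractPdfUrl are made up for illustration, and the anchor text in the samples is shortened; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class PdfLinkExtractor
{
    // Opening quote, "https", any run of non-quote characters (spaces included),
    // then a literal ".pdf" right before the closing quote.
    private static readonly Regex QuotedPdfUrl = new Regex(@"'(https[^']+\.pdf)'");

    // Returns the URL without the surrounding quotes,
    // or an empty string when the pattern does not match.
    public static string ExtractPdfUrl(string html)
    {
        Match match = QuotedPdfUrl.Match(html);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string first = "<a href=\"javascript:void(0)\" onclick=\"javascript:newwindow1('https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf');\">some content</a>";
        string second = "<a href=\"javascript:void(0)\" onclick=\"javascript:newwindow1('https://hello.com//webadmin/pdf/order/2018/Aug/hello this is regarding  an older document Ors._2018-08-31 12:09:12.pdf');\">some content</a>";

        Console.WriteLine(ExtractPdfUrl(first));
        Console.WriteLine(ExtractPdfUrl(second));
    }
}
```

Because [^']+ is greedy and may cross spaces and the earlier /pdf/ path segment, the match can only end at the .pdf that sits directly before the closing quote, so the second URL comes back whole.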

Note that you should escape a dot that appears outside a character class in the pattern so that it matches a literal dot.
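As for the HtmlAgilityPack part of the question, the two approaches can be combined: the library locates the anchors and reads their onclick attribute, and the regex then pulls the quoted URL out of the JavaScript call. The following is only a sketch under some assumptions: it needs the HtmlAgilityPack NuGet package, the class name OnclickLinkScraper and the method ExtractPdfUrls are hypothetical, and it still relies on the URL always being wrapped in single quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package, not part of the .NET base class library

// Hypothetical helper, named here only for illustration.
public static class OnclickLinkScraper
{
    private static readonly Regex QuotedPdfUrl = new Regex(@"'(https[^']+\.pdf)'");

    // Collects the quoted .pdf URL from every <a> element that has an onclick attribute.
    public static List<string> ExtractPdfUrls(string html)
    {
        var urls = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@onclick]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The DOM parser gives us the attribute text; the regex handles the JavaScript inside it.
            string onclick = anchor.GetAttributeValue("onclick", string.Empty);
            Match match = QuotedPdfUrl.Match(onclick);
            if (match.Success)
            {
                urls.Add(match.Groups[1].Value);
            }
        }

        return urls;
    }
}
```

Restricting the regex to the onclick value keeps it from matching quoted URLs elsewhere in the page, such as inside script blocks or text content.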

Solution 1
1 accepted 2020-06-24 12:05:24
