
Regex to scrape a link from onclick="javascript:newwindow1()"

I need to scrape an HTTPS link from two kinds of HTML.

The first looks like this:

          <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf');">
this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

The second looks like this:

 <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com//webadmin/pdf/order/2018/Aug/hello this is regarding  an older document Ors._2018-08-31 12:09:12.pdf');">
    this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

The only difference between them is the link passed to newwindow1: in the second snippet, the link contains several spaces and the string "pdf" appears in it twice.

I want to extract the link from both of them. I am using C#:

Regex.Match(HtmlString, @"('https[^\s]+.pdf')");

This extracts the link from the first snippet, but for the second one it returns this:

https://hello.com//webadmin/pdf/

The match starts at https and stops at the first pdf, before the link has ended.

Apart from regex, please let me know whether this can be done with HtmlAgilityPack.

With HtmlAgilityPack, you can parse HTML DOM documents, but you cannot parse JavaScript code with it. It can hand you the onclick attribute value as a string, but pulling the URL out of that string still takes string handling or a regex.

You can rely on a regex only if you know the code is always formatted the way it is shown in the question, i.e. if the value you need to extract is always inside single quotes. In that case, use the negated character class [^'], which matches any character except a single quote, instead of [^\s], which matches any character except whitespace.

var url = Regex.Match(HtmlString, @"'https[^']+\.pdf'").Value;

Or, to just get the URL without single quotes:

var url = Regex.Match(HtmlString, @"'(https[^']+\.pdf)'")?.Groups[1].Value;

Note that a dot outside a character class must be escaped to match a literal dot. Left unescaped, it matches any character except a newline.
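To see the pattern work on both snippets from the question, here is a minimal sketch. The class name PdfLinkExtractor and the method ExtractPdfUrl are made up for illustration, and it assumes the URL is always wrapped in single quotes and never contains a single quote itself. Regex.Match never returns null, so the sketch checks Success instead of using a null-conditional operator.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class PdfLinkExtractor
{
    // '        opening single quote
    // (https   start of the captured URL
    // [^']+    any run of characters except a single quote (spaces included)
    // \.pdf)   literal ".pdf" ending the captured URL
    // '        closing single quote
    private static readonly Regex PdfUrl = new Regex(@"'(https[^']+\.pdf)'");

    // Returns the first quoted https ... .pdf URL found in the input,
    // or an empty string when there is no match.
    public static string ExtractPdfUrl(string html)
    {
        Match m = PdfUrl.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Calling PdfLinkExtractor.ExtractPdfUrl on the first snippet returns https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf. On the second snippet it returns the whole link, spaces included, up to the final .pdf before the closing quote. If a page can contain several such links, Regex.Matches with the same pattern would return all of them.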

