I am downloading web pages using the following code:
WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
    html = sr.ReadToEnd();
}
Then I extract the body from the downloaded HTML as follows:
int nBodyStart = html.IndexOf("<body");
int nBodyEnd = html.LastIndexOf("</body>");
String strBody = html.Substring(nBodyStart, (nBodyEnd - nBodyStart + 7));
Now I want to remove any JavaScript embedded in the body. How can I do that?
My aim is to get only the text content of the web page. Since each page may be structured differently, I am trying to remove any script tags first and then remove the remaining HTML tags using the regex below:
Regex.Replace(strBody, @"<[^>]+>|&nbsp;", "").Trim();
But I don't know how to remove the JavaScript between script tags, since a script may span one line or many.
Thanks in advance.
You can use HtmlAgilityPack:
WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
    html = sr.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
// First remove the script elements inside the body, so that script source
// does not end up in the extracted text (Where/ToList need "using System.Linq;")
document.DocumentNode.SelectSingleNode("//body").Descendants()
    .Where(n => n.Name == "script")
    .ToList()
    .ForEach(n => n.Remove());
// Then read the text with all tags removed
var result = document.DocumentNode.InnerText;
To match script tags (including the content between the pair), use the following. Pass RegexOptions.Singleline so that . also matches newlines in multi-line scripts:
<script[^>]*>(.*?)</script>
To match all HTML tags (but not the inside of the pair) you can use:
</?[a-z][a-z0-9]*[^<>]*>
I just realised you might want to remove style tags too:
<style[^>]*>(.*?)</style>
Full regular expression string here:
<script[^>]*>(.*?)</script>|<style[^>]*>(.*?)</style>|</?[a-z][a-z0-9]*[^<>]*>|<[^>]+>|&nbsp;
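As a rough sketch of how these patterns could be applied in C#, the helper below strips script and style blocks first, then the remaining tags, then decodes entities. The class and method names (HtmlTextExtractor, StripScriptsAndTags) are made up for illustration and are not part of any library. The step order and the whitespace collapsing are assumptions about what "only the contents" should mean. Regex-based stripping can still be fooled by unusual markup, so treat this as an approximation rather than a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class HtmlTextExtractor
{
    // Singleline lets '.' match newlines, so multi-line script/style bodies are covered.
    // IgnoreCase handles upper-case tags such as <SCRIPT>.
    private const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    public static string StripScriptsAndTags(string body)
    {
        // 1. Remove script and style elements together with their contents.
        string text = Regex.Replace(body, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ", Options);

        // 2. Remove every remaining tag, leaving a space so adjacent words stay apart.
        text = Regex.Replace(text, @"<[^>]+>", " ", Options);

        // 3. Decode entities such as &nbsp; and &amp; into plain characters.
        text = WebUtility.HtmlDecode(text);

        // 4. Collapse runs of whitespace (including non-breaking spaces) into one space.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Calling HtmlTextExtractor.StripScriptsAndTags(strBody) on the body extracted in the question would then return the visible text. Tags are replaced with a space rather than an empty string so that words from neighbouring elements do not run together.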