
How to remove string between two words

I am downloading web pages with the following code:

WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();

string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
  html = sr.ReadToEnd();
}

Then I extract the body as follows:

// html is the page source read above; 7 is the length of "</body>"
int nBodyStart = html.IndexOf("<body");
int nBodyEnd = html.LastIndexOf("</body>");
String strBody = html.Substring(nBodyStart, (nBodyEnd - nBodyStart + 7));
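
The Substring call above throws if either marker is missing, because IndexOf and LastIndexOf return -1 in that case. A minimal sketch of a guarded version follows; the names BodyExtractor and ExtractBody, and the choice to fall back to the whole input, are assumptions made for illustration and not part of the original post.

```csharp
using System;

static class BodyExtractor
{
    // Returns the <body ...>...</body> span of html, or the whole input
    // when either marker is missing (an illustrative fallback choice).
    public static string ExtractBody(string html)
    {
        const string closeTag = "</body>";
        int start = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        int end = html.LastIndexOf(closeTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0 || end < start)
        {
            return html; // no usable body element found
        }
        return html.Substring(start, end - start + closeTag.Length);
    }
}
```

With this helper, the extraction step could be written as String strBody = BodyExtractor.ExtractBody(html);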

Now I want to remove any JavaScript in the body. How can I do that?

My aim is to get only the text content of the web page. Since each page is structured differently, I am trying to remove any script tags first and then remove the remaining HTML tags using the regex below:

Regex.Replace(strBody, @"<[^>]+>|&nbsp;", "").Trim();

But I don't know how to remove the JavaScript between script tags, since a script may span a single line or multiple lines.

Thanks in advance.

You can use HtmlAgilityPack:

WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();

string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
  html = sr.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);

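// remove script tags inside body first, so their code does not end up in the text
// (Where/ToList require using System.Linq;)
document.DocumentNode.SelectSingleNode("//body").Descendants()
                .Where(n => n.Name == "script")
                .ToList()
                .ForEach(n => n.Remove());

// InnerText then returns the content with all tags removed
var result = document.DocumentNode.InnerText;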

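To match script tags (including everything between the opening and closing tag), use the following pattern. In .NET, pass RegexOptions.Singleline so that . also matches newlines in multi-line scripts, and RegexOptions.IgnoreCase to match uppercase tags: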

<script[^>]*>(.*?)</script>
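
As a rough sketch of how this pattern could be applied in C#, assuming the standard System.Text.RegularExpressions API (the names ScriptStripper and RemoveScripts are hypothetical): the Singleline option is what lets the pattern span multi-line scripts, which was the problem raised in the question.

```csharp
using System.Text.RegularExpressions;

static class ScriptStripper
{
    // Singleline lets '.' match newlines; IgnoreCase also covers <SCRIPT>.
    // The lazy .*? stops at the first closing tag, so separate script
    // blocks are removed one at a time.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script[^>]*>.*?</script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

This is a pattern match and not a parser, so it can be fooled, for example by a script whose string literals contain the text </script>.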

To match all HTML tags (but not the inside of the pair) you can use:

</?[a-z][a-z0-9]*[^<>]*>

I just realised you might want to remove style tags too:

<style[^>]*>(.*?)</style>


Full regular expression string here:

<script[^>]*>(.*?)</script>|<style[^>]*>(.*?)</style>|</?[a-z][a-z0-9]*[^<>]*>|<[^>]+>|&nbsp;
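
Putting the pieces together, one possible way to turn a page into plain text is sketched below. It is an illustration under a few assumptions, not a definitive implementation: HtmlTextExtractor and ToPlainText are hypothetical names, the named-tag alternative is dropped because <[^>]+> already covers it, matches are replaced with a space so that adjacent words do not run together, and entities such as &nbsp; are decoded with WebUtility.HtmlDecode instead of being deleted by the pattern.

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class HtmlTextExtractor
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // script and style blocks including their contents, then any remaining tag.
    // Alternatives are tried left to right, so whole blocks match first.
    private static readonly Regex Markup = new Regex(
        @"<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<[^>]+>", Options);

    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        string text = Markup.Replace(html, " ");
        text = WebUtility.HtmlDecode(text); // &nbsp;, &amp; etc. become characters
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

Called as HtmlTextExtractor.ToPlainText(strBody), this would take the place of the Regex.Replace call in the question.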
