How to remove the string between two words

Question

I download a web page with the following code:

WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();

string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
  html = sr.ReadToEnd();
}

Then I extract the body section from it as follows:

// "html" is the string read from the response above; 7 is the length of "</body>".
int nBodyStart = html.IndexOf("<body");
int nBodyEnd = html.LastIndexOf("</body>");
String strBody = html.Substring(nBodyStart, (nBodyEnd - nBodyStart + 7));
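
The extraction above throws if the page has no body tag, because IndexOf returns -1 and Substring rejects a negative start. A minimal sketch of a more defensive version follows; the class and method names (BodyExtractor, ExtractBody) are hypothetical, and returning the input unchanged when no body is found is an assumption about the desired behavior, not something stated in the question.

```csharp
using System;

public static class BodyExtractor
{
    // Hypothetical helper: returns the <body ...>...</body> span of an HTML string,
    // or the input unchanged when no complete body element is found (assumed fallback).
    public static string ExtractBody(string html)
    {
        const string closeTag = "</body>";

        // OrdinalIgnoreCase also finds <BODY> and </Body>.
        int start = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        int end = html.LastIndexOf(closeTag, StringComparison.OrdinalIgnoreCase);

        // Both searches return -1 on failure; this check also covers end == -1.
        if (start < 0 || end < start)
        {
            return html;
        }

        return html.Substring(start, end - start + closeTag.Length);
    }
}
```

Using closeTag.Length instead of the literal 7 keeps the offset correct if the tag string is ever changed.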

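Now I want to remove any JavaScript contained in the body. How can I do that?

My goal is to get only the content of the web page. Since every page may be built differently, I am trying to remove all script tags first and then strip the remaining HTML tags with the following regex: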

Regex.Replace(strBody, @"<[^>]+>|&nbsp;", "").Trim();

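But I don't know how to remove the JavaScript between the script tags, because a script may span several lines or sit on a single line.

Thanks in advance.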

Answer 1

You can use HtmlAgilityPack:

WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();

string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
  html = sr.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);

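// Remove script tags inside body (needs "using System.Linq;").
// ToList() snapshots the matches so the tree is not modified while it is being enumerated.
document.DocumentNode.SelectSingleNode("//body").Descendants()
                .Where(n => n.Name == "script")
                .ToList()
                .ForEach(n => n.Remove());

// Strip all remaining tags. Reading InnerText after the removal keeps
// the removed script content out of the result.
var result = document.DocumentNode.InnerText;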
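
The same idea can be wrapped in one helper that also drops style blocks and decodes entities such as &nbsp;. This is only a sketch: it assumes the HtmlAgilityPack package is referenced, the names HtmlTextExtractor and GetVisibleText are hypothetical, and falling back to the whole document when there is no body element is an assumption.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns the text of the page without script or style content.
    public static string GetVisibleText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so fall back to the root (assumed behavior).
        HtmlNode root = document.DocumentNode.SelectSingleNode("//body") ?? document.DocumentNode;

        // Collect the unwanted elements first, then remove them,
        // so the tree is not changed during enumeration.
        var unwanted = root.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (HtmlNode node in unwanted)
        {
            node.Remove();
        }

        // InnerText leaves entities encoded; DeEntitize turns them into characters.
        return HtmlEntity.DeEntitize(root.InnerText).Trim();
    }
}
```

Calling HtmlTextExtractor.GetVisibleText(html) on the downloaded string would replace both the manual body extraction and the tag-stripping regex from the question.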

Answer 2

To match script tags, including everything between the opening and closing tag, use:

<script[^>]*>(.*?)</script>

To match all HTML tags (but not their content), you can use:

</?[a-z][a-z0-9]*[^<>]*>

I just realized you probably want to remove style tags as well:

<style[^>]*>(.*?)</style>

The complete regex string:

<script[^>]*>(.*?)</script>|<style[^>]*>(.*?)</style>|</?[a-z][a-z0-9]*[^<>]*>|<[^>]+>|&nbsp;
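
In .NET the dot does not match newlines by default, so the script and style alternatives only cover multi-line blocks when RegexOptions.Singleline is set; RegexOptions.IgnoreCase additionally covers upper-case tags. Below is a minimal sketch of how the pattern might be applied. The names HtmlStripper and StripMarkup are hypothetical, and the pattern is shortened by dropping the tag-name alternative, since `<[^>]+>` already matches those tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Singleline lets '.' match '\n', so a script spread over several lines is matched as one block.
    // IgnoreCase also matches <SCRIPT> and <Style>.
    private static readonly Regex Markup = new Regex(
        @"<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<[^>]+>|&nbsp;",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: removes script blocks, style blocks, remaining tags and &nbsp;.
    public static string StripMarkup(string html)
    {
        return Markup.Replace(html, "").Trim();
    }
}
```

The script and style alternatives come first on purpose: the regex engine tries alternatives left to right, so a whole block is consumed before the generic tag alternative gets a chance to match only its opening tag. Regex-based stripping stays a heuristic; a script that contains the text `</script>` inside a string literal, for example, would end the match early.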

Answer 1
1 2013-12-09 05:21:38

Answer 2
1 Accepted 2013-12-09 05:27:16