
What is a regular expression and C# code to strip any html tag except links?

I'm creating a CLR user-defined function in SQL Server 2005 to clean up a large number of database tables.

The task is to remove almost all tags except links ('a' tags and their 'href' attributes). So I divided the problem into two stages: 1. creating a user-defined SQL Server function, and 2. creating a SQL Server script that updates all the involved tables by calling the CLR function.

For the user-defined function, and given the restricted environment, I prefer to do this with native libraries. That means not using the Html Agility Pack, for example.

In JavaScript, this regular expression apparently does the right job:

 <\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>

At least, according to http://www.pagecolumn.com/tool/regtest.htm

But I don't know how to translate that (especially the capturing groups part) into C# code so that the captured text becomes part of the output.

For instance, if the input is <a href="http://example.com">some text</a>, how do I save the text "http://example.com" and "some text" as part of the output in C# code, while at the same time stripping any other possible html tag (and their content)?

Not quite as bomb-proof as Jordan's, but an example using Matches instead:

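// Requires: using System; using System.Text.RegularExpressions;
// [^>]* and [^""]* stay inside one tag and one attribute value, and the lazy .*?
// stops at the first </a>, so several links on one line are matched separately.
var pattern = @"<a\b[^>]*\bhref=""(?<url>[^""]*)""[^>]*>(?<name>.*?)</a>";
var matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    // Named groups are looked up by name on each match.
    var groups = match.Groups;
    Console.WriteLine("{0}, {1}", groups["url"].Value, groups["name"].Value);
}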
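One way to carry the captured groups through to the output is Regex.Replace with a MatchEvaluator. The following is only a sketch under several assumptions: LinkOnlyHtml, Strip and KeepAnchors are hypothetical names, href values are quoted, no attribute value contains '>', and only the tags themselves are dropped (the text inside removed tags is kept). It uses nothing beyond System.Text.RegularExpressions, so it should be compatible with the native-libraries constraint from the question.

```csharp
using System.Text.RegularExpressions;

public static class LinkOnlyHtml
{
    // Any single tag: opening, closing, or self-closing (naive: assumes no '>' inside attributes).
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // An opening <a> tag with a quoted href; \b keeps <abbr>, <area>, ... from matching.
    private static readonly Regex AnchorOpen = new Regex(
        @"^<\s*a\b[^>]*?\bhref\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>[^>]*>$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // A closing </a> tag.
    private static readonly Regex AnchorClose = new Regex(
        @"^<\s*/\s*a\s*>$", RegexOptions.IgnoreCase);

    // Called once per tag: rebuild links with only their href, drop everything else.
    private static string KeepAnchors(Match tag)
    {
        Match open = AnchorOpen.Match(tag.Value);
        if (open.Success)
            return "<a href=\"" + open.Groups["url"].Value + "\">";
        if (AnchorClose.IsMatch(tag.Value))
            return "</a>";
        return string.Empty;
    }

    public static string Strip(string input)
    {
        return AnyTag.Replace(input, new MatchEvaluator(KeepAnchors));
    }
}
```

Calling LinkOnlyHtml.Strip on a fragment such as <p><a class='x' href="http://example.com">some text</a></p> keeps the link with its href and removes the surrounding tags and the extra attribute. A named method is used instead of a lambda so the sketch also compiles with older C# compilers.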

Your regular expression is completely wrong:

<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>
      ↑            ↑
      1.           2.
  1. [^>] matches exactly one arbitrary character other than '>', so this causes <aa... , <ab... , <ac... etc. to match too.
  2. The greedy .* causes you to overmatch. For example, consider this input:

     <a href='/one'>One</a> <a href='/two'>Two</a>
             ├───────────────────────────┤ ├─┤
                        group 1            grp2
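Both problems can be reproduced in C# with a few lines. The sketch below is illustrative only: OvermatchDemo, FirstMatchGroups and the Tightened pattern are assumptions added here, not a fix proposed in the answer. Tightened adds a word boundary after the tag name and replaces the greedy (.*) with a quoted-value match, which is one plausible way to avoid both issues for simple, well-formed links.

```csharp
using System.Text.RegularExpressions;

public static class OvermatchDemo
{
    // The pattern from the question, unchanged.
    public const string Original = @"<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>";

    // Hypothetical tightened variant: \b rejects <aa, <ab, ...;
    // [^""']* keeps group 1 inside the quoted href value.
    public const string Tightened =
        @"<\s*a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)<\s*/\s*a\s*>";

    // Returns capture groups 1 and 2 of the first match of pattern in input.
    public static string[] FirstMatchGroups(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return new string[] { m.Groups[1].Value, m.Groups[2].Value };
    }
}
```

With the two-link input above, Original yields '/one'>One</a> <a href='/two' as group 1 and Two as group 2, while Tightened yields /one and One for the first link.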

In the end, I made a separate .NET console program that combines HtmlAgilityPack (HAP) with querying SQL Server from there. In the program I used a naive regular expression to isolate the fragments, used HAP to retrieve the href and anchor texts, and then did a final composition that stripped out any characters other than text, numbers, and some punctuation.
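The final character-stripping step could look roughly like the sketch below. The exact set of allowed punctuation is not given above, so the character class here is a guess, and TextCleaner and KeepAllowed are hypothetical names; the HAP part of the program is not shown.

```csharp
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Anything that is not a letter, a digit, whitespace, or one of a few
    // punctuation marks (an assumed whitelist, adjust as needed).
    private static readonly Regex Disallowed =
        new Regex(@"[^\p{L}\p{N}\s.,:;/?&=%#_\-]");

    // Removes every disallowed character and leaves the rest untouched.
    public static string KeepAllowed(string text)
    {
        return Disallowed.Replace(text, string.Empty);
    }
}
```

Because \p{L} and \p{N} are Unicode categories, accented letters and non-Latin text survive, while markup characters such as < and > are removed.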
