
Regular expression to strip everything between anchor tags

I am trying to strip out all the links, and the text between the anchor tags, from an HTML string, as below:

 string LINK_TAG_PATTERN = "/<a\b[^>]*>(.*?)<\\/a>";

 htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty);

This is not working. Does anyone have any idea why?

Thanks a lot,

Edit: the regex was taken from this link: Extract text and links from HTML using Regular Expressions

Use an HTML parser, not regular expressions, to parse HTML.

HTML Agility Pack
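
For illustration, the parser-based approach might look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (`AnchorRemover`, `RemoveAnchors`) are hypothetical and not taken from the question.

```csharp
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class AnchorRemover
{
    // Removes every <a> element, including its inner text, from the markup.
    public static string RemoveAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors != null)
        {
            // Copy to a list first so that removal does not disturb the iteration.
            foreach (var anchor in anchors.ToList())
            {
                anchor.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `AnchorRemover.RemoveAnchors(htmltext)` would then take the place of the `Regex.Replace` call in the question.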

Problems in your string: an unnecessary slash at the beginning (that's Perl syntax), an unescaped backslash ( \b ), and an unnecessarily escaped backslash ( \\ ).

So, if it has to be a regex, and taking into account all the warnings that enough other people have already linked to, try:

string LINK_TAG_PATTERN = @"<a\b[^>]*>(.*?)</a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty, RegexOptions.IgnoreCase);

The \b is necessary to prevent other tags that start with "a" (such as <abbr>) from matching.
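
To make the two points above concrete (literal escaping and the role of \b), here is a small sketch that runs three variants of the pattern on the same input. The class name `AnchorRegexDemo`, the constant names, and the sample HTML are made up for illustration. Only `WithBoundary` corresponds to the pattern suggested in this answer.

```csharp
using System.Text.RegularExpressions;

public static class AnchorRegexDemo
{
    // Verbatim literal: \b reaches the regex engine as a word-boundary token.
    public const string WithBoundary = @"<a\b[^>]*>(.*?)</a>";

    // Contrast only: without \b, "<a" also matches the start of "<abbr".
    public const string WithoutBoundary = @"<a[^>]*>(.*?)</a>";

    // Regular literal: the C# compiler turns "\b" into a backspace character
    // (U+0008), so the pattern looks for "<a" followed by a backspace.
    public const string RegularLiteral = "<a\b[^>]*>(.*?)</a>";

    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Given the input `<abbr title="x">HTML</abbr> and <A HREF="y">a link</A>.`, `WithBoundary` removes only the anchor and leaves the `<abbr>` element alone. `WithoutBoundary` starts matching at `<abbr` and removes everything up to `</A>`, leaving just the final period. `RegularLiteral` matches nothing and returns the input unchanged.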

I recommend Expresso for troubleshooting regular expressions. You can find a library of regular expressions here.

You might consider using JavaScript to walk the DOM tree for your replacements instead of a regex.

string LINK_TAG_PATTERN = @"(<a\s+[^>]*>)(.*?)(</a>)";

htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, "$1$3", RegexOptions.IgnoreCase);

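Conceptually, this would only strip very specific links. For example, your regex doesn't match a capitalized A, which is perfectly valid in HTML: <A ...>bla</A>. Is it also OK for the replacement not to work on JavaScript links? Is your code related to user security?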
