Regular expression to strip everything between anchor tags
I am trying to strip out all the links, and the text between anchor tags, from an HTML string as below:
string LINK_TAG_PATTERN = "/<a\b[^>]*>(.*?)<\\/a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty);
This is not working. Does anyone have an idea why?
Thanks a lot,
Edit: the regex was taken from this link: "Extract text and links from HTML using Regular Expressions".
Use an HTML parser, not regular expressions, to parse HTML.
Problems in your string: an unnecessary slash at the beginning (that's Perl syntax), an unescaped backslash (\b, which a regular C# string literal turns into a backspace character instead of a regex word boundary), and an unnecessarily escaped backslash (\\).
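To make the problems listed above concrete, here is a minimal sketch that compares the pattern from the question with a verbatim-string version. The class name LiteralDemo and the sample input are made up for illustration and are not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralDemo
{
    // Pattern exactly as written in the question: a regular string literal,
    // so "\b" becomes the backspace character (U+0008) and the leading "/"
    // is a literal slash that the input would have to contain.
    public const string Original = "/<a\b[^>]*>(.*?)<\\/a>";

    // Verbatim string literal: backslashes reach the regex engine unchanged,
    // so \b is interpreted as a word boundary.
    public const string Fixed = @"<a\b[^>]*>(.*?)</a>";

    public static void Main()
    {
        // Hypothetical sample input.
        string html = "before <a href=\"x\">link</a> after";

        // Character code of the fourth character of the original pattern.
        Console.WriteLine((int)Original[3]);

        // Does each pattern find the anchor at all?
        Console.WriteLine(Regex.IsMatch(html, Original));
        Console.WriteLine(Regex.IsMatch(html, Fixed));

        // Result of stripping the anchor with the fixed pattern.
        Console.WriteLine(Regex.Replace(html, Fixed, string.Empty));
    }
}
```

The original pattern never matches here because the input contains neither a slash directly before the tag nor a backspace character after the "a".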
So, if it has to be a regex, and keeping in mind all the warnings that plenty of other people have linked to, try:
string LINK_TAG_PATTERN = @"<a\b[^>]*>(.*?)</a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty, RegexOptions.IgnoreCase);
The \b is necessary to prevent other tags that start with "a" from matching.
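As a rough illustration of why the word boundary matters, the sketch below runs the pattern with and without \b over input that also contains an abbr tag. The names BoundaryDemo and Strip, and the sample HTML, are assumptions made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Pattern from the answer: \b requires the tag name to end after "a".
    public const string WithBoundary = @"<a\b[^>]*>(.*?)</a>";

    // Same pattern without \b, for comparison only.
    public const string WithoutBoundary = @"<a[^>]*>(.*?)</a>";

    // Hypothetical helper: removes every match, ignoring case so that
    // <A ...>...</A> is handled as well.
    public static string Strip(string html, string pattern) =>
        Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Hypothetical sample input: an <abbr> element followed by an upper-case anchor.
        string html = "<abbr title=\"x\">HTML</abbr> and <A HREF=\"y\">a link</A>";

        // With \b only the anchor is removed; the <abbr> element survives.
        Console.WriteLine("[" + Strip(html, WithBoundary) + "]");

        // Without \b the match starts at <abbr ...> and runs to the closing </A>,
        // so the whole input is removed.
        Console.WriteLine("[" + Strip(html, WithoutBoundary) + "]");
    }
}
```

Note that the pattern still assumes each anchor sits on one line, because "." does not match a newline unless RegexOptions.Singleline is set.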
string LINK_TAG_PATTERN = @"(<a\s+[^>]*>)(.*?)(</a>)";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, "$1$3", RegexOptions.IgnoreCase);
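Unlike the earlier pattern, this variant keeps the opening and closing tags (groups 1 and 3) and drops only the text between them. A small sketch of that behaviour follows; the class name KeepTagsDemo, the helper name and the sample strings are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeepTagsDemo
{
    // Pattern from the snippet above: group 1 is the opening tag,
    // group 2 the inner text, group 3 the closing tag.
    public const string Pattern = @"(<a\s+[^>]*>)(.*?)(</a>)";

    // Hypothetical helper: re-emits groups 1 and 3, discarding group 2.
    public static string StripInnerText(string html) =>
        Regex.Replace(html, Pattern, "$1$3", RegexOptions.IgnoreCase);

    public static void Main()
    {
        // The anchor keeps its tags and attributes but loses its text.
        Console.WriteLine(StripInnerText("see <a href=\"x\">link text</a> here"));

        // \s+ requires whitespace after "<a", so an anchor without
        // attributes is left untouched.
        Console.WriteLine(StripInnerText("<a>bare</a>"));
    }
}
```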
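Conceptually, this only strips very specific links (for example, your regex does not match an uppercase A, which is perfectly valid in HTML: <A ...>bla</A>). The replacement may also be ineffective for JavaScript links. Is your code relevant to user security?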