
Regular expression to get html without comments

I need to get some HTML out of a webpage. The webpage contains comments, and I need to get the HTML that sits between them. I hope the example below helps. I need it to be done in C#.

<!--get html from here-->
<div><p>some text in a tag</p></div>
<!--get html from here-->

I want it to return

<div><p>some text in a tag</p></div>

How would I do this?

What about finding the index of the first delimiter and the index of the second delimiter, then "cropping" the string in between? That sounds much simpler and may be just as effective.

Regexes are not ideal for HTML. If you really do want to process the HTML in all its glory, consider HtmlAgilityPack, as discussed in the question "Looking for C# HTML parser".

The Simplest Thing That Could Possibly Work is:

string pageBuffer = ...;
string wrapping = "<!--get html from here-->";
// The content starts just past the first marker.
int firstHitIndex = pageBuffer.IndexOf(wrapping) + wrapping.Length;
// The content ends where the second marker begins.
return pageBuffer.Substring(firstHitIndex, pageBuffer.IndexOf(wrapping, firstHitIndex) - firstHitIndex);

(with error checking that both markers are present)
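The error checking mentioned above could look roughly like the following. This is only a sketch: the class name MarkerExtractor and the method TryExtractBetween are hypothetical names chosen for illustration, and it assumes the same marker text is used on both sides of the content.

```csharp
using System;

public static class MarkerExtractor
{
    // Hypothetical helper: copies the text between the first two occurrences
    // of marker into html. Returns false when fewer than two markers exist.
    public static bool TryExtractBetween(string pageBuffer, string marker, out string html)
    {
        html = string.Empty;

        int firstMarker = pageBuffer.IndexOf(marker, StringComparison.Ordinal);
        if (firstMarker < 0)
        {
            return false; // no opening marker
        }

        // Start searching for the second marker just past the first one.
        int contentStart = firstMarker + marker.Length;
        int secondMarker = pageBuffer.IndexOf(marker, contentStart, StringComparison.Ordinal);
        if (secondMarker < 0)
        {
            return false; // no closing marker
        }

        html = pageBuffer.Substring(contentStart, secondMarker - contentStart);
        return true;
    }
}
```

When the markers sit on their own lines, as in the question, the extracted text keeps the surrounding line breaks, so the caller may want to call Trim() on the result.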

Depending on your context, WatiN might be useful (not if you're on a server, but if you're on the client side and doing something more interesting that could benefit from full HTML parsing).

If all the instances are similarly formatted, an expression like this

<!--.*?-->(.*?)<!--.*?-->

would retrieve everything between two comments; the lazy quantifiers stop each part at the nearest --> or <!--. If the "get html from here" text in your comments is well defined, you could be more specific:

<!--get html from here-->(.*)<!--get html from here-->

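When you run the Regex over the string, the Groups collection of the match will contain the HTML between the comments (in Groups[1]). Because . does not match a newline by default, pass RegexOptions.Singleline if the markers and the HTML are on separate lines, as in the question's example.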
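As a rough illustration of reading the Groups collection, here is a minimal sketch. The class name CommentRegexExtractor and the method Extract are hypothetical, and the capture group uses a lazy (.*?) instead of the greedy (.*) so that a page with several marker pairs stops at the nearest closing marker; that swap is an assumption about the desired behaviour.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentRegexExtractor
{
    // Singleline makes '.' match newlines, so content spanning lines is captured.
    private static readonly Regex BetweenMarkers = new Regex(
        "<!--get html from here-->(.*?)<!--get html from here-->",
        RegexOptions.Singleline);

    // Hypothetical helper: returns the trimmed HTML between the first marker
    // pair, or an empty string when no pair is found.
    public static string Extract(string page)
    {
        Match match = BetweenMarkers.Match(page);

        // Groups[0] is the whole match; Groups[1] is the first capture group.
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }
}
```

Calling CommentRegexExtractor.Extract on the sample from the question should give back the div element without the surrounding comments.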

I ran into a similar requirement: stripping out HTML comments. I was looking for a regular-expression-based solution that would work out of the box with free-style comments containing any type of characters.

I tried the expression below, and it worked perfectly for single-line comments, multi-line comments, and comments with Unicode characters and symbols.

<!--[\u0000-\u2C7F]*?-->
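One way this expression might be applied in C# is with Regex.Replace, as in the sketch below. The class name CommentStripper and the method Strip are hypothetical. Note that the character range ends at U+2C7F, so a comment containing characters above that (CJK text, for example) would not be matched; [\s\S]*? is a common alternative if that matters.

```csharp
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // The range \u0000-\u2C7F includes newlines, so multi-line comments match
    // without RegexOptions.Singleline. The lazy *? stops at the first "-->".
    private static readonly Regex HtmlComment = new Regex(@"<!--[\u0000-\u2C7F]*?-->");

    // Hypothetical helper: removes every matching comment from the input.
    public static string Strip(string html)
    {
        return HtmlComment.Replace(html, string.Empty);
    }
}
```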

