
Recursive regex for div tags (not trying to parse html with regex)

I have a bunch of wiki markup. Sometimes people drop arbitrary HTML into the middle of it, and Wikipedia somehow just rolls with it, as it does with all kinds of other badly formed wiki markup. I want to match everything inside the divs.

I need to recursively find all the <div>blah</div> tags, including div tags that have other div tags nested inside them. I am trying to match the div tags and everything inside them. I have this, which I believe almost works:

new Regex(@"\<div.*?\> (?<DEPTH>)                   # opening 
            (?>                # now match...
               [^(\<div.*?\>)(\<\/div\>)]+          # any characters except divs
            |                  # or
               \<div.*?\>  (?<DEPTH>)  # an opening div, increasing the depth counter
            |                  # or
               \<\/div\>  (?<-DEPTH>) # a closing div, decreasing the depth counter
            )*                 # any number of times
            (?(DEPTH)(?!))     # until the depth counter is zero again
          \<\/div\>                   # then match the closing tag",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

Maybe I should be using another approach to parse this, but at this point this is the last regex I need.

Here is an example:

<div class="infobox sisterproject" style="font-size: 90%; padding: .5em 1em 1em 1em;">
<div style="text-align:center;">
Find more about '''{{{display|{{{1|{{PAGENAME}}}}}}}}''' on Wikipedia's [[Wikipedia:Wikimedia sister projects|sister projects]]:
</div><!--
-->{{#ifeq:{{{wikt}}}|no||<!--
-->[[File:Wiktionary-logo-en.svg|25px|link=wikt:Special:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Search Wiktionary]] [[wikt:Special:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Definitions]] from Wiktionary<br />}}<!--
-->{{#ifeq:{{{b}}}|no||<!--
-->[[File:Wikibooks-logo.svg|25px|link=b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Search Wikibooks]] [[b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Textbooks]] from Wikibooks<br />}}<!--
-->{{#ifeq:{{{q}}}|no||<!--
-->[[File:Wikiquote-logo.svg|25px|link=q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Search Wikiquote]] [[q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Quotations]] from Wikiquote<br />}}<!--
-->{{#ifeq:{{{s}}}|no||{{#ifeq:{{{author|no}}}|yes|<!--
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />|<!--
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />}}}}<!--
-->{{#ifeq:{{{commons}}}|no||<!--
-->[[File:Commons-logo.svg|25px|link=commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Search Commons]] [[commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Images and media]] from Commons<br />}}<!--
-->{{#ifeq:{{{n}}}|no||<!--
-->[[File:Wikinews-logo.svg|25px|link=n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|Search Wikinews]] [[n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|News stories]] from Wikinews<br />}}<!--
-->{{#ifeq:{{{v}}}|no||<!--
-->[[File:Wikiversity-logo-Snorky.svg|25px|link=v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Search Wikiversity]] [[v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Learning resources]] from Wikiversity<br />}}<!--
-->{{#ifeq:{{{species<includeonly>|no</includeonly>}}}|no||<!--
-->[[File:Wikispecies-logo.svg|25px|link=species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|Search Wikispecies]] [[species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}]] from Wikispecies}}
</div><noinclude>

Thanks

I don't think parsing HTML with a regex is a good idea; you can use Html Agility Pack instead.
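
To make the Html Agility Pack suggestion concrete, here is a minimal sketch. It assumes the third-party HtmlAgilityPack NuGet package is installed. The class name, the sample string, and the XPath filter for outermost divs are illustrative choices, not something taken from the question.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class OuterDivDemo
{
    public static void Main()
    {
        // Hypothetical sample input: wiki text with nested and sibling divs.
        string wiki = "intro <div class=\"a\">x <div>y</div> z</div> mid <div>w</div> end";

        var doc = new HtmlDocument();
        doc.LoadHtml(wiki);

        // Select only divs that are not nested inside another div.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var outerDivs = doc.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
        if (outerDivs == null) return;

        foreach (HtmlNode div in outerDivs)
        {
            // OuterHtml includes the div tag itself and everything inside it.
            Console.WriteLine(div.OuterHtml);
        }
    }
}
```

How the parser treats template syntax such as `{{{1|{{PAGENAME}}}}}` or unclosed tags in real wiki text would need to be checked against actual pages.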

 new Regex(@"<div\b[^>]*>(?><div\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>", RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
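
The following sketch shows a balancing-group pattern of this kind applied to a small input. It is a variant, not the exact expression above: the catch-all `.?` is swapped for a single character guarded by a negative lookahead, on the assumption that a div tag should never be consumed one character at a time. The class name, field name, and sample string are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivMatchDemo
{
    // DEPTH acts as a stack: one entry per nested <div> that is still open.
    public static readonly Regex DivBlock = new Regex(@"
        <div\b[^>]*>                    # outermost opening tag
        (?>
            <div\b[^>]*> (?<DEPTH>)     # nested opening tag, push onto DEPTH
          |
            </div> (?<-DEPTH>)          # nested closing tag, pop from DEPTH
          |
            (?! </?div\b ) .            # any character that does not start a div tag
        )*
        (?(DEPTH)(?!))                  # fail if a nested div is still open
        </div>                          # closing tag of the outermost div",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static void Main()
    {
        string wiki = "intro <div class=\"a\">x <div>y</div> z</div> mid <div>w</div> end";

        foreach (Match m in DivBlock.Matches(wiki))
        {
            Console.WriteLine(m.Value);
        }
        // prints:
        // <div class="a">x <div>y</div> z</div>
        // <div>w</div>
    }
}
```

The pop `(?<-DEPTH>)` fails when the stack is empty, so the loop stops at the `</div>` that closes the outermost div and leaves it for the final `</div>` in the pattern.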

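In the time it took me to fix the expression, I would not have been even halfway through getting Html Agility Pack set up and working.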


 